[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107285#comment-14107285
]
Raul Castro Fernandez commented on SAMZA-390:
---------------------------------------------
Wrapping up early discussions with Chris about high-level languages for Samza:
Some options for High-Level Languages:
----------------------------------------------------
We discussed two main options for a high-level language for Samza:
- SQL-like
- Pig/Imperative style
Below I quickly summarize both approaches:
#SQL-like
SQL is primarily intended for accessing static data. Because Samza reads and
processes streams, which are infinite, plain SQL is not enough. For processing
streams there are mainly two families of streaming-SQL dialects, CQL and
StreamBase's StreamSQL. The fundamental difference between the two lies in
their window semantics:
###CQL
In this model, tuples (events, messages) are assumed to carry a timestamp, and
the processing engine treats windows as time-based, e.g. "Count(A.b) over a
window of 30 sec", where the window is defined by the timestamps included in
the events.
###StreamSQL
In this model, tuples (events, messages) also include a timestamp. However, the
model is better suited to tuple-based windows, of the form "Count(A.b) over a
window of 1000 tuples".
I think it is important to make this distinction, because it fundamentally
changes how one should think about processing streams once windows are
required. There is a good paper that explains the differences between the two
models in detail, and that I think would be worth reading before deciding on
one, once there are clear use cases for Samza:
"Towards a Streaming SQL Standard": http://cs.brown.edu/~ugur/streamsql.pdf
#Pig/Imperative style
In this case, the idea would be to offer an interface similar to Pig: an
imperative-style way of writing transformations on streams. This model could
include a fixed set of operators of different categories: operators to
extract/read data from the underlying system (e.g. Kafka topics), stateless
operators to perform filtering/map/etc., and stateful operators or stateful
constructs (basic operators that take two inputs, the input stream and the
state) that operate on state defined by users.
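A rough sketch of such a fixed operator set (hypothetical operator names, plain Python generators; not a proposed Samza API): stateless operators are plain functions over the stream, while the stateful operator takes the stream plus its state as a second input:

```python
def read(topic_data):
    """Source operator: stands in for reading messages from a Kafka topic."""
    for msg in topic_data:
        yield msg

def filter_op(stream, pred):
    """Stateless operator: drop messages that fail the predicate."""
    return (m for m in stream if pred(m))

def map_op(stream, fn):
    """Stateless operator: transform each message."""
    return (fn(m) for m in stream)

def counter(stream, state):
    """Stateful operator: two inputs (stream + user-defined state dict);
    emits a running count per key."""
    for key in stream:
        state[key] = state.get(key, 0) + 1
        yield (key, state[key])

state = {}
pipeline = counter(map_op(filter_op(read(["a", "b", "a", "c", "a"]),
                                    lambda m: m != "c"),
                          str.upper),
                   state)
print(list(pipeline))  # [('A', 1), ('B', 1), ('A', 2), ('A', 3)]
```

Each operator composes with the next, so a program in this style already describes a dataflow graph.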
Implementation:
---------------------
#SQL
A declarative representation of a query demands three things:
1- Parse the query
2- Build an execution tree (that would roughly map to the dataflow graph)
3- Optimize (rewriting the query to preserve the semantics but make it cheaper)
Apache Optiq provides all three; however, there are a number of things that
should be addressed before assuming it is a complete solution:
1- Parse the query
Queries for Samza would include windows, so the parser would need to understand
window semantics as well.
2- Execution tree
This would probably stay more or less the same
3- Optimization
This changes a lot. First of all, windows introduce new challenges when one
wants to rewrite a query; pushing an operator upstream can change the
semantics, for example. There are a number of theoretical and more practically
oriented papers on this, but it looks like a big problem. All I am saying is
that before we even start thinking about optimization, it would be good to have
a clear understanding of the requirements. On top of that, the nature of stream
processing may lead to network bottlenecks, which would also need to be
accounted for in the optimization algorithm. And if you start thinking about
joins... very interesting!
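As a toy illustration (plain Python, invented data) of why such rewrites are risky with windows: pushing a filter upstream past a tuple-based count window changes which tuples share a window, and therefore changes the results:

```python
def tuple_windows(stream, size):
    """Split a stream into consecutive tuple-based windows."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]

stream = [1, 2, 3, 4, 5, 6]
keep = lambda x: x % 2 == 0  # filter: keep even values

# Plan A: form windows of 3 tuples first, then filter inside each window.
plan_a = [len([x for x in w if keep(x)]) for w in tuple_windows(stream, 3)]

# Plan B: "optimized" plan that pushes the filter upstream of the window.
filtered = [x for x in stream if keep(x)]
plan_b = [len(w) for w in tuple_windows(filtered, 3)]

print(plan_a)  # [1, 2] -- even values per original window [1,2,3], [4,5,6]
print(plan_b)  # [3]    -- window boundaries are now drawn over filtered tuples
```

For time-based windows this particular rewrite would be safe (timestamps are unchanged by filtering), which is exactly why the optimizer must understand the window model.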
#Imperative style
In this case we would need to design a language and write a parser and compiler
for it. That alone is a lot of work. There is one alternative: writing a DSL.
By using a DSL, we get rid of many of the parser/optimizer problems. One could
argue that the problem with DSLs is efficiency, but that is not very important
in this case, where we will have an "optimization" phase anyway.
So these are the two options that I see for a DSL:
##Internal DSL
Meaning that it would basically be part of the parent language, i.e. a library.
This is good for fast prototyping, trying new operators, etc. I think it could
easily be adopted by building the appropriate tooling around the DSL.
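A minimal sketch of what an internal DSL could feel like, as a fluent library in the host language (the class and method names are invented for illustration, not a proposed Samza API). Each call just records an operator, building a dataflow description that a later compile/optimize phase would turn into jobs:

```python
class Stream:
    """Internal-DSL builder: programs are ordinary host-language expressions
    that accumulate a list of (operator, function) pairs."""
    def __init__(self, source, ops=None):
        self.source = source
        self.ops = ops or []

    def filter(self, pred):
        return Stream(self.source, self.ops + [("filter", pred)])

    def map(self, fn):
        return Stream(self.source, self.ops + [("map", fn)])

    def run(self, data):
        """Stand-in for the compile/execute phase: interpret the recorded ops."""
        out = data
        for kind, f in self.ops:
            out = [x for x in out if f(x)] if kind == "filter" \
                  else [f(x) for x in out]
        return out

job = Stream("kafka://clicks").filter(lambda m: m > 10).map(lambda m: m * 2)
print([op for op, _ in job.ops])  # ['filter', 'map']
print(job.run([5, 11, 20]))       # [22, 40]
```

Because the dataflow is just data (`job.ops`), new operators can be prototyped by adding a method, with no parser involved.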
##External DSL
In this case the DSL has its own syntax, which means it requires a
lexer/parser. Luckily enough, there are plenty of tools for this: the
YACC/JavaCC family of tools for JVM languages, and a more interesting approach
(never tried it, though) in Scala, with parser combinators.
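For the external-DSL route, a toy hand-written parser (Python; the grammar and keywords are invented for illustration, not a proposed syntax) shows the kind of lexing/parsing work that JavaCC or Scala parser combinators would otherwise generate, including the window clause a streaming language needs:

```python
import re

# Toy grammar: COUNT <field> FROM <stream> WINDOW <n> (SEC | TUPLES)
TOKEN = re.compile(r"COUNT|FROM|WINDOW|SEC|TUPLES|[A-Za-z0-9_.]+")

def parse(query):
    """Hand-written parser for the toy grammar; returns a small AST dict
    or raises SyntaxError on malformed input."""
    t = TOKEN.findall(query)
    if len(t) != 7 or (t[0], t[2], t[4]) != ("COUNT", "FROM", "WINDOW"):
        raise SyntaxError("expected: COUNT <field> FROM <stream> WINDOW <n> SEC|TUPLES")
    if t[6] not in ("SEC", "TUPLES"):
        raise SyntaxError("window unit must be SEC or TUPLES")
    return {"agg": "count", "field": t[1], "stream": t[3],
            "window": {"size": int(t[5]), "unit": t[6].lower()}}

print(parse("COUNT A.b FROM clicks WINDOW 30 SEC"))
# {'agg': 'count', 'field': 'A.b', 'stream': 'clicks', 'window': {'size': 30, 'unit': 'sec'}}
```

The AST is exactly the kind of structure that would then map onto a dataflow graph of Samza jobs; a combinator library would replace the token list and checks with composable grammar rules.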
So I guess a good first step would be to think carefully about which of the two
abstractions is preferable, i.e. stream-SQL or an imperative-style DSL. Note
that regardless of the choice, once there is a vertical solution for one of
them---from query to runtime---it will be possible to write one for the other.
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.
--
This message was sent by Atlassian JIRA
(v6.2#6252)