[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Chris Riccomini (JIRA) Fri, 05 Dec 2014 17:41:08 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236456#comment-14236456
 ]


Chris Riccomini commented on SAMZA-390:
---------------------------------------

[~nickpan47], I agree with pretty much everything you've described re: 
StreamSQL. Here are my notes:

# StreamSQL grammar reference is 
[here|http://www.streambase.com/developers/docs/latest/streamsql/]. SELECT docs 
are 
[here|http://www.streambase.com/developers/docs/latest/streamsql/select.html].
# Grammar supports SELECT ... INTO, which seems very desirable. With Samza, 
this would allow you to SELECT, and have the results go into a second stream. 
This translates nicely, since it allows us to express output partition keys: 
`SELECT foo, bar FROM baz INTO stream2 PARTITION BY foo;` Note that the 
PARTITION BY syntax is something that I just made up, but seems to fit well. 
Also supports ERROR INTO, which can be used to shunt failed rows into another 
stream (e.g. serialization errors).
# Oddly, they also have CREATE STREAM AS syntax (in addition to INTO syntax).
# Noticing that no grammars seem to have an ORDER BY. StreamSQL seems to be the 
first one that I've come across with this operator. Seems useful.
# Grammar borrows range syntax from CQL: `FROM TicksWithTime [SIZE 1 ON 
LocalTime PARTITION BY FeedName]` The expressiveness of this syntax seems 
pretty powerful. In this case, LocalTime are second values, and SIZE 1 means 
window just 1 second. It's unclear exactly what PARTITION BY does, since there 
is also a GROUP BY in the statement. I wonder if this defines the partitioning 
of the input stream (and repartitions if the stream is not physically 
partitioned this way already). Here's another one that is a sliding 1s window 
over 20 seconds: `[SIZE 20 ADVANCE 1 ON StartOfTimeSlice PARTITION BY FeedName]`
# Highly recommend having a look at the 
[tutorial|http://www.streambase.com/developers/docs/latest/streamsql/usingstreamsql.html].
# Stream SQL brings up the concept of hierarchical data, which is also useful 
since we support arbitrary serdes in Samza, which might include hierarchical 
JSON/AVRO/Protobuf schemas. See the [wild 
card|http://www.streambase.com/developers/docs/latest/streamsql/ssql-wildcard.html]
 page for a nice abstraction on how to expand a hierarchical field into a flat 
set of field names.
# Has the concept of a metronome/heartbeat, so tuples can be injected into the 
stream based on some fixed interval.

bq. Overall StreamSQL seems to be a candidate that is very close to what we 
want, in terms of SQL syntax (maybe more than needed)

Totally. Maybe we can project down to just what we need as an MVP?

bq. Note that I am not a big fan of a strong-typed schema here, since field 
names may be added or deleted over time.

Yea, the DDL stuff seems a bit strange to me. To fully support DDL, we'd need 
some metadata repository. A random straw man idea would be to have no DDL and 
just allow developers to say `SELECT foo, bar FROM baz;`. If foo and bar exist 
in incoming messages, then the query works. If bar doesn't exist for a row, 
then the query would fail. Two ideas for getting around the failing query would 
be to 1) ERROR INTO a new stream 2) provide a DEFAULT command (e.g. SELECT foo, 
bar DEFAULT NULL FROM baz).

bq. It has a heart beat build in the SQL syntax. It might be useful if we allow 
user to inject timer tokens.

The more I look at these grammars, the more I realize how dependent they are on 
strong timing. I think this is going to be one of the major challenges. 
MillWheel's approach to time seems the most viable, but it's annoying that you 
have to lose data to do it (they claim exactly once messaging, but that's only 
for messages that don't get dropped due to lag, which they claim is ~0.001%).

[~milinda], I think most of what [~yipan] and I are discussing tends to agree 
with your last comment, WRT partitioning, joins, windows, etc. Also, regarding 
your [link|http://www.cs.nyu.edu/rgrimm/papers/nedb11.pdf] on distributed CQL, 
this is very interesting. I think the most notable thing I got from their 
write-up was the need for an intermediate layer, which we discussed supporting 
via an intermediate relational layer. It very much matches what their 
presentation/write-up describe, except that they use [IBM's System 
S|http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534] 
as the runtime instead of Samza, and they refer to the relational layer as the 
IL (intermediate layer).

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Reply via email to