[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236456#comment-14236456
]
Chris Riccomini commented on SAMZA-390:
---------------------------------------
[~nickpan47], I agree with pretty much everything you've described re:
StreamSQL. Here are my notes:
# StreamSQL grammar reference is
[here|http://www.streambase.com/developers/docs/latest/streamsql/]. SELECT docs
are
[here|http://www.streambase.com/developers/docs/latest/streamsql/select.html].
# Grammar supports SELECT ... INTO, which seems very desirable. With Samza,
this would allow you to SELECT, and have the results go into a second stream.
This translates nicely, since it allows us to express output partition keys:
`SELECT foo, bar FROM baz INTO stream2 PARTITION BY foo;` Note that the
PARTITION BY syntax is something that I just made up, but seems to fit well.
Also supports ERROR INTO, which can be used to shunt failed rows into another
stream (e.g. serialization errors).
# Oddly, they also have CREATE STREAM AS syntax (in addition to INTO syntax).
# Noticing that no grammars seem to have an ORDER BY. StreamSQL seems to be the
first one that I've come across with this operator. Seems useful.
# Grammar borrows range syntax from CQL: `FROM TicksWithTime [SIZE 1 ON
LocalTime PARTITION BY FeedName]` The expressiveness of this syntax seems
pretty powerful. In this case, LocalTime are second values, and SIZE 1 means
window just 1 second. It's unclear exactly what PARTITION BY does, since there
is also a GROUP BY in the statement. I wonder if this defines the partitioning
of the input stream (and repartitions if the stream is not physically
partitioned this way already). Here's another one that is a sliding 1s window
over 20 seconds: `[SIZE 20 ADVANCE 1 ON StartOfTimeSlice PARTITION BY FeedName]`
# Highly recommend having a look at the
[tutorial|http://www.streambase.com/developers/docs/latest/streamsql/usingstreamsql.html].
# Stream SQL brings up the concept of hierarchical data, which is also useful
since we support arbitrary serdes in Samza, which might include hierarchical
JSON/AVRO/Protobuf schemas. See the [wild
card|http://www.streambase.com/developers/docs/latest/streamsql/ssql-wildcard.html]
page for a nice abstraction on how to expand a hierarchical field into a flat
set of field names.
# Has the concept of a metronome/heartbeat, so tuples can be injected into the
stream based on some fixed interval.
bq. Overall StreamSQL seems to be a candidate that is very close to what we
want, in terms of SQL syntax (maybe more than needed)
Totally. Maybe we can project down to just what we need as an MVP?
bq. Note that I am not a big fan of a strong-typed schema here, since field
names may be added or deleted over time.
Yea, the DDL stuff seems a bit strange to me. To fully support DDL, we'd need
some metadata repository. A random straw man idea would be to have no DDL and
just allow developers to say `SELECT foo, bar FROM baz;`. If foo and bar exist
in incoming messages, then the query works. If bar doesn't exist for a row,
then the query would fail. Two ideas for getting around the failing query would
be to 1) ERROR INTO a new stream 2) provide a DEFAULT command (e.g. SELECT foo,
bar DEFAULT NULL FROM baz).
bq. It has a heart beat build in the SQL syntax. It might be useful if we allow
user to inject timer tokens.
The more I look at these grammars, the more I realize how dependent they are on
strong timing. I think this is going to be one of the major challenges.
MillWheel's approach to time seems the most viable, but it's annoying that you
have to lose data to do it (they claim exactly once messaging, but that's only
for messages that don't get dropped due to lag, which they claim is ~0.001%).
[~milinda], I think most of what [~yipan] and I are discussing tends to agree
with your last comment, WRT partitioning, joins, windows, etc. Also, regarding
your [link|http://www.cs.nyu.edu/rgrimm/papers/nedb11.pdf] on distributed CQL,
this is very interesting. I think the most notable thing I got from their
write-up was the need for an intermediate layer, which we discussed supporting
via an intermediate relational layer. It very much matches what their
presentation/write-up describe, except that they use [IBM's System
S|http://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534]
as the runtime instead of Samza, and they refer to the relational layer as the
IL (intermediate layer).
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)