[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Chris Riccomini (JIRA) Tue, 02 Dec 2014 17:30:46 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232437#comment-14232437
 ]


Chris Riccomini commented on SAMZA-390:
---------------------------------------

I took a look at the CQL paper. Here are my notes:

# The IStream/RStream/DStream model seems very elegant from an academic 
perspective, but I am a bit worried about how easy it will be for the average 
user to pick up and use. I found the linear-road example to be somewhat 
convoluted. It's certainly not the case that you could say, "If you know SQL, 
this is a breeze to pick up." I'm mostly comparing this against my mental model 
of what I expected. I want to compare the CQL syntax to the syntax of 
[StreamBase|http://www.streambase.com/developers/docs/latest/streamsql/], 
[Azure Stream 
Analytics|http://msdn.microsoft.com/en-us/library/dn834998.aspx?WT.mc_id=Blog_SQL_Announce_DI],
 [tigon.io|http://tigon.io/], and [Esper|http://esper.codehaus.org/]. I have a 
feeling that some of the alternatives might be much more intuitive.
# I really like having relations (tables) baked in as a first class citizen. In 
addition to the 3 "pros" listed in the paper, I think it's really compelling 
for doing joins against tables. If you were to bolt on external key-value 
stores (e.g. Voldemort, Cassandra, etc) for joins, it would fit naturally into 
this model as well. I think Jay had a similar comment above.
# I have some concerns about performance with \[Range Unbounded\], and RStreams 
over any range other than Now. It seems to me that this state could get 
prohibitively large.
# There's not much of a partitioning model in CQL. We'd need to bolt something 
on top for Samza/Kafka.
# Requirements on time seem very strong. We should compare with 
[MillWheel|http://research.google.com/pubs/pub41378.html] to see how they 
handle time. I think MillWheel defines some heuristics that provide the 
guarantee CQL requires (i.e. once you've moved past time T, you'll never get a 
message for time <= T again). It might be possible to follow MillWheel's strategy when 
using CQL in order to get strong time guarantees on top of a distributed 
system. I also found Ben's comments on a time topic to be interesting. I'll 
need to think more about that.
# The state model in STREAM is not nearly as isolated as it is in Samza. 
Operators may reference other operators' state to reduce state 
size/duplication. Exposing task state for other tasks within a query topology 
seems like another use case that we should add to SAMZA-316.
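
To make the first and third notes concrete, here is a minimal sketch of the three relation-to-stream operators and of how window choice drives state size. It is an illustration only, not code from STREAM or Samza. It assumes integer timestamps, models a relation as a set rather than the bag the paper uses, and all function names (window, istream, dstream, rstream, run) are made up for this example.

```python
from typing import FrozenSet, Hashable, List, Optional, Tuple

Relation = FrozenSet[Hashable]
Event = Tuple[int, Hashable]  # (timestamp, value)


def window(events: List[Event], now: int, range_: Optional[int]) -> Relation:
    """Stream-to-relation: values whose timestamp falls in the window at `now`.

    range_=None models [Range Unbounded]; range_=0 models [Now].
    """
    return frozenset(
        v for ts, v in events
        if ts <= now and (range_ is None or ts >= now - range_)
    )


def istream(prev: Relation, curr: Relation) -> Relation:
    """Tuples inserted into the relation since the previous instant."""
    return curr - prev


def dstream(prev: Relation, curr: Relation) -> Relation:
    """Tuples deleted from the relation since the previous instant."""
    return prev - curr


def rstream(prev: Relation, curr: Relation) -> Relation:
    """The entire relation at the current instant."""
    return curr


def run(events: List[Event], horizon: int, range_: Optional[int]):
    """Step through instants 1..horizon and record the operator outputs.

    Each row is (t, istream output, dstream output, size of retained state).
    """
    prev: Relation = frozenset()
    rows = []
    for t in range(1, horizon + 1):
        curr = window(events, t, range_)
        rows.append((t, istream(prev, curr), dstream(prev, curr), len(curr)))
        prev = curr
    return rows
```

With events [(1, "a"), (2, "b"), (3, "c")] and a horizon of 4, run(events, 4, None) keeps 1, 2, 3, 3 tuples of state, growing with the stream, while run(events, 4, 0) keeps 1, 1, 1, 0. Since rstream emits the whole relation at every instant, its output over any window wider than Now is proportional to that retained state.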

I find the CQL language to be pretty well thought out, but a bit cumbersome to 
reason about/implement in a distributed system. I'm going to dig into some of 
the other systems (listed above) to see how their grammars compare.
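
As a starting point for the time discussion, the guarantee noted above (once you've moved past time T, you never see a message for time <= T again) could be approximated over a partitioned input with a low-watermark tracker. This is a rough sketch, loosely inspired by the idea of a low watermark and not MillWheel's actual algorithm or API. The class name, the allowed_lateness parameter, and the policy of rejecting late messages are all assumptions for illustration.

```python
from typing import Dict, Hashable, Iterable, Optional


class LowWatermark:
    """Tracks a watermark as the minimum of per-partition maximum timestamps.

    Hypothetical helper: a message with timestamp <= watermark is treated as
    late and rejected, so downstream operators never see time move backwards.
    """

    def __init__(self, partitions: Iterable[Hashable], allowed_lateness: int = 0):
        self.max_seen: Dict[Hashable, Optional[int]] = {p: None for p in partitions}
        self.allowed_lateness = allowed_lateness

    def current(self) -> Optional[int]:
        """Return the watermark, or None until every partition has reported."""
        if any(ts is None for ts in self.max_seen.values()):
            return None
        return min(self.max_seen.values()) - self.allowed_lateness

    def observe(self, partition: Hashable, ts: int) -> bool:
        """Record a message; return False if it arrived at or behind the watermark."""
        wm = self.current()
        if wm is not None and ts <= wm:
            return False  # late: the system has already moved past this time
        prev = self.max_seen[partition]
        # Per-partition maxima only increase, so the watermark is monotonic.
        self.max_seen[partition] = ts if prev is None else max(prev, ts)
        return True
```

The watermark only advances when the slowest partition advances, so one idle partition stalls it indefinitely. That is the kind of case where heuristics (timeouts, injected heartbeats) would be needed, and where the guarantee becomes weaker than what CQL assumes.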

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
