[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Chris Riccomini (JIRA) Wed, 03 Dec 2014 13:48:12 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233567#comment-14233567
 ]


Chris Riccomini commented on SAMZA-390:
---------------------------------------

Thanks for the write-up [~nickpan47]. I'll have a look at Tigon.

There seem to be three main layers:

* SQL grammar.
* Relational algebra.
* Actual implementation of relational operators.

I agree with you that CQL's most interesting contribution seems to be its 
stream-relation model. I'm not crazy about its grammar, and it only provides a 
basic STREAM single-node unreplicated/unpartitioned reference implementation. 
If we buy that we want to use CQL's relational model, then the next questions I 
want to look at are:

# Can we find a better SQL grammar that still fits the same underlying 
relational model?
# Can we find streaming implementations that are distributed/partitioned, but 
provide the strong timing guarantees that CQL's relational model requires?

For (1), I've taken a look at Azure, and will also have a look at 
Tigon/StreamSQL.

For (2), MillWheel seems interesting. Will have to dig around.

Here are my notes on Azure:

# I like the TIMESTAMP BY syntax in Azure. It seems more flexible than a rigid 
timestamp field enforced in the data model. It also means a single stream can 
have multiple timestamp fields, rather than having to re-materialize messages 
every time a new field should be used as the timestamp.
# Azure again uses the Linear Road example. Seems to be standard practice.
# Azure's SELECT has an explicit PARTITION BY clause 
(http://msdn.microsoft.com/en-us/library/dn835022.aspx).
# There seems to be a fully defined formal grammar for Azure in their reference 
docs (http://msdn.microsoft.com/en-us/library/dn835022.aspx).
# It's interesting that you can't SELECT \* in a join 
(http://msdn.microsoft.com/en-us/library/dn835026.aspx). I haven't thought 
about why.
# The supported data types seem to be quite primitive 
(http://msdn.microsoft.com/en-us/library/dn835065.aspx). The grid at the bottom 
of the page hints that our thinking about having a single data model and 
translating to/from underlying Serdes (SAMZA-484) might be a reasonable 
approach. They outline how their data types are converted to/from Avro, JSON, 
etc.
# [This|http://azure.microsoft.com/en-us/documentation/articles/stream-analytics-scale-jobs/] page describes how Azure's partitioning works.
# The parallelism model for Azure is manual. You have to manually define how 
many input stream partitions and "streaming units" (somewhat like containers) 
you want.
# Like Samza, Azure requires a user to be explicitly aware of partition count 
and key information: "If you are joining two streams, please ensure that the 
streams are partitioned by the partition key of the column that you do the 
joins, and you have the same number of partitions in both streams."
# Azure doesn't really have a concept of a row-based sliding window (e.g. last 
50 rows, or [Rows 50] in CQL). The closest that I can find is the [Sliding 
window|http://msdn.microsoft.com/en-us/library/dn835051.aspx], which operates 
at a 100ns interval, but still could theoretically jump from 4 rows to 6 during 
the epsilon hop.
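
To make the row-based vs. time-based distinction in the last note concrete, here is a minimal sketch. It is not Azure's or CQL's implementation: the class names are made up for illustration, timestamps are plain integers, and rows are assumed to arrive in timestamp order.

```python
from collections import deque


class RowWindow:
    """Row-based window: holds at most the last `size` rows (like [Rows N] in CQL)."""

    def __init__(self, size):
        # A bounded deque drops the oldest row automatically once it is full.
        self.rows = deque(maxlen=size)

    def add(self, timestamp, value):
        self.rows.append((timestamp, value))
        return [v for _, v in self.rows]


class TimeWindow:
    """Time-based window: holds rows newer than (newest timestamp - duration)."""

    def __init__(self, duration):
        self.duration = duration
        self.rows = deque()

    def add(self, timestamp, value):
        self.rows.append((timestamp, value))
        # Evict rows that have aged out relative to the newest row.
        while self.rows[0][0] <= timestamp - self.duration:
            self.rows.popleft()
        return [v for _, v in self.rows]


# Three rows share timestamp 2, then there is a gap until timestamp 10.
events = [(0, "a"), (1, "b"), (2, "c"), (2, "d"), (2, "e"), (10, "f")]

row_window = RowWindow(size=3)
time_window = TimeWindow(duration=3)
row_sizes = [len(row_window.add(t, v)) for t, v in events]
time_sizes = [len(time_window.add(t, v)) for t, v in events]

print(row_sizes)   # [1, 2, 3, 3, 3, 3]
print(time_sizes)  # [1, 2, 3, 4, 5, 1]
```

The row window never exceeds three rows, while the time window's row count depends entirely on how many rows fall inside the interval, so it can grow or shrink by several rows in a single step.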

Things I like:

* As expected, I found the grammar to be much more intuitive/approachable than 
CQL. Azure's grammar is a variation of T-SQL that introduces a few extra 
stream-related operators.
* TIMESTAMP BY seems like a nice way to support timestamps.
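
As a rough illustration of why choosing the timestamp at query time is flexible, the sketch below reads the timestamp from whichever field the query names, without rewriting the messages. This is only an analogy in Python, not Azure's TIMESTAMP BY implementation; the function name and the `created_at`/`received_at` fields are hypothetical.

```python
def timestamp_by(messages, field):
    """Yield (timestamp, message) pairs, taking the timestamp from `field`."""
    for message in messages:
        yield message[field], message


# Hypothetical messages carrying two candidate timestamp fields.
messages = [
    {"id": 1, "created_at": 100, "received_at": 105},
    {"id": 2, "created_at": 101, "received_at": 103},
]

# The same stream, ordered by two different timestamp fields.
by_created = sorted((ts, m["id"]) for ts, m in timestamp_by(messages, "created_at"))
by_received = sorted((ts, m["id"]) for ts, m in timestamp_by(messages, "received_at"))

print(by_created)   # [(100, 1), (101, 2)]
print(by_received)  # [(103, 2), (105, 1)]
```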

Things I don't like:

* Unlike CQL, Azure seems to have no concept of tables, which seems to make 
joining a stream against a table impossible. Given Samza's state management, it 
seems that supporting tables explicitly in the grammar would be nice.
* Lack of row-based sliding window.
* Inability to specify a partition key from within the SELECT ... FROM 
statement. You're only allowed to partition by "PartitionId" right now. This is 
a missing feature that's yet to be implemented, but I presume will be shortly.
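
On the partitioning points (the join requirement quoted in the Azure notes and the partition-key limitation above), the sketch below shows why both streams need the same partition key and the same partition count for a partition-local join. It assumes a simple modulo partitioner over integer keys; the function names are hypothetical, and real systems typically hash the key first.

```python
def partition_for(key: int, num_partitions: int) -> int:
    """Assumed partitioner: map an integer join key to a partition by modulo."""
    return key % num_partitions


def join_is_local(keys, left_partitions: int, right_partitions: int) -> bool:
    """True if every key lands on the same partition number in both streams."""
    return all(
        partition_for(k, left_partitions) == partition_for(k, right_partitions)
        for k in keys
    )


keys = range(100)
print(join_is_local(keys, 4, 4))  # True: equal counts, matching rows are co-located
print(join_is_local(keys, 4, 6))  # False: e.g. key 5 maps to partition 1 vs. 5
```

With mismatched partition counts, rows for the same key end up in different partitions, so a task that only sees partition N of each stream would miss matches.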

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
