[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233567#comment-14233567
]
Chris Riccomini commented on SAMZA-390:
---------------------------------------
Thanks for the write-up [~nickpan47]. I'll have a look at Tigon.
There seem to be three main layers:
* SQL grammar.
* Relational algebra.
* Actual implementation of relational operators.
I agree with you that CQL's most interesting contribution seems to be its
stream-relation model. I'm not crazy about its grammar, and it only provides a
basic STREAM single-node unreplicated/unpartitioned reference implementation.
If we buy that we want to use CQL's relational model, then the next questions I
want to look at are:
# Can we find a better SQL grammar that still fits the same underlying
relational model?
# Can we find streaming implementations that are distributed/partitioned, but
provide the strong timing guarantees that CQL's relational model requires?
For (1), I've taken a look at Azure, and will also have a look at
Tigon/StreamSQL.
For (2), MillWheel seems interesting. Will have to dig around.
Here are my notes on Azure:
# I like the TIMESTAMP BY syntax in Azure. It seems more flexible than a rigid
timestamp field enforced in the data model. It also means a single stream can
have multiple timestamp fields, rather than having to re-materialize messages
every time a new field should be used as the timestamp.
# Azure again uses the linear road example. Seems to be standard practice.
# Azure's SELECT has an explicit PARTITION BY clause
(http://msdn.microsoft.com/en-us/library/dn835022.aspx).
# There seems to be a fully defined formal grammar for Azure in their reference
docs (http://msdn.microsoft.com/en-us/library/dn835022.aspx).
# It's interesting that you can't SELECT \* in a join
(http://msdn.microsoft.com/en-us/library/dn835026.aspx). I haven't thought
about why.
# The data types supported seem to be quite primitive
(http://msdn.microsoft.com/en-us/library/dn835065.aspx). The grid at the bottom
of the page hints that our thinking about having a single data model and
translating to/from underlying Serdes (SAMZA-484) might be a reasonable
approach. They outline how their data types are converted to/from Avro, JSON,
etc.
# [This|http://azure.microsoft.com/en-us/documentation/articles/stream-analytics-scale-jobs/]
page describes how Azure's partitioning works.
# The parallelism model for Azure is manual. You have to manually define how
many input stream partitions and "streaming units" (somewhat like containers)
you want.
# Like Samza, Azure requires a user to be explicitly aware of partition count
and key information: "If you are joining two streams, please ensure that the
streams are partitioned by the partition key of the column that you do the
joins, and you have the same number of partitions in both streams."
# Azure doesn't really have a concept of a row-based sliding window (e.g. last
50 rows, or [Rows 50] in CQL). The closest that I can find is the [Sliding
window|http://msdn.microsoft.com/en-us/library/dn835051.aspx], which operates
at a 100ns interval, but still could theoretically jump from 4 rows to 6 during
the epsilon hop.
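To make the row-based vs. time-based distinction concrete, here is a minimal Python sketch. It is neither Azure nor CQL code: the function names, the window widths, and the sample events are all made up for illustration. It shows that a row-based window never holds more than N rows, while the row count of a time-based window depends on how events cluster in time.

```python
from collections import deque

def row_window(events, n):
    """Yield the last n events after each arrival (in the spirit of [Rows n])."""
    window = deque(maxlen=n)
    for event in events:
        window.append(event)
        yield list(window)

def time_window(events, width):
    """Yield the events whose timestamp lies in (newest - width, newest]."""
    window = deque()
    for ts, value in events:
        window.append((ts, value))
        # Evict events that have fallen out of the time interval.
        while window and window[0][0] <= ts - width:
            window.popleft()
        yield list(window)

# Hypothetical (timestamp, value) pairs; two events share timestamp 5.
events = [(1, "a"), (2, "b"), (5, "c"), (5, "d"), (9, "e")]

row_sizes = [len(w) for w in row_window(events, 2)]
time_sizes = [len(w) for w in time_window(events, 3)]
print(row_sizes)   # [1, 2, 2, 2, 2]
print(time_sizes)  # [1, 2, 1, 2, 1]
```

The time-based sizes shrink and grow with the gaps between events, so a query that needs "exactly the last N rows" cannot be expressed with a time interval alone.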
Things I like:
* As expected, I found the grammar to be much more intuitive/approachable than
CQL. Azure's grammar is a variation of T-SQL that introduces a few extra
stream-related operators.
* TIMESTAMP BY seems like a nice way to support timestamps.
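As a rough illustration of why a per-query timestamp selector is attractive, the Python sketch below windows the same messages by two different fields without rewriting them. This is not Azure's implementation; `tumbling_counts`, the `created`/`received` field names, and the 10-second width are assumptions made up for the example.

```python
from collections import Counter

def tumbling_counts(messages, timestamp_by, width):
    """Count messages per tumbling window; `timestamp_by` picks the event time."""
    counts = Counter()
    for msg in messages:
        # Round the chosen timestamp down to the start of its window.
        window_start = (timestamp_by(msg) // width) * width
        counts[window_start] += 1
    return dict(counts)

# Hypothetical messages carrying two candidate timestamp fields (seconds).
messages = [
    {"created": 3, "received": 11},
    {"created": 7, "received": 12},
    {"created": 14, "received": 25},
]

by_created = tumbling_counts(messages, lambda m: m["created"], width=10)
by_received = tumbling_counts(messages, lambda m: m["received"], width=10)
print(by_created)   # {0: 2, 10: 1}
print(by_received)  # {10: 2, 20: 1}
```

With a timestamp fixed in the data model, the second query would first require re-materializing the stream with `received` copied into the timestamp slot.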
Things I don't like:
* Unlike CQL, seems to have no concept of tables. Seems to make joining a
stream against a table impossible. Given Samza's state management, it seems
that supporting tables explicitly in the grammar would be nice.
* Lack of row-based sliding window.
* Inability to specify a partition key from within the SELECT ... FROM
statement. You're only allowed to partition by "PartitionId" right now. This is
a missing feature that's yet to be implemented, but I presume will be shortly.
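The partition-key concern ties back to the join requirement quoted above. The Python sketch below is a toy model under stated assumptions, not Samza or Azure code: `partition_for` is a hypothetical modulo partitioner, and the join only matches rows that share a partition number, the way a task that reads partition i of both streams would. It shows a join silently dropping a match when the two streams have different partition counts.

```python
from collections import defaultdict

def partition_for(key, num_partitions):
    """Hypothetical partitioner: integer key modulo partition count."""
    return key % num_partitions

def partitioned_join(left, right, left_parts, right_parts):
    """Join (key, value) pairs, matching only rows in the same partition number."""
    right_by_partition = defaultdict(list)
    for key, value in right:
        right_by_partition[partition_for(key, right_parts)].append((key, value))

    joined = []
    for key, lvalue in left:
        p = partition_for(key, left_parts)
        # Only the right-hand rows in partition p are visible to this task.
        for rkey, rvalue in right_by_partition[p]:
            if rkey == key:
                joined.append((key, lvalue, rvalue))
    return joined

# Hypothetical streams keyed by an integer user id.
left = [(1, "click"), (2, "click"), (5, "click")]
right = [(1, "alice"), (2, "bob"), (5, "carol")]

same_count = partitioned_join(left, right, 4, 4)
diff_count = partitioned_join(left, right, 4, 3)
print(len(same_count))  # 3
print(len(diff_count))  # 2
```

With 4 and 3 partitions, key 5 lands in partition 1 on the left but partition 2 on the right, so its match is lost. Being able to declare the partition key in the query would let the system check or enforce co-partitioning instead of leaving it to the user.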
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.