Hi all,

This is my first post to the Samza list. I heard from Chris and Jay
that you guys were looking into putting a SQL interface on Samza, so I
thought I'd take a look.

My background is in the SQL world, most recently with Apache Calcite
(although I have quite a lot of experience with streaming too), so
forgive me if I am speaking a foreign language or seem to be coming at
this from a completely different direction. Also forgive me if I have
missed preceding discussions and I am opening up areas that have been
settled already.

I was surprised that one of the first goals is to create a SQL API.
SQL is a textual language; a lot of the nuance (e.g. scope of
identifiers) is lost when you convert it to a linear builder API. Now,
it definitely makes sense to have a SQL AST (abstract syntax tree),
that can be created by hand-written code or by a parser. And you can
create an AST builder, if you like. But there is not a simple mapping
between true SQL and a data-flow graph that you can execute. If you
imagine that there is a simple mapping, you will achieve great results
with simple SELECT-FROM-WHERE queries but hit the wall when you hit
the hard stuff. You will end up -- as so many others have -- with a
SQL-like language. Close but no cigar.

Case in point: Spark (and Spark Streaming) has a SQL-like API that
looks similar to the proposed Samza API, and now they are building
SparkSQL from the ground up.

I think the way to approach this is to have a SQL parser and a logical
algebra. The logical algebra looks very similar to relational algebra,
maybe with one or two extensions for streaming. (A lot of SQL features
-- such as query blocks, sub-queries, correlated variables, aliases,
views and the HAVING clause -- are not present in the algebra.)
Between the parser and the logical algebra is an AST, a validator, and
a translator from the AST to the algebra. And then there is a physical
algebra, which is Samza of course.
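
To make the layering concrete, here is a minimal sketch in Python of an
AST and a logical algebra kept as separate things, with a translator
between them. Every name in it (SelectQuery, Scan, Filter, Aggregate,
Project, translate) is made up for illustration; it is not Calcite's or
Samza's API, expressions are left as plain strings, and the validator
step is skipped entirely.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


# --- AST: mirrors the clauses of the query text (hypothetical) ---
@dataclass(frozen=True)
class SelectQuery:
    select: Tuple[str, ...]
    from_table: str
    where: Optional[str] = None
    group_by: Tuple[str, ...] = ()
    having: Optional[str] = None


# --- Logical algebra: operators only, no WHERE/HAVING distinction ---
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: "Rel"
    condition: str


@dataclass(frozen=True)
class Aggregate:
    input: "Rel"
    keys: Tuple[str, ...]
    calls: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    input: "Rel"
    exprs: Tuple[str, ...]


Rel = Union[Scan, Filter, Aggregate, Project]


def translate(q: SelectQuery) -> Rel:
    """Translate a (pre-validated) AST into a tree of logical operators."""
    if q.having is not None and not q.group_by:
        # Legal SQL, but deliberately left out of this sketch.
        raise NotImplementedError("HAVING without GROUP BY")
    rel: Rel = Scan(q.from_table)
    if q.where is not None:
        # WHERE becomes a Filter below the Aggregate.
        rel = Filter(rel, q.where)
    if q.group_by:
        # Assumption: any select item that is not a grouping key
        # is an aggregate call such as COUNT(*).
        calls = tuple(e for e in q.select if e not in q.group_by)
        rel = Aggregate(rel, q.group_by, calls)
        if q.having is not None:
            # HAVING becomes an ordinary Filter above the Aggregate.
            rel = Filter(rel, q.having)
    return Project(rel, q.select)


# SELECT productId, COUNT(*) FROM Orders
# WHERE units > 10 GROUP BY productId HAVING COUNT(*) > 5
query = SelectQuery(
    select=("productId", "COUNT(*)"),
    from_table="Orders",
    where="units > 10",
    group_by=("productId",),
    having="COUNT(*) > 5",
)
plan = translate(query)
```

The point of the sketch is that HAVING exists only in the AST: after
translation, both WHERE and HAVING are plain Filter operators, and the
only difference between them is whether they sit below or above the
Aggregate.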

Maybe the proposed SQL object model is in fact that logical algebra.
But I'd recommend that you not call it SQL; in fact it should be a
non-goal that an end-user would use that API and think that they are
in any way creating a "SQL query".

Julian
