[DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Jungtaek Lim Fri, 21 Oct 2016 00:19:02 -0700

Hi devs,

Sorry to send multiple mails at once, I had been resolving issues
sequentially, and now stopped a bit and retrospect about the direction of
Storm SQL.


I'd like to propose destructive actions, dropping features about GROUP BY
and JOIN from Storm SQL which are fortunately not released yet.

The reason of dropping features is simple: This borrows Trident semantic
(within micro-batch, or stateful), and not making sense of true "streaming"
semantic.

Spark and Flink interpret "streaming" aggregation and join as windowed
operators. Since there's no SQL standard for streaming (even no de-facto),
they are adding the feature to its API (Structured Streaming for Spark, and
Table API for Flink), and don't address them to SQL side yet.

I was eager to add more features on Storm SQL to make progress (even Bobby
pointed out similarly), but after worked on these things, I change my mind
that letting users not confusing is more important than adding features.

Btw, Storm SQL "temporary" relies on Trident since we don't have
higher-level API on core and we don't want to build topology from ground
up. AFAIK, choosing Trident is not for living with micro-batch, and IMHO it
should run on per-tuple streaming manner instead of micro-batch.
Integrating streams API to Storm SQL could be great internal project for
POC of streams API. Exactly-once needs to be addressed before.

"GROUP BY" is also what SQE supports now (SQE aggregates this stateful and
exactly-once way), so I would like to hear our opinions regarding this.

Flink and Storm is waiting for Calcite to make progress on Streaming SQL:
https://calcite.apache.org/docs/stream.html (For now most of definitions
are not implemented yet.)
This means that we might not support Streaming SQL semantics in SQL
statement unless Calcite finishes their work. I think this is OK since
there're many other works left on Storm SQL, and Storm SQL is now in
experimental anyway (The state of Spark Structured Streaming and Flink
Streaming SQL are also alpha or experiment.)

While waiting, we might want to have LINQ style API like Table API and
address aggregate and join from there, but it requires huge amount of works
and it's a kind of duplicated works with streams API (STORM-1961
<https://issues.apache.org/jira/browse/STORM-1961>) in terms of adding
high-level API. IMHO, if streams API is well defined, it should be fairly
easy and not necessary need to have LINQ style API. (though someone feels
more convenient to use 'select', 'where', and so on.)

Please share your opinion about this. Especially I'd like to see JW Player
participating discussion, since aggregation is already supported by SQE.

Thanks for reading a quite long thread.

Thanks,
Jungtaek Lim (HeartSaVioR)

[DISCUSS] Drop features about GROUP BY, JOIN from Storm SQL Trident mode

Reply via email to