In the context of supporting a high level API and supporting Apache Beam, I
think we should also think about a possible SQL-like language that can
specify the stream processing with the concepts of session windowing,
watermarking, triggering, etc.
Taking from the this article at
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102:
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(
AtWatermark()
.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.withAllowedLateness(Duration.standardMinutes(1)))
.apply(Sum.integersPerKey());
We might want to design a SQL-like language users can write on the fly
without having to compile java code. Something like:
SELECT SUM(value) FROM input INTO output WHERE <filter condition> GROUP BY
key FIXED WINDOWS INTERVAL '2' MINUTE AT WATERMARK WITH EARLY FIRING AT
INTERVAL '1' MINUTE WITH LATE FIRINGS AT COUNT '1' WITH ALLOWED LATENESS
INTERVAL '1' MINUTE
We can also allow users to implement and register custom "SQL functions"
this way to allow custom processing.
David
On Mon, May 9, 2016 at 11:59 AM, Thomas Weise <[email protected]>
wrote:
> I just attended the talk by Julian Hyde at Apache Big Data. The
> presentation was similar to the one at the Hadoop Summit here:
>
> http://www.slideshare.net/julianhyde/querying-the-internet
> -of-things-streaming-sql-on-kafkasamza-and-stormtrident
>
> We discussed SQL at various occasions, Calcite integration seems the way to
> go. My current idea is that a SQL query can be applied on top of a base
> application or used to define a SubDAG, and introduce additional operators.
> It also plays into the interactive query capability (app data).
>
> There is an existing JIRA and maybe it is time to get to the next level of
> discussion:
>
> https://issues.apache.org/jira/browse/APEXMALHAR-1818
>
> Julian has expressed interest to work with us on this.
>
> Thanks,
> Thomas
>