+1 to supporting EMIT on the Beam side first if we cannot get it into Calcite in the short term (see #1, #2). I'm open to any format, either the one above or something like the one below. The tricky question is: what is the expected behavior for a complex query with more than one GBK operator?

EMIT <INTERVAL '1' MINUTE> | <INTERVAL '100' ROW> [ACCUMULATE|DISCARD] [INSERT INTO ...] SELECT ...

#1 https://sematext.com/opensee/m/Calcite/FR3K9JVAl32VULr6?subj=Towards+a+spec+for+robust+streaming+SQL+Part+1
#2 https://sematext.com/opensee/m/Beam/gfKHFFDd4i1I3nZc2?subj=Towards+a+spec+for+robust+streaming+SQL+Part+2

On Mon, Aug 19, 2019 at 12:02 PM Rui Wang <[email protected]> wrote:

> To update this idea, I think we can go a step further and support the EMIT
> syntax from the one-sql-to-rule-them-all paper [1].
>
> EMIT will allow periodically delayed stream materialization. For a stream
> view, it means we will add support to sinks to keep generating a changelog
> table. For a view only, it means we will add support to sinks to
> periodically generate a compacted table from the changelog table.
>
> Regarding SQL, a typical query like the following should run:
>
> WITH joined_table AS (SELECT * FROM S1 JOIN S2)
> SELECT XX FROM HOP(joined_table)
> EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR
>
> By doing so, retractions will be much more useful for SQL from a product
> perspective, because we can have a meaningful end-to-end SQL pipeline.
>
> [1]: https://arxiv.org/pdf/1905.12133.pdf
>
> -Rui
>
> On Mon, Aug 12, 2019 at 11:30 PM Rui Wang <[email protected]> wrote:
>
>> Hi Community,
>>
>> BeamSQL currently does not support unbounded-unbounded joins with a
>> non-default trigger. This is because:
>>
>> - Discarding mode does not work for outer joins because it lacks the
>> ability to retract pre-emitted values. Think of an example in which a
>> tuple of (left_row, null) needs to be retracted if the matching
>> right_row appears after the last trigger fired.
>> - Accumulating mode *theoretically* can support unbounded-unbounded joins
>> because it is supposed to always "overwrite" the previous result. In
>> practice, however, such overwriting is too expensive for join use cases.
>> It would be much more efficient if small changes in the join inputs
>> caused only small changes for downstream to compute.
>> - Neither discarding mode nor accumulating mode is sufficient to refine
>> materialized data.
>>
>> Meanwhile, [1] has kicked off a discussion on retractions in the Beam
>> model. I have been collecting people's feedback, and generally speaking
>> people agree that retractions are useful for some use cases.
>>
>> Thus I propose to combine SQL join with retractions to support
>> multiple-triggering SQL join.
>>
>> I think SQL join is a good start for supporting retractions in Beam, with
>> the following caveats:
>> 1. Multiple-triggering SQL join is a useful feature.
>> 2. SQL join is an opportunity for us to figure out the implementation
>> details of retraction by building it for a well-defined use case.
>> 3. Supporting retraction should not cause performance regressions in
>> existing pipelines, or require changes to existing pipelines.
>>
>> What do you think?
>>
>> [1]:
>> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>>
>> -Rui
>>
>

--
----
Mingmin
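
To make the (left_row, null) retraction case from the quoted proposal concrete, here is a minimal Python sketch of an incremental left outer join that emits a changelog of inserts and retractions, plus a helper that compacts the changelog into a table. It is only an illustration under my own assumptions: the function names, the ("L"/"R", key, value) event shape, and the "+"/"-" markers are hypothetical and are not Beam or Calcite APIs, and windowing and triggers are ignored entirely.

```python
from collections import Counter, defaultdict


def left_outer_join_changelog(events):
    """Incrementally join a stream of ("L" | "R", key, value) events.

    Yields (op, (left_value, right_value)) where op is "+" for an
    insert and "-" for a retraction of a previously emitted row.
    """
    left = defaultdict(list)   # left values seen so far, per key
    right = defaultdict(list)  # right values seen so far, per key
    for side, key, value in events:
        if side == "L":
            left[key].append(value)
            if right[key]:
                for r in right[key]:
                    yield ("+", (value, r))
            else:
                # No match yet: emit the null-padded row for now.
                yield ("+", (value, None))
        else:
            first_match = not right[key]
            right[key].append(value)
            for l in left[key]:
                if first_match:
                    # The null-padded row emitted earlier is now wrong.
                    yield ("-", (l, None))
                yield ("+", (l, value))


def compact(changelog):
    """Fold a changelog into the rows that are still live (with counts)."""
    counts = Counter()
    for op, row in changelog:
        counts[row] += 1 if op == "+" else -1
    return {row: n for row, n in counts.items() if n > 0}


events = [("L", 1, "a"), ("R", 1, "x"), ("R", 1, "y"), ("L", 2, "b")]
changelog = list(left_outer_join_changelog(events))
table = compact(changelog)
```

In this sketch each new input produces only the rows it affects, so a small input change leads to a small downstream change, and `compact` plays the role of the periodically generated compacted table.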
