To follow up on this idea, I think we can go a step further and support the EMIT syntax from the "One SQL to Rule Them All" paper [1].
EMIT will allow periodically delayed stream materialization. For a stream view (EMIT STREAM), it means we will add support to sinks to keep generating a changelog table. For a table view (without STREAM), it means we will add support to sinks to periodically generate a compacted table from the changelog table.

On the SQL side, a typical query like the following should run:

    WITH joined_table AS (SELECT * FROM S1 JOIN S2)
    SELECT XX FROM HOP(joined_table)
    EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR

By doing so, retractions become much more useful for SQL from a product perspective, because we can have a meaningful end-to-end SQL pipeline.

[1]: https://arxiv.org/pdf/1905.12133.pdf

-Rui

On Mon, Aug 12, 2019 at 11:30 PM Rui Wang <[email protected]> wrote:

> Hi Community,
>
> BeamSQL currently does not support unbounded-unbounded joins with a
> non-default trigger, for the following reasons:
>
> - Discarding mode does not work for outer joins because there is no way
>   to retract pre-emitted values. As an example, a tuple of
>   (left_row, null) needs to be retracted if the matching right_row
>   appears after the last trigger firing.
> - Accumulating mode *theoretically* can support unbounded-unbounded joins
>   because it is supposed to always "overwrite" the previous result.
>   In practice, however, such overwriting is too expensive for join use
>   cases. It would be much more efficient if small changes in the inputs
>   of a join caused only small changes for downstream to compute.
> - Neither discarding mode nor accumulating mode is sufficient to refine
>   materialized data.
>
> Meanwhile, [1] has kicked off a discussion on retractions in the Beam
> model. I have been collecting people's feedback, and generally speaking
> people agree that retractions are useful for some use cases.
>
> Thus I propose to combine SQL join with retractions to support
> multiple-triggering SQL join.
>
> I think SQL join is a good start for supporting retractions in Beam, with
> the following caveats:
> 1. Multiple-triggering SQL join is a useful feature.
> 2. SQL join is an opportunity for us to figure out the implementation
>    details of retractions by building them for a well-defined use case.
> 3. Supporting retractions should not cause performance regressions on
>    existing pipelines, or require changes to existing pipelines.
>
> What do you think?
>
> [1]:
> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>
> -Rui
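
For illustration, the outer-join case from the quoted message (a (left_row, null) tuple that has to be retracted once the matching right_row arrives) and the changelog-versus-compacted-table distinction can be sketched in plain Python. This is only a toy simulation, not Beam or BeamSQL code: the function names `left_outer_join_changelog` and `compact`, the `(side, key, row)` event layout, and the "+"/"-" markers are all assumptions made up for this sketch.

```python
from collections import Counter, defaultdict


def left_outer_join_changelog(events):
    """Simulate a streaming LEFT OUTER JOIN that emits a changelog.

    `events` is an iterable of (side, key, row) with side "L" or "R"
    (a hypothetical layout for this sketch). The result is a list of
    (op, (left_row, right_row)) where op is "+" (insert) or "-" (retract).
    """
    left = defaultdict(list)
    right = defaultdict(list)
    changelog = []
    for side, key, row in events:
        if side == "L":
            left[key].append(row)
            if right[key]:
                # Matches already exist: emit the joined rows directly.
                for r in right[key]:
                    changelog.append(("+", (row, r)))
            else:
                # No match yet: emit the null-padded outer-join row.
                changelog.append(("+", (row, None)))
        else:
            if not right[key]:
                # First right row for this key: the previously emitted
                # (left_row, None) rows are now wrong and must be retracted.
                for l in left[key]:
                    changelog.append(("-", (l, None)))
            right[key].append(row)
            for l in left[key]:
                changelog.append(("+", (l, row)))
    return changelog


def compact(changelog):
    """Fold a changelog into a compacted table: row -> multiplicity."""
    counts = Counter()
    for op, row in changelog:
        counts[row] += 1 if op == "+" else -1
    # Rows whose inserts and retractions cancel out disappear from the table.
    return {row: n for row, n in counts.items() if n != 0}


events = [("L", 1, "a"), ("R", 1, "x"), ("R", 1, "y")]
changelog = left_outer_join_changelog(events)
table = compact(changelog)
```

Here the changelog is what a stream view would keep producing, including the retraction of ("a", None), while `compact(changelog)` corresponds to the periodically generated table view, in which that null-padded row no longer appears.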
