As an update to this idea, I think we can go a step further and support
the EMIT syntax from the "One SQL to Rule Them All" paper [1].

EMIT will allow periodic, delayed materialization of a stream. For a stream
view, it means we will add support for sinks to keep generating a changelog
table. For a view only, it means we will add support for sinks to
periodically generate a compacted table from the changelog table.
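
To make the difference between the two outputs concrete, here is a minimal
Python sketch of folding a changelog with retractions into a compacted
table. It is only an illustration: the compact() function and the
("+" / "-", row) encoding are hypothetical and are not Beam or BeamSQL APIs.
The rows mimic the outer join case, where (l1, null) is emitted first and
then retracted once the matching right row arrives.

```python
from collections import Counter


def compact(changelog):
    """Fold (op, row) changelog entries into the rows currently present.

    op is "+" for an insertion and "-" for a retraction (hypothetical encoding).
    """
    table = Counter()
    for op, row in changelog:
        if op == "+":
            table[row] += 1
        elif op == "-":
            if table[row] == 0:
                raise ValueError(f"retraction of a row that was never emitted: {row!r}")
            table[row] -= 1
        else:
            raise ValueError(f"unknown changelog op: {op!r}")
    # Counter.elements() skips rows whose count has dropped to zero.
    return list(table.elements())


# Changelog a sink would keep appending to in the stream view.
changelog = [
    ("+", ("l1", None)),   # outer join fires before the right side arrives
    ("-", ("l1", None)),   # retraction once the matching right row shows up
    ("+", ("l1", "r1")),   # refined join result
]

# Compacted table a sink would write periodically in the view-only case.
snapshot = compact(changelog)
```

The changelog keeps every intermediate change, while the compacted table
only holds the rows that survive after all retractions are applied.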

Regarding SQL, a typical query like the following should run:


*WITH joined_table AS (SELECT * FROM S1 JOIN S2)*
*SELECT XX FROM HOP(joined_table)*
*EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR*


By doing so, retractions will be much more useful for SQL in a product
scenario, because we can have a meaningful end-to-end SQL pipeline.

[1]: https://arxiv.org/pdf/1905.12133.pdf

-Rui

On Mon, Aug 12, 2019 at 11:30 PM Rui Wang <[email protected]> wrote:

> Hi Community,
>
> BeamSQL currently does not support unbounded-unbounded join with
> non-default trigger. It is because:
>
> - Discarding mode does not work for outer joins because it lacks the
> ability to retract pre-emitted values. As an example, consider a tuple
> (left_row, null) that needs to be retracted if the matching right_row
> has appeared since the last trigger fired.
> - Accumulating mode *theoretically* can support unbounded-unbounded join
> because it's supposed to always "overwrite" the previous result. However,
> in practice, such overwriting is too expensive for join use cases. It
> would be much more efficient if small changes in the join's inputs caused
> only small changes for downstream to compute.
> - Both discarding mode and accumulating mode are not sufficient to refine
> materialized data.
>
> Meanwhile, [1] has kicked off a discussion on retractions in the Beam
> model. I have been collecting people's feedback and, generally speaking,
> people agree that retractions are useful for some use cases.
>
> Thus I propose to combine SQL join with retractions to
> support multiple-triggering SQL Join.
>
> I think SQL join is a good start for supporting retractions in Beam, with
> the following considerations:
> 1. Multiple-triggering SQL join is a useful feature.
> 2. SQL join is an opportunity for us to figure out the implementation
> details of retractions by building them for a well-defined use case.
> 3. Supporting retractions should not cause performance regressions on
> existing pipelines, or require changes to existing pipelines.
>
>
> What do you think?
>
> [1]:
> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>
>
> -Rui
>
