Re: [DISCUSS] Multiple-triggering SQL Join with retractions support

Rui Wang Mon, 19 Aug 2019 16:41:11 -0700

Hi Mingmin,

Thanks for adding "INSERT INTO" (which I missed in the example).


I am not sure if I understand the question:

1. Multiple GBKs with retractions are addressed in [1].
2. In terms of SQL and its view, the output is defined by the last GBK.

[1]:
https://docs.google.com/document/d/14WRfxwk_iLUHGPty3C6ZenddPsp_d6jhmx0vuafXqmE/edit?usp=sharing
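
As a rough illustration of the two points above, here is a plain-Python sketch
(not Beam code, and not the design from [1]). It assumes a changelog of
(sign, row) records, where +1 is an insertion and -1 is a retraction; all
function names are hypothetical. A left outer join feeds a second grouping, and
only the changelog of that last grouping is materialized as the result.

```python
from collections import Counter, defaultdict

# Assumption: a changelog record is (sign, row); +1 inserts, -1 retracts.
# Nothing here is a Beam API; the names are made up for illustration.

def left_outer_join_changelog(events):
    """Emit a changelog for a left outer join over (side, key, value) events."""
    lefts, rights = defaultdict(list), defaultdict(list)
    out = []
    for side, key, value in events:
        if side == "L":
            lefts[key].append(value)
            if rights[key]:
                out += [(+1, (key, value, r)) for r in rights[key]]
            else:
                # No match yet: emit the null-padded row speculatively.
                out.append((+1, (key, value, None)))
        else:
            if not rights[key]:
                # First match for this key: retract the null-padded rows.
                out += [(-1, (key, l, None)) for l in lefts[key]]
            rights[key].append(value)
            out += [(+1, (key, l, value)) for l in lefts[key]]
    return out

def count_by_match_changelog(changelog):
    """Second grouping: count joined rows by whether the right side matched."""
    counts = Counter()
    out = []
    for sign, (_key, _left, right) in changelog:
        group = "matched" if right is not None else "unmatched"
        old = counts[group]
        counts[group] += sign
        if old:
            out.append((-1, (group, old)))  # retract the previous count
        if counts[group]:
            out.append((+1, (group, counts[group])))
    return out

def materialize(changelog):
    """Compact a changelog into the set of rows that are currently present."""
    table = Counter()
    for sign, row in changelog:
        table[row] += sign
    return {row for row, n in table.items() if n > 0}

events = [("L", "k1", "a"), ("L", "k2", "b"), ("R", "k1", "x")]
joined = left_outer_join_changelog(events)
final_view = materialize(count_by_match_changelog(joined))
```

In this sketch the retraction of (k1, a, None) flows through the second
grouping as a retraction of its own previous count, so the final view reflects
only what the last grouping emitted.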


-Rui

On Mon, Aug 19, 2019 at 4:02 PM Mingmin Xu <[email protected]> wrote:

> +1 to supporting EMIT on the Beam side first if we cannot get it into Calcite
> in the short term (see #1, #2). I'm open to any format, the one above or
> something like the one below. The tricky question is: what is the expected
> behavior for a complex query with more than one GBK operator?
>
> EMIT  <INTERVAL '1' MINUTE> | <INTERVAL '100' ROW> [ACCUMULATE|DISCARD]
> [INSERT INTO ...]
> SELECT ...
>
> #1.
> https://sematext.com/opensee/m/Calcite/FR3K9JVAl32VULr6?subj=Towards+a+spec+for+robust+streaming+SQL+Part+1
> #2
> https://sematext.com/opensee/m/Beam/gfKHFFDd4i1I3nZc2?subj=Towards+a+spec+for+robust+streaming+SQL+Part+2
>
> On Mon, Aug 19, 2019 at 12:02 PM Rui Wang <[email protected]> wrote:
>
>> To update this idea, I think we can go a step further and support the EMIT
>> syntax from the one-sql-to-rule-them-all paper [1].
>>
>> EMIT will allow periodic, delayed stream materialization. For the stream view,
>> it means we will add support to sinks to keep generating a changelog table.
>> For view only, it means we will add support to sinks to periodically generate
>> a compacted table from the changelog table.
>>
>> Regarding SQL, a typical query like the following should run:
>>
>>
>> *WITH joined_table AS (SELECT * FROM S1 JOIN S2)*
>> *SELECT XX FROM HOP(joined_table)*
>> *EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR*
>>
>>
>> By doing so, retractions become much more useful for SQL from a product
>> perspective, since we can have a meaningful end-to-end SQL pipeline.
>>
>> [1]: https://arxiv.org/pdf/1905.12133.pdf
>>
>> -Rui
>>
>> On Mon, Aug 12, 2019 at 11:30 PM Rui Wang <[email protected]> wrote:
>>
>>> Hi Community,
>>>
>>> BeamSQL currently does not support unbounded-unbounded joins with
>>> non-default triggers. This is because:
>>>
>>> - Discarding mode does not work for outer joins because it lacks the
>>> ability to retract previously emitted values. For example, a tuple
>>> (left_row, null) needs to be retracted if the matching right_row appears
>>> after the last trigger firing.
>>> - Accumulating mode *theoretically* can support unbounded-unbounded
>>> joins because it is supposed to always "overwrite" the previous result.
>>> In practice, however, such overwriting is too expensive for join use
>>> cases. It would be much more efficient if small changes in the join's
>>> inputs caused only small changes for downstream to compute.
>>> - Neither discarding mode nor accumulating mode is sufficient to
>>> refine materialized data.
>>>
>>> Meanwhile, [1] has kicked off a discussion on retractions in the Beam
>>> model. I have been collecting people's feedback and, generally speaking,
>>> people agree that retractions are useful for some use cases.
>>>
>>> Thus I propose to combine SQL join with retractions to
>>> support multiple-triggering SQL Join.
>>>
>>> I think SQL join is a good start for supporting retractions in Beam, with
>>> the following considerations:
>>> 1. Multiple-triggering SQL join is a useful feature.
>>> 2. SQL join is an opportunity for us to figure out the implementation
>>> details of retractions by building them for a well-defined use case.
>>> 3. Supporting retractions should not cause performance regressions in
>>> existing pipelines, or require changes to existing pipelines.
>>>
>>>
>>> What do you think?
>>>
>>> [1]:
>>> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> -Rui
>>>
>>
>
> --
> ----
> Mingmin
>
