Re: [DISCUSS] Multiple-triggering SQL Join with retractions support

Mingmin Xu Mon, 19 Aug 2019 16:02:47 -0700

+1 to supporting EMIT on the Beam side first if we cannot get it into
Calcite in the short term (see #1, #2). I'm open to any format, either the
one Rui proposed or something like the following. The tricky question is:
what is the expected behavior for a complex query with more than one GBK
(GroupByKey) operator?


EMIT  <INTERVAL '1' MINUTE> | <INTERVAL '100' ROW> [ACCUMULATE|DISCARD]
[INSERT INTO ...]
SELECT ...

#1.
https://sematext.com/opensee/m/Calcite/FR3K9JVAl32VULr6?subj=Towards+a+spec+for+robust+streaming+SQL+Part+1
#2.
https://sematext.com/opensee/m/Beam/gfKHFFDd4i1I3nZc2?subj=Towards+a+spec+for+robust+streaming+SQL+Part+2

On Mon, Aug 19, 2019 at 12:02 PM Rui Wang <ruw...@google.com> wrote:

> To update this idea, I think we can go a step further and support the
> EMIT syntax from the one-sql-to-rule-them-all paper [1].
>
> EMIT will allow periodic, delayed materialization of a stream. For a
> stream view, it means adding support to sinks so that they keep generating
> a changelog table. For a view only, it means adding support to sinks so
> that they periodically generate a compacted table from the changelog table.
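As a rough illustration of the changelog versus compacted table
distinction above, here is a minimal Python sketch. It is not Beam or
Calcite code: the ("+" / "-", row) changelog encoding and the compact()
helper are assumptions made up for this example.

```python
from collections import Counter


def compact(changelog):
    """Fold a changelog of (op, row) pairs into a compacted table.

    op is "+" for an inserted row and "-" for a retraction of a row
    that was emitted earlier (hypothetical encoding for this sketch).
    """
    table = Counter()
    for op, row in changelog:
        if op == "+":
            table[row] += 1
        else:
            table[row] -= 1
            if table[row] == 0:
                # The row has been fully retracted; drop it from the table.
                del table[row]
    # Sort by repr so rows containing None can still be ordered.
    return sorted(table.elements(), key=repr)


# A stream view would expose every entry of this changelog as it happens.
changelog = [
    ("+", ("a", None)),
    ("-", ("a", None)),
    ("+", ("a", "x")),
    ("+", ("b", None)),
]

# A table view only sees the compacted result at each emit point.
snapshot = compact(changelog)
```

Under these assumptions, a periodic EMIT would amount to calling
something like compact() on the changelog accumulated so far each time
the delay elapses.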
>
> Regarding SQL, a typical query like the following should run:
>
>
> *WITH joined_table AS (SELECT * FROM S1 JOIN S2)*
> *SELECT XX FROM HOP(joined_table)*
> *EMIT [STREAM] AFTER DELAY INTERVAL '1' HOUR*
>
>
> By doing so, retractions become much more useful for SQL from a product
> perspective, because we can have a meaningful end-to-end SQL pipeline.
>
> [1]: https://arxiv.org/pdf/1905.12133.pdf
>
> -Rui
>
> On Mon, Aug 12, 2019 at 11:30 PM Rui Wang <ruw...@google.com> wrote:
>
>> Hi Community,
>>
>> BeamSQL currently does not support unbounded-unbounded joins with
>> non-default triggers. This is because:
>>
>> - Discarding mode does not work for outer joins because it lacks the
>> ability to retract previously emitted values. For example, a tuple
>> (left_row, null) needs to be retracted if the matching right_row has
>> appeared since the last trigger fired.
>> - Accumulating mode *theoretically* can support unbounded-unbounded joins
>> because it is supposed to always "overwrite" the previous result. In
>> practice, however, such overwriting is too expensive for join use cases.
>> It would be much more efficient if small changes in the join inputs
>> caused only small changes for downstream to compute.
>> - Neither discarding mode nor accumulating mode is sufficient to refine
>> materialized data.
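The outer join case in the first bullet can be sketched in a few lines
of Python. This is only an illustration under assumptions, not how
BeamSQL implements or plans to implement joins: the event tuples, the
"+" / "-" change markers, and the function name
incremental_left_outer_join are all hypothetical.

```python
from collections import defaultdict


def incremental_left_outer_join(events):
    """Consume (side, key, value) events and yield (op, row) changes.

    side is "L" or "R". op is "+" for a newly emitted joined row and
    "-" for a retraction of a row that was emitted earlier.
    """
    left = defaultdict(list)
    right = defaultdict(list)
    for side, key, value in events:
        if side == "L":
            left[key].append(value)
            if right[key]:
                # Matches already exist: emit only the new joined rows.
                for r in right[key]:
                    yield ("+", (value, r))
            else:
                # No match yet: emit the null-padded outer join row.
                yield ("+", (value, None))
        else:
            if not right[key]:
                # First match for this key: the (left_row, None) rows
                # emitted earlier are now wrong and must be retracted.
                for l in left[key]:
                    yield ("-", (l, None))
            right[key].append(value)
            for l in left[key]:
                yield ("+", (l, value))


events = [
    ("L", "k1", "a"),
    ("R", "k1", "x"),
    ("R", "k1", "y"),
    ("L", "k2", "b"),
]
changes = list(incremental_left_outer_join(events))
```

Each input event produces only a handful of output changes, which is
the "small changes in, small changes out" property the second bullet
asks for. Without the "-" entries, a downstream consumer would have no
way to learn that ("a", None) is no longer part of the result.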
>>
>> Meanwhile, [1] has kicked off a discussion on retractions in the Beam
>> model. I have been collecting people's feedback and, generally speaking,
>> people agree that retractions are useful for some use cases.
>>
>> Thus I propose to combine SQL join with retractions to
>> support multiple-triggering SQL Join.
>>
>> I think SQL join is a good starting point for supporting retractions in
>> Beam, with the following considerations:
>> 1. multiple-triggering SQL Join is a useful feature.
>> 2. SQL join is an opportunity for us to figure out implementation details
>> of retraction by building it for a well defined use case.
>> 3. Supporting retraction should not cause performance regression on
>> existing pipelines, or require changes on existing pipelines.
>>
>>
>> What do you think?
>>
>> [1]:
>> https://lists.apache.org/thread.html/bb2d40b1bea8b21fbbb7caf599fabba823da357768ceca8ea2363789@%3Cdev.beam.apache.org%3E
>>
>>
>> -Rui
>>
>

-- 
----
Mingmin
