Hi,

We are evaluating various streaming platforms to move some of our batch
pipelines to streaming pipelines.  I read the Google Dataflow paper and
understand how it handles unbounded data by materializing results at
different times using watermarks and triggers.

The example given in the paper is a sum aggregation.  But what about a
left outer join use case, e.g.

    INSERT INTO C
    SELECT A.value, B.value
        FROM A
        LEFT OUTER JOIN B
        ON A.key = B.key
    WHERE B.value IS NULL

The output goes to another stream or table.

Records on either the A or the B stream can arrive late.  How will Google
Dataflow handle this case?  How do you buffer the A or B stream, and when
can you emit to the C stream?

I can put a time bound on this, e.g. we will wait at most 24 hours for
late-arriving events.
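
To make the question concrete, here is a rough sketch in plain Python of
the buffering I have in mind.  It is not Dataflow API code; the class
name AntiJoinBuffer, its methods, and the WAIT constant are all made up
for illustration.  It assumes that any B record with a matching key
counts as a match, and that the runner tells me when the watermark
advances.

```python
from dataclasses import dataclass, field

WAIT = 24 * 60 * 60  # assumed lateness bound: 24 hours, in seconds


@dataclass
class AntiJoinBuffer:
    """Hypothetical buffer for the anti-join: emit A rows with no match in B."""
    wait: int = WAIT
    # key -> list of (event_time, value) for A records still waiting for a B
    pending: dict = field(default_factory=dict)
    # keys already seen on B (grows without bound in this sketch)
    seen_b: set = field(default_factory=set)

    def on_a(self, key, value, event_time):
        # Buffer the A record unless a matching B has already arrived.
        if key not in self.seen_b:
            self.pending.setdefault(key, []).append((event_time, value))

    def on_b(self, key):
        # A match arrived: buffered A records for this key must not be emitted.
        self.seen_b.add(key)
        self.pending.pop(key, None)

    def on_watermark(self, watermark):
        # Emit (A.value, None) for each A record whose wait has expired.
        out = []
        for key in list(self.pending):
            keep = []
            for event_time, value in self.pending[key]:
                if event_time + self.wait <= watermark:
                    out.append((value, None))
                else:
                    keep.append((event_time, value))
            if keep:
                self.pending[key] = keep
            else:
                del self.pending[key]
        return out


buf = AntiJoinBuffer()
buf.on_a("k1", "a1", event_time=0)
buf.on_a("k2", "a2", event_time=0)
buf.on_b("k1")                   # k1 is matched, so it never goes to C
print(buf.on_watermark(WAIT))    # only the unmatched k2 record is emitted
```

In this sketch nothing is emitted to C until the watermark passes the A
record's event time plus 24 hours, which is exactly the latency I am
unsure about.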

Thanks.
