Hi,
We are evaluating various streaming platforms so we can move some of our batch
pipelines to streaming. I read the Google Dataflow paper and understand how it
handles unbounded data by materializing results at different times using
watermarks and triggers.
The example given in the paper is a sum aggregation. But what about a
left outer join use case, e.g.
INSERT INTO C
SELECT A.value, B.value
FROM A
LEFT OUTER JOIN B
ON A.key = B.key
WHERE B.value IS NULL
The output goes to another stream or table.
Records on either the A or the B stream can arrive late. How does Google
Dataflow handle this case? How are the A and B streams buffered, and when can
results be emitted to the C stream?
I can put a time bound on this, e.g. we will wait at most 24 hours for
late-arriving events.
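
To make the question concrete, here is a rough Python sketch of the semantics I
have in mind. It is not Dataflow API code: the class AntiJoinBuffer and its
methods are made-up names for illustration, the watermark is passed in by hand,
and I am assuming a matched B row never has a NULL value. A records are
buffered per key and only emitted to C once the watermark passes their event
time plus the wait bound without a matching B record having shown up.

```python
from collections import defaultdict

WAIT_SECONDS = 24 * 60 * 60  # assumed bound on how long to wait for late B records


class AntiJoinBuffer:
    """Hypothetical helper: buffers A records and emits (a_value, None)
    once the wait bound has passed with no matching B record."""

    def __init__(self, wait_seconds=WAIT_SECONDS):
        self.wait_seconds = wait_seconds
        self.pending_a = defaultdict(list)  # key -> [(deadline, a_value)]
        self.seen_b = set()  # keys seen on B (never expired in this sketch)

    def on_a(self, key, value, event_time):
        if key in self.seen_b:
            return  # already matched, so WHERE B.value IS NULL filters it out
        self.pending_a[key].append((event_time + self.wait_seconds, value))

    def on_b(self, key):
        # A match arrived: buffered A records for this key will never be emitted.
        self.seen_b.add(key)
        self.pending_a.pop(key, None)

    def advance_watermark(self, watermark):
        """Return the rows for C whose deadline is at or before the watermark."""
        emitted = []
        for key in list(self.pending_a):
            ready = [(d, v) for d, v in self.pending_a[key] if d <= watermark]
            waiting = [(d, v) for d, v in self.pending_a[key] if d > watermark]
            emitted.extend((v, None) for _, v in ready)
            if waiting:
                self.pending_a[key] = waiting
            else:
                del self.pending_a[key]
        return emitted
```

In this sketch a B record that arrives after the deadline is too late to retract
a row already written to C, and the set of B keys grows without bound; those are
exactly the parts I am unsure how a real system handles.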
Thanks.