Re: Joining data in Streaming

Xingcan Cui Tue, 30 Jan 2018 19:57:44 -0800

Hi Hayden,

A full-history join on two streams is not natively supported yet.

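As a workaround, you could implement a CoProcessFunction and cache the
records from both sides in state until the smaller stream has been fully
cached. While both sides are still being cached, each incoming record is
joined against whatever has already been cached from the other side. Once
the smaller stream is complete, you can safely clear the cache for the
"larger stream", since its cached records should already have produced
complete results, and switch to a nested-loop join (i.e., whenever a new
record arrives, join it with the fully cached set).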
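
To make the idea concrete, here is a minimal sketch in plain Java. It is
only a simulation of the logic, not Flink code: the two HashMaps stand in
for keyed state, onSmall/onLarge stand in for the processElement1 and
processElement2 methods of a CoProcessFunction, and the class name,
method names, and markSmallSideComplete() are all hypothetical. How you
detect that the smaller stream is finished depends on your sources, so
the sketch simply assumes something calls that method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the caching join described above (not Flink API). */
final class CachingJoinSketch {
    // Stand-ins for keyed state: join key -> values cached for that side.
    private final Map<String, List<String>> smallCache = new HashMap<>();
    private final Map<String, List<String>> largeCache = new HashMap<>();
    // Assumption: some external signal tells us the smaller stream is done.
    private boolean smallSideComplete = false;
    private final List<String> output = new ArrayList<>();

    /** Record from the smaller stream: cache it and join with cached large records. */
    void onSmall(String key, String value) {
        if (smallSideComplete) {
            throw new IllegalStateException("smaller stream was marked complete");
        }
        smallCache.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String large : largeCache.getOrDefault(key, Collections.emptyList())) {
            output.add(key + ":" + value + "+" + large);
        }
    }

    /** Record from the larger stream: join with the small cache, cache only if needed. */
    void onLarge(String key, String value) {
        for (String small : smallCache.getOrDefault(key, Collections.emptyList())) {
            output.add(key + ":" + small + "+" + value);
        }
        if (!smallSideComplete) {
            // Small records may still arrive later, so keep this one around.
            largeCache.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical hook: the smaller stream is fully cached, so drop the large cache. */
    void markSmallSideComplete() {
        smallSideComplete = true;
        largeCache.clear();
    }

    List<String> results() {
        return output;
    }

    int cachedLargeKeys() {
        return largeCache.size();
    }

    public static void main(String[] args) {
        CachingJoinSketch join = new CachingJoinSketch();
        join.onLarge("k1", "L1");      // no match yet, cached
        join.onSmall("k1", "S1");      // joins with cached L1
        join.onSmall("k2", "S2");      // no match yet, cached
        join.markSmallSideComplete();  // large cache is cleared here
        join.onLarge("k2", "L2");      // joins with S2, not cached
        join.onLarge("k1", "L3");      // joins with S1, not cached
        System.out.println(join.results());
        // prints [k1:S1+L1, k2:S2+L2, k1:S1+L3]
    }
}
```

In a real job the caches would live in Flink managed state rather than in
HashMaps, so that they are checkpointed and scoped per key. The smaller
side still has to fit in state for this to work.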

Hope this helps.

Best,
Xingcan

On Tue, Jan 30, 2018 at 7:42 PM, Marchant, Hayden <hayden.march...@citi.com>
wrote:

> We have a use case where we have 2 data sets - one reasonably large data
> set (a few million entities), and a smaller set of data. We want to do a
> join between these data sets. We will be doing this join after both data
> sets are available.  In the world of batch processing, this is pretty
> straightforward - we'd load both data sets into an application and execute
> a join operator on them through a common key.   Is it possible to do such a
> join using the DataStream API? I would assume that I'd use the connect
> operator, though I'm not sure exactly how I should do the join - do I need
> one 'smaller' set to be completely loaded into state before I start flowing
> the large set? My concern is that if I read both data sets from streaming
> sources, since the order in which the data is loaded is not guaranteed,
> I may lose many potential joined entities because their pairs might not
> have been read yet.
>
>
> Thanks,
> Hayden Marchant
>
>
>
