RE: Joining data in Streaming

Marchant, Hayden Tue, 30 Jan 2018 22:05:27 -0800

Stefan,

So are we essentially saying that in this case, for now, I should stick to 
the DataSet / batch Table API?


Thanks,
Hayden

-----Original Message-----
From: Stefan Richter [mailto:s.rich...@data-artisans.com] 
Sent: Tuesday, January 30, 2018 4:18 PM
To: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>
Cc: user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
Subject: Re: Joining data in Streaming

Hi,

As far as I know, this is not easily possible. What would be required is 
something like a CoFlatMap function in which one input stream blocks until 
the second stream is fully consumed, so that the state to join against is 
built up first. Maybe Aljoscha (in CC) can comment on future plans to 
support this.
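
To make the idea concrete, here is a minimal, framework-free sketch of such a
two-input join. It is not Flink API: the class BufferingJoin and its methods
are hypothetical names chosen for illustration, and the explicit
on_small_complete() call assumes the operator can somehow learn that the
smaller input is finished, which is exactly the part that is not easily
available in the DataStream API.

```python
class BufferingJoin:
    """Inner join of two keyed inputs that holds back the large input
    until the small input has been fully consumed (hypothetical sketch)."""

    def __init__(self):
        self.small_by_key = {}    # build side: key -> list of small-side values
        self.pending_large = []   # large-side records that arrived too early
        self.small_done = False   # set once the small input is complete

    def on_small(self, key, value):
        """Store a small-side record as join state; emits nothing."""
        if self.small_done:
            raise RuntimeError("small input was already marked complete")
        self.small_by_key.setdefault(key, []).append(value)
        return []

    def on_small_complete(self):
        """Assumed end-of-input signal: flush large records buffered so far."""
        self.small_done = True
        out = []
        for key, value in self.pending_large:
            out.extend(self._probe(key, value))
        self.pending_large.clear()
        return out

    def on_large(self, key, value):
        """Buffer the record while the build side is incomplete, else join it."""
        if not self.small_done:
            self.pending_large.append((key, value))
            return []
        return self._probe(key, value)

    def _probe(self, key, value):
        # Inner join: large-side records without a matching key are dropped.
        return [(key, small, value) for small in self.small_by_key.get(key, [])]
```

With this shape, a large-side record that arrives before its partner is
buffered rather than lost, and it is emitted when the small input completes.
The cost is that the buffer can grow to the size of the large input if the
small input finishes late.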

Best,
Stefan

> On 30.01.2018 at 12:42, Marchant, Hayden <hayden.march...@citi.com> wrote:
> 
> We have a use case where we have 2 data sets: one reasonably large data set 
> (a few million entities) and a smaller set of data. We want to do a join 
> between these data sets. We will be doing this join after both data sets are 
> available. In the world of batch processing, this is pretty straightforward: 
> we'd load both data sets into an application and execute a join operator on 
> them through a common key. Is it possible to do such a join using the 
> DataStream API? I would assume that I'd use the connect operator, though I'm 
> not sure exactly how I should do the join. Do I need the 'smaller' set to be 
> completely loaded into state before I start flowing the large set? My concern 
> is that if I read both data sets from streaming sources, since I can't be 
> guaranteed the order in which the data is loaded, I may lose lots of potential 
> joined entities because their pairs might not have been read yet.
> 
> 
> Thanks,
> Hayden Marchant
> 
>
