We have a use case with two data sets: one reasonably large (a few million
entities) and one smaller. We want to join them on a common key, and the join
will run only after both data sets are available.

In the world of batch processing this is straightforward: we'd load both data
sets into an application and apply a join operator on the common key.

Is it possible to do such a join using the DataStream API? I assume I'd use the
connect operator, but I'm not sure exactly how to do the join. Does the smaller
set need to be completely loaded into state before I start flowing the large
set? My concern is that if I read both data sets from streaming sources, the
order in which records arrive is not guaranteed, so I may miss many joined
entities because their counterparts have not been read yet.
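
To make the ordering concern concrete, here is a minimal sketch in plain Java
(deliberately not Flink code) of the kind of two-sided buffering I imagine would
be needed: each side stores its records by key and probes the other side's
buffer on arrival, so a pair is emitted whichever record shows up first. The
class and method names (SymmetricJoinSketch, onLarge, onSmall) are hypothetical
and only for illustration. In Flink I would guess the two maps correspond to
keyed state inside a co-process function on the connected streams, but that
mapping is my assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: an in-memory symmetric hash join.
// Both inputs are buffered by key, so arrival order does not matter.
class SymmetricJoinSketch {
    // Records seen so far from each side, grouped by join key.
    private final Map<String, List<String>> largeByKey = new HashMap<>();
    private final Map<String, List<String>> smallByKey = new HashMap<>();
    // Joined output, in the order the pairs were completed.
    private final List<String> joined = new ArrayList<>();

    // Called for each record of the large data set.
    void onLarge(String key, String value) {
        largeByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        // Probe the other side: emit a pair for every buffered match.
        for (String s : smallByKey.getOrDefault(key, Collections.emptyList())) {
            joined.add(key + ":" + value + "|" + s);
        }
    }

    // Called for each record of the small data set.
    void onSmall(String key, String value) {
        smallByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String l : largeByKey.getOrDefault(key, Collections.emptyList())) {
            joined.add(key + ":" + l + "|" + value);
        }
    }

    List<String> results() {
        return joined;
    }
}
```

The obvious cost is that both sides stay buffered indefinitely, which is why I
am asking whether loading only the smaller set into state first is the better
approach.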


Thanks,
Hayden Marchant

