There is also a discussion of side inputs (FLIP-17): https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
I would load the smaller data set as a static reference data set. Then you can just do single-source streaming of the larger data set.

On Wed, Jan 31, 2018 at 1:09 AM, Stefan Richter <s.rich...@data-artisans.com> wrote:

> Hi,
>
> if the workarounds that Xingcan and I mentioned are not options for your
> use case, then I think this might currently be the better option. But I
> would expect some better support for stream joins in the near future.
>
> Best,
> Stefan
>
> > On 31.01.2018 at 07:04, Marchant, Hayden <hayden.march...@citi.com> wrote:
> >
> > Stefan,
> >
> > So are we essentially saying that in this case, for now, I should stick
> > to the DataSet / Batch Table API?
> >
> > Thanks,
> > Hayden
> >
> > -----Original Message-----
> > From: Stefan Richter [mailto:s.rich...@data-artisans.com]
> > Sent: Tuesday, January 30, 2018 4:18 PM
> > To: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com>
> > Cc: user@flink.apache.org; Aljoscha Krettek <aljos...@apache.org>
> > Subject: Re: Joining data in Streaming
> >
> > Hi,
> >
> > as far as I know, this is not easily possible. What would be required is
> > something like a CoFlatMap function, where one input stream is blocking
> > until the second stream is fully consumed to build up the state to join
> > against. Maybe Aljoscha (in CC) can comment on future plans to support this.
> >
> > Best,
> > Stefan
> >
> >> On 30.01.2018 at 12:42, Marchant, Hayden <hayden.march...@citi.com> wrote:
> >>
> >> We have a use case where we have 2 data sets: one reasonably large
> >> data set (a few million entities), and a smaller set of data. We want to
> >> do a join between these data sets. We will be doing this join after both
> >> data sets are available. In the world of batch processing, this is pretty
> >> straightforward: we'd load both data sets into an application and execute
> >> a join operator on them through a common key. Is it possible to do such a
> >> join using the DataStream API? I would assume that I'd use the connect
> >> operator, though I'm not sure exactly how I should do the join. Do I need
> >> the 'smaller' set to be completely loaded into state before I start
> >> flowing the large set? My concern is that if I read both data sets from
> >> streaming sources, since I can't be guaranteed the order in which the
> >> data is loaded, I may lose lots of potential joined entities because
> >> their pairs might not have been read yet.
> >>
> >> Thanks,
> >> Hayden Marchant
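
To make the "static reference data set" suggestion at the top of this thread concrete, here is a minimal plain-Java sketch of the idea. It is not code from the thread and it does not use Flink itself: the class and method names (`StaticReferenceJoin`, `loadReference`, `joinOne`) and the two-column record layout are assumptions made up for illustration. In a Flink job, the lookup table would presumably be built once per parallel task in the `open()` method of a `RichFlatMapFunction`, and `joinOne` would correspond to its `flatMap` call on the larger stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: join a large stream against a small, fully loaded reference set.
class StaticReferenceJoin {

    // Build the lookup table once from the smaller data set (row[0] = key, row[1] = value).
    static Map<String, String> loadReference(List<String[]> smallSet) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : smallSet) {
            lookup.put(row[0], row[1]);
        }
        return lookup;
    }

    // Handle one record of the larger data set: emit a joined row only if its key is known.
    static void joinOne(String[] record, Map<String, String> lookup, List<String> out) {
        String match = lookup.get(record[0]);
        if (match != null) {
            out.add(record[0] + "," + record[1] + "," + match);
        }
    }

    public static void main(String[] args) {
        // The small set is complete before the first large record is processed,
        // so the arrival order of the large records cannot cause missed pairs.
        Map<String, String> lookup = loadReference(Arrays.asList(
                new String[] {"k1", "A"},
                new String[] {"k2", "B"}));

        List<String> out = new ArrayList<>();
        joinOne(new String[] {"k2", "large-1"}, lookup, out);
        joinOne(new String[] {"k7", "large-2"}, lookup, out); // no reference entry, dropped
        joinOne(new String[] {"k1", "large-3"}, lookup, out);

        for (String row : out) {
            System.out.println(row);
        }
    }
}
```

Because the reference data is loaded before any streaming record is handled, this sidesteps the ordering concern raised in the original question, at the cost of assuming the smaller set fits in memory on every task and does not change while the job runs.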