Most streaming pipelines access the datastore directly or use stateful DoFns for this. A side input of that size doesn't scale well, even on Dataflow, although users such as yourself would clearly benefit from it.
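To make the batched-RPC idea concrete, here is a minimal Python sketch of the buffering logic, independent of Beam. `fetch_assets_batch` and `ASSET_STORE` are hypothetical placeholders standing in for a real bulk-read client against a store like Bigtable:

```python
BATCH_SIZE = 3  # in practice, tune to the store's bulk-read limits

# Stub standing in for the 2 TiB+ asset table in a key/value store.
ASSET_STORE = {
    "a1": {"kind": "image"},
    "a2": {"kind": "video"},
}

def fetch_assets_batch(keys):
    # Hypothetical bulk lookup: one round trip for many keys,
    # instead of one RPC per element.
    return {k: ASSET_STORE.get(k) for k in keys}

def flush(buffer):
    # Enrich a whole buffer of articles with a single batched lookup.
    assets = fetch_assets_batch([a["asset_id"] for a in buffer])
    return [{**a, "asset": assets[a["asset_id"]]} for a in buffer]

def enrich(articles):
    buffer, out = [], []
    for article in articles:
        buffer.append(article)
        if len(buffer) >= BATCH_SIZE:
            out.extend(flush(buffer))
            buffer = []
    if buffer:  # flush the remainder
        out.extend(flush(buffer))
    return out
```

In a stateful DoFn, the `buffer` list would live in a BagState keyed by some sharding key, and the remainder flush would be driven by a timer rather than end-of-input, so that elements are not held indefinitely on a slow stream.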
In this thread [1], the user had a similar problem and I describe a stateful DoFn approach as a solution.

1: https://lists.apache.org/thread.html/rdfae2bfe895e5ee38bbd5014e8c293a313b61e9d63a3484acd4e9864%40%3Cuser.beam.apache.org%3E

On Wed, May 13, 2020 at 11:28 AM Kaymak, Tobias <[email protected]> wrote:

> Hi,
>
> First of all, thank you for the Beam Sessions webinars this month. They
> are super helpful, especially for getting people excited and on-boarded
> with Beam!
>
> We are currently trying to promote Beam with more use cases within our
> company and are tackling a problem where we have to join a stream of
> articles with asset information. The asset table has a size of 2 TiB+,
> so we think the only way to enrich the stream is to have it in a fast
> lookup store, so that the (batched) RPC pattern can be applied. (In
> Google Cloud product terms: having it in something like Bigtable or a
> similar fast and big key/value store.)
>
> Is there an alternative we could try? Maintaining that additional data
> store would add overhead we are looking to avoid. :)
>
> Best,
> Tobi
