I've been exploring the pattern that Luke's thread refers to (I'm the one who asked the question in that thread), though I've been pulled away before getting around to implementing it. Just to chime in: another pattern I've seen recommended quite frequently for enrichment against large data sets (specifically those that don't fit into memory) is the "External Service pattern", which is described in more detail in this related Dataflow blog post[1].
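
For anyone who finds this thread later, here's roughly what that pattern tends to look like in Beam's Java SDK. This is only a sketch: Article, AssetInfo, and AssetClient are hypothetical placeholders for your own types and whatever client library your store (Bigtable, etc.) provides; only the DoFn lifecycle annotations are actual Beam API.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Article, AssetInfo and AssetClient are hypothetical placeholders for your
// own types and your store's client library; the lifecycle annotations
// (@Setup, @ProcessElement, @Teardown) are the actual Beam Java API.
public class EnrichWithAssetsFn extends DoFn<Article, KV<Article, AssetInfo>> {

  private transient AssetClient client;

  @Setup
  public void setup() {
    // One client per DoFn instance, reused across bundles.
    client = AssetClient.connect();
  }

  @ProcessElement
  public void processElement(@Element Article article,
                             OutputReceiver<KV<Article, AssetInfo>> out) {
    // Synchronous point lookup against the external store.
    AssetInfo info = client.lookup(article.getAssetId());
    if (info != null) {
      out.output(KV.of(article, info));
    }
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}

You'd then apply it with something like articles.apply(ParDo.of(new EnrichWithAssetsFn())).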
Obviously whether this is useful depends on a handful of things (e.g. is your data in an easily queryable store, can your scenario tolerate the latency these lookups introduce, does the data change frequently, etc.). Just thought I'd mention it since it has come up often in my exploration of a similar problem.

1: https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-2

On 2020/05/13 19:41:39, Luke Cwik <[email protected]> wrote:
> Most streaming pipelines have been accessing the datastore directly or
> using stateful DoFns for this. Using a side input doesn't scale well even
> for Dataflow even though users such as yourself would benefit from it.
>
> In this thread[1], the user has a similar problem and I describe the
> solution for a stateful DoFn approach.
>
> 1:
> https://lists.apache.org/thread.html/rdfae2bfe895e5ee38bbd5014e8c293a313b61e9d63a3484acd4e9864%40%3Cuser.beam.apache.org%3E
>
>
> On Wed, May 13, 2020 at 11:28 AM Kaymak, Tobias <[email protected]>
> wrote:
>
> > Hi,
> >
> > First of all thank you for the Webinar Beam Sessions this month. They are
> > super helpful especially for getting people excited and on-boarded with
> > Beam!
> >
> > We are currently trying to promote Beam with more use cases within our
> > company and tackling a problem, where we have to join a stream of articles
> > with asset-information. The asset information as a table has a size of 2
> > TiB+ and therefore, we think the only way to enrich the stream would be by
> > having it in a fast lookup store, so that the (batched) RPC pattern could
> > be applied. (So in product terms of Google Cloud having it in something
> > like BigTable or a similar fast and big key/value store.)
> >
> > Is there an alternative that we could try? Maintaining that additional
> > data store would add overhead we are looking to avoid. :)
> >
> > Best,
> > Tobi
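
PS: since the "(batched) RPC pattern" Tobi mentions above keeps coming up, here is a rough sketch of how that variant is commonly wired up with Beam's GroupIntoBatches transform. GroupIntoBatches is real Beam Java API (it needs a runner with state/timer support); Article, AssetInfo, AssetClient.multiGet, and getAssetId are hypothetical stand-ins for your own types and store client, and the shard count and batch size are arbitrary.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class BatchedAssetLookup {

  // Article, AssetInfo and AssetClient are hypothetical placeholders.
  public static PCollection<KV<Article, AssetInfo>> enrich(
      PCollection<Article> articles) {
    return articles
        // Spread elements across a few shard keys so batches fill quickly
        // while still parallelizing across workers.
        .apply("Shard", ParDo.of(new DoFn<Article, KV<Integer, Article>>() {
          @ProcessElement
          public void process(@Element Article a,
                              OutputReceiver<KV<Integer, Article>> out) {
            out.output(KV.of(ThreadLocalRandom.current().nextInt(10), a));
          }
        }))
        // Buffer up to 100 elements per key, then emit them as one batch.
        .apply("Batch", GroupIntoBatches.<Integer, Article>ofSize(100))
        // One multi-get round trip per batch instead of one RPC per element.
        .apply("Lookup", ParDo.of(
            new DoFn<KV<Integer, Iterable<Article>>, KV<Article, AssetInfo>>() {
              @ProcessElement
              public void process(@Element KV<Integer, Iterable<Article>> batch,
                                  OutputReceiver<KV<Article, AssetInfo>> out) {
                List<Article> items = new ArrayList<>();
                batch.getValue().forEach(items::add);
                List<String> keys = new ArrayList<>();
                for (Article a : items) {
                  keys.add(a.getAssetId()); // hypothetical accessor
                }
                // Hypothetical bulk lookup against the external store.
                Map<String, AssetInfo> results = AssetClient.multiGet(keys);
                for (Article a : items) {
                  AssetInfo info = results.get(a.getAssetId());
                  if (info != null) {
                    out.output(KV.of(a, info));
                  }
                }
              }
            }));
  }
}

The shard key is just there to give GroupIntoBatches something to key on; the trade-off is that more shards means more parallelism but smaller (slower-to-fill) batches.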
