Use a stateful DoFn and buffer the elements in bag state. Use a key that carries enough data to satisfy the join condition you are trying to match. For example, if you're matching on customerId, the flow would look something like:

element 1 -> ParDo(extract customer id) -> KV<customer id, element 1> -> stateful ParDo(buffer element 1 in bag state)
...
element 5 -> ParDo(extract customer id) -> KV<customer id, element 5> -> stateful ParDo(output all elements in bag)
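To make the pattern concrete, here is a minimal, runner-agnostic Python sketch of the per-key buffering that the stateful ParDo performs. This is not Beam API; `KeyedBagBuffer` and the element values are illustrative. In a real pipeline the buffer would be a BagState scoped to the key, and Beam would manage partitioning for you:

```python
from collections import defaultdict

class KeyedBagBuffer:
    """Hypothetical stand-in for per-key bag state.

    Each key (e.g. a customerId, or a composite (customerId, eventId))
    gets its own independent buffer, mirroring how a Beam stateful DoFn
    partitions state by key.
    """

    def __init__(self):
        self._bags = defaultdict(list)  # key -> buffered elements

    def add(self, key, element):
        # Analogous to bag state "add": append without reading.
        self._bags[key].append(element)

    def read_and_clear(self, key):
        # Analogous to bag state "read" followed by "clear":
        # emit everything buffered for this key and reset it.
        return self._bags.pop(key, [])

# Illustrative flow: elements keyed by customerId.
buf = KeyedBagBuffer()
buf.add("cust-1", "element 1")
buf.add("cust-2", "element 2")
buf.add("cust-1", "element 5")

# When the match condition fires for cust-1, drain its bag.
matched = buf.read_and_clear("cust-1")
```

The point of the sketch is the keying: because state is per key, elements for different customers buffer independently and process in parallel.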
If you are matching on customerId and eventId, then you would use a composite key (customerId, eventId). You can always use a single global key, but you will lose all parallelism during processing (for small pipelines this likely won't matter).

On Fri, Jun 26, 2020 at 7:29 AM Praveen K Viswanathan <harish.prav...@gmail.com> wrote:

> Hi All - I have a DoFn which generates data (a KV pair) for each element
> that it is processing. It also has to read from that KV for other elements
> based on a key, which means the KV has to retain all the data that's
> getting added to it while processing every element. I was thinking
> about the "slow-caching side input pattern" but that is more about caching
> outside the DoFn and then using it inside. It doesn't update the cache
> inside a DoFn. Please share if anyone has thoughts on how to approach this
> case.
>
> Element 1 - Add a record to a KV
> .....
> Element 5 - Use the value from the KV if there is a match on the key
>
> --
> Thanks,
> Praveen K Viswanathan