Hi Beam devs, I'm working on a streaming pipeline where we need to do a "most-recent" join between two PCollections. Specifically, something like:
out = pcoll1 | beam.Map(lambda a,b: (a,b), b=beam.pvalue.AsSingleton(pcoll2)) The goal is to join each value in pcoll1 with only the most recent value from pcoll2. (in this case pcoll2 is much more sparse than pcoll1) --- altay@ suggested using a global window for the side-input pcollection with a trigger on each element. I've been trying to simulate this behavior locally with beam.testing.TestStream but I've been running into some issues. Specifically, the Repeatedly(AfterCount(1)) trigger seems to work correctly, but the side input receives too many panes (even when using discarding accumulation). I've set up a minimal demo here: https://colab.research.google.com/drive/1K0EqcKWxa4UK3SrkLBeHs7HSynw_VfSZ?usp=sharing In this example, I'm trying to join values from pcollection "a" with pcollection "b". However each pane of pcollection "a" is able to "see" all of the panes from pcollection "b" which is not what I would expect. I am curious if anyone has advice for how to handle this type of problem or an alternative solution for the "most-recent" join. (side note: I was able to hack together an alternative solution that uses a custom window/windowing strategy but it was fairly complex and I think a strategy that uses GlobalWindows would be preferred). Sincerely, Harrison
