Hi Beam devs,

I'm working on a streaming pipeline where we need to do a "most-recent"
join between two PCollections. Specifically, something like:

out = pcoll1 | beam.Map(lambda a,b: (a,b),
b=beam.pvalue.AsSingleton(pcoll2))

The goal is to join each value in pcoll1 with only the most recent value
from pcoll2. (in this case pcoll2 is much more sparse than pcoll1)
---
altay@ suggested using a global window for the side-input pcollection with
a trigger on each element. I've been trying to simulate this behavior
locally with beam.testing.TestStream but I've been running into some issues.

Specifically, the Repeatedly(AfterCount(1)) trigger seems to work
correctly, but the side input receives too many panes (even when using
discarding accumulation). I've set up a minimal demo here:
https://colab.research.google.com/drive/1K0EqcKWxa4UK3SrkLBeHs7HSynw_VfSZ?usp=sharing
In this example, I'm trying to join values from pcollection "a" with
pcollection "b". However each pane of pcollection "a" is able to "see" all
of the panes from pcollection "b" which is not what I would expect.

I am curious if anyone has advice for how to handle this type of problem or
an alternative solution for the "most-recent" join. (side note: I was able
to hack together an alternative solution that uses a custom
window/windowing strategy but it was fairly complex and I think a strategy
that uses GlobalWindows would be preferred).

Sincerely,
Harrison

Reply via email to