Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Niel Markwick via dev
Regarding ordering; anything that requires inputs to be in a specific order in Beam will be problematic due the nature of parallel processing - you will always get race conditions. Assuming you are still intending to Flatten the bigQuery and PubSub PCollections, using Wait(on) before flattening t

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Sahil Modak via dev
The number of keys/data in BQ would not be constant and grow with time. A rough estimate would be around 300k keys with an average size of 5kb per key. Both the count of the keys and the size of the key would be feature dependent (based on the upstream pipelines) and we won't have control over thi

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Kenneth Knowles
I'm also curious how much you depend on order to get the state contents right. The ordering of the side input will be arbitrary, and even the streaming input can have plenty of out of order messages. So I want to think about what are the data dependencies that result in the requirement of order. Or

Re: Dependabot questions

2023-02-28 Thread Danny McCormick via dev
AFAIK Dependabot doesn't have a great replacement for this. I'm not sure why the dependency reports stopped, but we could probably try to fix them - looks like they stopped working in October - https://lists.apache.org/list?dev@beam.apache.org:2021-10:dependency%20report. We still have the job whic

Beam High Priority Issue Report (36)

2023-02-28 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/24776 [Bug]: Race conditi

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Jan Lukavský
In the case that the data is too large for side input, you could do the same by reassigning timestamps of the BQ input to BoundedWindow.TIMESTAMP_MIN_VALUE (you would have to do that in a stateful DoFn with a timer having outputTimestamp set to TIMESTAMP_MIN_VALUE to hold watermark, or using sp