Hi Cristian,

I didn't try that, so I'm not 100% sure it would work, but you could probably try using a custom timestamp policy [1] for the KafkaIO, which would advance the watermark to BoundedWindow.TIMESTAMP_MAX_VALUE once you know you have reached the head of the state topic. That would probably require reading the end offsets before running the Pipeline. This should effectively turn the source into a bounded one.
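An untested sketch of what I mean (all names here are made up; it assumes a single state topic with String keys and values, and that endOffsets, a partition -> end-offset map, was fetched with a plain KafkaConsumer via consumer.endOffsets(...) before submitting the Pipeline):

import java.util.Map;
import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.io.kafka.TimestampPolicyFactory;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.kafka.common.TopicPartition;
import org.joda.time.Instant;

/** Moves the watermark to TIMESTAMP_MAX_VALUE once a partition's pre-read end offset is reached. */
class EndOfTopicTimestampPolicyFactory implements TimestampPolicyFactory<String, String> {

  // Partition -> end offset of the state topic, read before the pipeline starts.
  private final Map<Integer, Long> endOffsets;

  EndOfTopicTimestampPolicyFactory(Map<Integer, Long> endOffsets) {
    this.endOffsets = endOffsets;
  }

  @Override
  public TimestampPolicy<String, String> createTimestampPolicy(
      TopicPartition tp, Optional<Instant> previousWatermark) {

    long endOffset = endOffsets.getOrDefault(tp.partition(), 0L);

    return new TimestampPolicy<String, String>() {
      // An empty partition is already "done"; otherwise start from the previous/minimum watermark.
      private Instant watermark =
          endOffset == 0L
              ? BoundedWindow.TIMESTAMP_MAX_VALUE
              : previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);

      @Override
      public Instant getTimestampForRecord(
          PartitionContext ctx, KafkaRecord<String, String> record) {
        Instant ts = new Instant(record.getTimestamp());
        if (record.getOffset() >= endOffset - 1) {
          // Last pre-existing record of this partition seen: declare it finished.
          watermark = BoundedWindow.TIMESTAMP_MAX_VALUE;
        } else {
          watermark = ts;
        }
        return ts;
      }

      @Override
      public Instant getWatermark(PartitionContext ctx) {
        return watermark;
      }
    };
  }
}

You would then pass it to the read of the state topic with .withTimestampPolicyFactory(new EndOfTopicTimestampPolicyFactory(endOffsets)); once every partition's watermark is at TIMESTAMP_MAX_VALUE, the global window of that collection can close and Wait.on() has a chance to fire.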

 Jan

[1] https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withTimestampPolicyFactory-org.apache.beam.sdk.io.kafka.TimestampPolicyFactory-

On 7/22/21 2:14 PM, Cristian Constantinescu wrote:
Hi All,

I would like to know if there's a suggested pattern for the below scenario. TL;DR: reading state from Kafka.

I have a scenario where I'm listening to a Kafka topic and generating a unique id based on the properties of the incoming item. Then I output the result to another Kafka topic. The tricky part is that when the pipeline is restarted, I have to read the output topic and rebuild the id state, so that if I see an item that was already given an id, I give the same id back instead of generating a new one.

For example:
Input topic -> Output topic
(A1, B1, C1) -> (A1, B1, C1, Random string "ID 1")
(A1, B1, C2) -> (A1, B1, C2, Random string "ID 2")
pipeline is restarted
(A3, B3, C3) -> (A3, B3, C3, Random string "ID 3")
(A1, B1, C1) -> (A1, B1, C1, Random string "ID 1") <-- because we've already seen (A1, B1, C1) before

I can't really use any type of window except the global window, as I need to join against all the items of the output topic (the one with the already generated ids).

Right now, I flatten both the input and output topics and use a trigger on the global window, AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10)), then group by the properties (A, B, C). Once that is done, I look through the grouped rows and check whether any of them already has an id. If yes, all the other rows get that id and the id is saved in the ParDo's state for future messages. If not, I generate a new id.
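Roughly, the relevant part looks like this (simplified, with made-up names; both topics are already decoded into a common Record type that carries the A/B/C properties):

import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// inputRecords / outputRecords are the two KafkaIO reads, already mapped to Record.
PCollection<Record> combined =
    PCollectionList.of(inputRecords).and(outputRecords)
        .apply(Flatten.pCollections())
        .apply(
            Window.<Record>into(new GlobalWindows())
                .triggering(
                    Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(10))))
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());

// Key by the (A, B, C) properties and group.
PCollection<KV<String, Iterable<Record>>> grouped =
    combined
        .apply(
            WithKeys.<String, Record>of(r -> r.getA() + "|" + r.getB() + "|" + r.getC())
                .withKeyType(TypeDescriptors.strings()))
        .apply(GroupByKey.<String, Record>create());

// AssignIdsFn (made-up name) is the stateful DoFn that reuses an id found in the
// group (or in its state), otherwise generates a new one.
grouped.apply(ParDo.of(new AssignIdsFn()));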

My solution seems to work. Kind of...

This puts a delay of 10s on all the incoming messages, which I'd prefer to avoid. I would like to read the output topic at the start of the pipeline, build the state, then start processing the input topic. Since the output topic will be stale until I start processing the input topic again, it is effectively a bounded collection. Unfortunately, because it comes from KafkaIO, it is still considered an unbounded source, which mainly means that Wait.on() applied to this collection waits forever. (Note: I've read the notes in the documentation [1] but either do not understand them or didn't take the appropriate steps for Wait.on() to trigger properly.)
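In other words, what I was hoping for is something like this (same made-up names as above; outputRecords is the KafkaIO read of the output topic, inputRecords the read of the input topic):

import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

// Only start processing the input topic once the output topic has been fully
// consumed and the id state rebuilt.
PCollection<Record> gated = inputRecords.apply("WaitForState", Wait.on(outputRecords));

// With an unbounded KafkaIO read in the global window, the watermark of
// outputRecords never reaches the end of the window, so this waits forever.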

I have also tried to window the output topic into session windows with a one-second gap. Basically, if I don't get any item for one second, it means I have finished reading the output topic and can start processing the input topic. Unfortunately, Wait.on() doesn't work with session windows.
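That attempt looked roughly like this (same made-up names):

import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Idea: one second of silence on the output topic means we've caught up.
PCollection<Record> sessioned =
    outputRecords.apply(
        Window.into(Sessions.withGapDuration(Duration.standardSeconds(1))));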

Furthermore, I don't think side inputs work for this problem. First, I'm not sure how to create the side input from an unbounded source. Second, the side input would need to be updated whenever a new id is generated.

I would appreciate any thoughts or ideas to elegantly solve this problem.

Thanks,
Cristian

[1] https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/Wait.html
