Hi,

We have a Flink job were we are trying to window join two datastreams 
originating from two different Kafka topics, where one topic contains a lot 
more data per time instance than the other one.
We use event time processing, and this all works fine when running our pipeline 
live, i.e. data is consumed and processed as soon as it is ingested in Kafka.

The problem though occurs in the scenario when we are replaying with data 
stored in Kafka, then the watermarks of the “larger-stream” are lagging behind 
the “smaller-stream” since this stream has less data per time unit and then is 
advancing faster.
This leads to a large state at the join operation since data from the 
“smaller-stream” needs to be kept until the corresponding watermarks from the 
“larger-stream” have passed.
To avoid a very large state at the join operator, we have tried to increase the 
parallelism for the consumer of the “larger-stream” to make this keep up with 
the “smaller stream”, this decreases the size of the state to some extent. This 
seems though like a ugly way to get around the problem and will not work if the 
sizes of the two Kafka topics are changing over time.

Is there any way we can synchronize the reading of the Kafka sources based on 
the watermarks we have in the two streams, i.e. to pause the reading of the 
“smaller-topic” until the “larger-stream” has caught up? Any other ideas how to 
handle this replay-scenario?

Thanks in advance

Olle


        Olle Noren
Systems Engineer
Fleet Perception for Maintenance        
[cid:nd_logo_d020cb30-0d08-4da8-8390-474f0e5447c8.png]
        NIRA Dynamics AB
Wallenbergs gata 4
58330 Linköping
Sweden  Mobile: +46 709 748 304
olle.no...@niradynamics.se
www.niradynamics.se
        Together for smarter safety

Reply via email to