Hi, is it possible to synchronize two Kafka sources so that they consume from different Kafka topics at roughly the same event time?
My use case: I have two Kafka topics, A (very large) and B (large). There is a one-to-one or one-to-zero mapping between A and B. The topology simply joins A and B in a tumbling time window and aggregates the joined data.

In real time this is not a problem. But when I start the job from last week's data, it becomes very slow, because by the time source A has consumed 1 minute of data from Kafka, source B has consumed 1 hour. Since an operator's watermark advances with the smallest watermark of its parent operators, the data from source B creates many windows that stay in memory until they can be triggered later. That increases the state size, checkpoints get bigger and bigger, and the job becomes slower.

I tried putting a map operator after the sources that writes event times to an external system. If a source is far ahead of the other one, it sleeps for a short time, consumes a little, then checks again and sleeps again if necessary. This map operator made checkpoint times much longer. I guess sleeping in an operator is not a good idea with the checkpoint mechanism.

Is there a way to make two or more sources consume from Kafka in a synchronized way using Flink?
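
To make the backlog and the throttling check described above concrete, here is a minimal plain-Java sketch. It does not use any Flink API. The class and method names, the 1-minute window size, and the 5-minute maximum lead are all assumptions made up for illustration, not values from my job.

```java
// Illustrative only: plain Java, no Flink dependency. All names are hypothetical.
class WatermarkLagSketch {

    // A two-input operator advances with the smaller of its input watermarks.
    static long combinedWatermark(long watermarkA, long watermarkB) {
        return Math.min(watermarkA, watermarkB);
    }

    // Rough count of tumbling windows the faster source has already filled
    // but that cannot fire yet because the combined watermark lags behind.
    static long pendingWindows(long watermarkA, long watermarkB, long windowSizeMs) {
        long lead = Math.max(watermarkA, watermarkB) - combinedWatermark(watermarkA, watermarkB);
        return (lead + windowSizeMs - 1) / windowSizeMs; // ceiling division
    }

    // The check my throttling operator performs: pause when this source's
    // event time is more than maxLeadMs ahead of the other source's event time.
    static boolean shouldPause(long ownEventTimeMs, long otherEventTimeMs, long maxLeadMs) {
        return ownEventTimeMs - otherEventTimeMs > maxLeadMs;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long a = 1 * minute;   // source A has consumed 1 minute of event time
        long b = 60 * minute;  // source B has consumed 1 hour of event time

        System.out.println("combined watermark (ms): " + combinedWatermark(a, b));
        System.out.println("pending 1-minute windows: " + pendingWindows(a, b, minute));
        System.out.println("B should pause (max lead 5 min): " + shouldPause(b, a, 5 * minute));
    }
}
```

With these assumed numbers, B is 59 minutes ahead of A, so about 59 one-minute windows are held in state until A catches up. The longer the replay runs at these rates, the larger that gap becomes.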