Hi everyone, I'm currently prototyping a project where we need to process a large number of Kafka input topics (say, a couple of hundred), all of which share the same DataType/Schema.
Our objective is to run the same Flink SQL on all of the input topics, but I am concerned about doing this in a single large Flink SQL application for fault-isolation purposes. We'd like to limit the "blast radius" in cases of data issues or "poison pills" in any particular Kafka topic — meaning, if one topic runs into a problem, it shouldn't compromise or halt the processing of the others. At the same time, we are concerned about the operational toil of managing hundreds of Flink jobs that are really one logical application.

Has anyone here tackled a similar challenge? If so:

1. How did you design your solution to handle a large number of topics without creating a heavy management burden?
2. What strategies or patterns have you found effective for isolating issues within a specific topic so that they do not affect the processing of others?
3. Are there specific configurations or tools within the Flink ecosystem that you'd recommend for managing this scenario efficiently?

Any examples, suggestions, or references to relevant documentation would be helpful. Thank you in advance for your time and help!
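For context, here is roughly what the single-job prototype looks like today: one Kafka source table subscribing to all topics at once. This is only a sketch — the table name, columns, topic pattern, and broker address are placeholders, and I'm assuming the Kafka SQL connector's `topic-pattern` option since all topics share one schema:

```sql
-- One source table over every topic matching the pattern (names are hypothetical).
CREATE TABLE events (
  event_id STRING,
  payload  STRING,
  ts       TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic-pattern' = 'events\..*',               -- subscribe to all matching topics
  'properties.bootstrap.servers' = 'broker:9092',
  'properties.group.id' = 'events-consumer',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json',
  'json.ignore-parse-errors' = 'true'           -- skip malformed records instead of failing the job
);
```

Options like `'json.ignore-parse-errors'` cover deserialization poison pills, but as far as I can tell they don't help with other per-topic failure modes (e.g. one topic's data triggering runtime errors downstream), which is why I'm unsure a single job is the right shape.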