Hi everyone, I'm currently prototyping a project where we need to process a large number of Kafka input topics (say, a couple of hundred), all of which share the same DataType/Schema.
Our objective is to run the same Flink SQL on all of the input topics, but I am concerned about doing this in a single large Flink SQL application for fault-isolation purposes. We'd like to limit the "blast radius" in cases of data issues or "poison pills" in any particular Kafka topic — meaning, if one topic runs into a problem, it shouldn't compromise or halt the processing of the others. At the same time, we are concerned about the operational toil of managing hundreds of Flink jobs that are really one logical application.

Has anyone here tackled a similar challenge? If so:

1. How did you design your solution to handle a large number of topics without creating a heavy management burden?
2. What strategies or patterns have you found effective for isolating issues within a specific topic so that they do not affect the processing of others?
3. Are there specific configurations or tools within the Flink ecosystem that you'd recommend for managing this scenario efficiently?

Any examples, suggestions, or references to relevant documentation would be helpful. Thank you in advance for your time and help!
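For context, here is roughly what the single-job prototype looks like today: one Kafka source table subscribing to all topics at once. This is only a sketch — the table name, columns, topic pattern, and broker address are placeholders, and I'm assuming the Kafka SQL connector's `topic-pattern` option since all topics share one schema:

```sql
-- One source table over every topic matching the pattern (names are hypothetical).
CREATE TABLE events (
  event_id STRING,
  payload  STRING,
  ts       TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic-pattern' = 'events\..*',               -- subscribe to all matching topics
  'properties.bootstrap.servers' = 'broker:9092',
  'properties.group.id' = 'events-consumer',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json',
  'json.ignore-parse-errors' = 'true'           -- skip malformed records instead of failing the job
);
```

Options like `'json.ignore-parse-errors'` cover deserialization poison pills, but as far as I can tell they don't help with other per-topic failure modes (e.g. one topic's data triggering runtime errors downstream), which is why I'm unsure a single job is the right shape.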