Each separate job would have its own consumer group, hence each will read independently from the same topic, and on checkpointing each will commit its own offsets. So if any job fails, it will not affect the progress of the other jobs reading from Kafka.
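A minimal sketch of the per-job consumer group idea, using only plain `java.util.Properties` (the job names and broker address are made up for illustration; in a real Flink job these properties would be handed to the Kafka source/connector):

```java
import java.util.Properties;

// Hypothetical helper: builds Kafka consumer properties for one job.
// Each job gets its own group.id, so its committed offsets are fully
// independent of every other job reading the same topic.
class ConsumerGroupConfig {
    static Properties forJob(String jobName, String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Distinct consumer group per job: a failure in one job never
        // blocks or rewinds the offsets of another.
        props.setProperty("group.id", "flink-" + jobName);
        // Flink typically commits offsets on checkpoint, not via
        // the client's auto-commit.
        props.setProperty("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties a = forJob("enrichment", "broker:9092");
        Properties b = forJob("aggregation", "broker:9092");
        System.out.println(a.getProperty("group.id"));
        System.out.println(b.getProperty("group.id"));
    }
}
```

Because the group ids differ, the broker tracks a separate committed offset per job, which is what makes the failure isolation described above possible.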
I am not sure of the impact on network load when multiple consumer groups request data from the same topic. Multiple small jobs do ensure that each job is scaled and monitored in isolation.

An efficient serde can help a lot with the data we store in state, the data forwarded to the next steps, and overall state management. Another thing you can look into: if a job step is keyed, make sure the keys are strings or other Java primitive types, since Object keys are much slower when reading from and writing to a state store.

Thanks
Sachin

On Wed, May 15, 2024 at 7:58 AM longfeng Xu <xulongfeng2...@gmail.com> wrote:

> Thank you, we will try.
>
> I'm still confused about multiple jobs on a cluster (flink-session-yarn)
> reading the same topic from a Kafka cluster. I understand that in this
> mode the number of reads of the topic has not decreased; the jobs just
> share the TCP channel of the task manager, reducing the network load. Is
> my understanding correct?
>
> Or are there any other advantages to it? Please advise. Thank you.
>
> Sachin Mittal <sjmit...@gmail.com> wrote on Wed, May 15, 2024, 09:24:
>
>> We have the same scenario.
>> We thought of having one big job with multiple branches, but this leads
>> to a single point of failure, as any issue with any branch would cause
>> the whole job to fail and all the other branches to stop processing.
>>
>> Hence running multiple jobs on a cluster, say YARN, is better.
>>
>> Now, to overcome the serde issue, try to use some of the more efficient
>> schemes recommended by Flink. We are using POJOs and this has yielded
>> good results for us.
>>
>> On Wed, 15 May 2024 at 5:59 AM, longfeng Xu <xulongfeng2...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> There are many Flink jobs reading one Kafka topic in this scenario,
>>> therefore CPU resources are wasted on serialization/deserialization and
>>> the network load is too heavy. Can you recommend a solution to avoid
>>> this situation? E.g. would it be more effective to use one large
>>> streaming job with multiple branches?
>>>
>>> Best regards,
>>>
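A hypothetical sketch of the keying advice above: rather than keying a stream by a composite Object (e.g. a POJO holding several fields), derive a flat String key from those fields. The event type and field names here are invented for illustration; only the key-derivation pattern is the point:

```java
// Hypothetical event type; not from the original thread.
class KeyedStreamSketch {
    record Event(String userId, String region, long amount) {}

    // Derive a flat String key instead of keying by a composite POJO.
    // String/primitive keys hash, compare, and serialize faster when the
    // framework reads from and writes to the state store.
    static String stateKey(Event e) {
        return e.userId() + "|" + e.region();
    }

    public static void main(String[] args) {
        Event e = new Event("u42", "eu-west", 100L);
        // In a real job this selector would be passed to the keyBy step,
        // e.g. stream.keyBy(KeyedStreamSketch::stateKey)
        System.out.println(stateKey(e));
    }
}
```

Using a delimiter that cannot appear in the fields keeps the derived key unambiguous; the same string key is then cheap for the state backend to use on every read and write.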