Hi, Ashish. Can you confirm whether, on the subtask label page of this sink materializer node, the input records for each subtask are approximately the same?
If the input records for subtask number 5 are significantly larger compared to the others, it signifies a serious data skew, and it would be necessary to modify the SQL appropriately to resolve this skew. If the differences among all subtasks are not significant, we might be encountering an IO bottleneck. In this case, we could try increasing the parallelism of this vertex, or, as Penny suggested, we could try to enhance the memory of tm. -- Best! Xuyang 在 2024-06-24 21:28:58,"Penny Rastogi" <walls.fl...@gmail.com> 写道: Hi Ashish, Can you check a few things. 1. Is your source broker count also 20 for both topics? 2. You can try increasing the state operation memory and reduce the disk I/O. Increase the number of CU resources in a single slot. Set optimization parameters: taskmanager.memory.managed.fraction=x state.backend.rocksdb.block.cache-size=x state.backend.rocksdb.writebuffer.size=x 3. If possible, try left window join for your streams Please, share what sink you are using. Also, the per-operator, source and sink throughput, if possible? On Mon, Jun 24, 2024 at 3:32 PM Ashish Khatkar via user <user@flink.apache.org> wrote: Hi all, We are facing backpressure in the flink sql job from the sink and the backpressure only comes from a single task. This causes the checkpoint to fail despite enabling unaligned checkpoints and using debloating buffers. We enabled flamegraph and the task spends most of the time doing rocksdb get and put. The sql job does a left join over two streams with a parallelism of 20. The total data the topics have is 540Gb for one topic and roughly 60Gb in the second topic. We are running 20 taskmanagers with 1 slot each with each taskmanager having 72G mem and 9 cpu. Can you provide any help on how to go about fixing the pipeline? We are using Flink 1.17.2. The issue is similar to this stackoverflow thread, instead of week it starts facing back pressure as soon as the lag comes down to 4-5%.