Hi, Ashish.
Can you check, on the Subtasks tab of this sink materializer node, whether the
input records are roughly the same for each subtask?
If subtask number 5 receives significantly more input records than the others,
that indicates serious data skew, and the SQL would need to be modified to
resolve it.
If the differences among the subtasks are not significant, we might be hitting
an I/O bottleneck. In that case, you could try increasing the parallelism of
this vertex or, as Penny suggested, increasing the TaskManager memory.
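As a rough way to put a number on the skew, here is a minimal sketch. It
assumes the per-subtask input record counts are copied by hand from the web UI.
The counts below are made up for illustration, and skew_ratio and
records_received are hypothetical names, not part of any Flink API.

```python
from statistics import median


def skew_ratio(records_per_subtask):
    """Return (index of the busiest subtask, its count divided by the median count)."""
    typical = median(records_per_subtask)
    if typical == 0:
        raise ValueError("median record count is zero; ratio is undefined")
    # Index of the subtask with the most input records (first one on ties).
    busiest = max(range(len(records_per_subtask)),
                  key=records_per_subtask.__getitem__)
    return busiest, records_per_subtask[busiest] / typical


# Hypothetical counts for 20 subtasks; subtask 5 is deliberately inflated.
records_received = [1_000_000] * 20
records_received[5] = 12_000_000

subtask, ratio = skew_ratio(records_received)
print(f"busiest subtask: {subtask}, {ratio:.1f}x the median")
```

A ratio close to 1 would point towards the I/O bottleneck case, while one
subtask at several times the median would point towards skew on the join key.
Where exactly to draw that line is a judgment call.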
--
Best!
Xuyang
At 2024-06-24 21:28:58, "Penny Rastogi" <[email protected]> wrote:
Hi Ashish,
Can you check a few things?
1. Is your source broker count also 20 for both topics?
2. You can try increasing the state operation memory to reduce disk I/O.
Increase the number of CU resources in a single slot.
Set optimization parameters:
taskmanager.memory.managed.fraction=x
state.backend.rocksdb.block.cache-size=x
state.backend.rocksdb.writebuffer.size=x
3. If possible, try a left window join for your streams.
Please share which sink you are using and, if possible, the per-operator,
source, and sink throughput.
On Mon, Jun 24, 2024 at 3:32 PM Ashish Khatkar via user <[email protected]>
wrote:
Hi all,
We are facing backpressure from the sink in a Flink SQL job, and the
backpressure comes from a single task only. This causes checkpoints to fail
despite enabling unaligned checkpoints and buffer debloating. We enabled flame
graphs, and the task spends most of its time in RocksDB get and put. The SQL
job does a left join over two streams with a parallelism of 20. The topics
hold 540 GB of data in one topic and roughly 60 GB in the other. We are running
20 TaskManagers with 1 slot each, and each TaskManager has 72 GB of memory and
9 CPUs.
Can you provide any help on how to go about fixing the pipeline? We are using
Flink 1.17.2. The issue is similar to this stackoverflow thread, except that
instead of after a week, the job starts facing backpressure as soon as the lag
comes down to 4-5%.