Hi all,

We are seeing backpressure in a Flink SQL job. It originates at the sink
and comes from only a single task. As a result, checkpoints fail even
though we have enabled unaligned checkpoints and buffer debloating.
We enabled flame graphs, and the task spends most of its time in RocksDB
get and put calls. The SQL job does a left join over two streams with a
parallelism of 20. One topic holds 540 GB of data in total and the other
roughly 60 GB. We are running 20 TaskManagers with 1 slot each; each
TaskManager has 72 GB of memory and 9 CPUs.
Can you provide any help on how to go about fixing the pipeline? We are
using Flink 1.17.2. The issue is similar to this Stack Overflow thread
<https://stackoverflow.com/questions/77762119/flink-sql-job-stops-with-backpressure-after-a-week-of-execution>,
except that instead of appearing after a week, the backpressure starts as
soon as the lag comes down to 4-5%.
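
For a rough sense of scale, the figures above can be turned into a
back-of-the-envelope estimate. This is only a sketch under two assumptions
that have not been measured: that keys are evenly distributed across
subtasks, and that all input data ends up retained as join state. The topic
names are placeholders.

```python
# Back-of-the-envelope sizing from the figures quoted in this message.
# Assumptions (not measured): keys are evenly distributed across subtasks,
# and all input data is retained as join state.

TOPIC_SIZES_GB = {"topic_1": 540, "topic_2": 60}  # placeholder topic names
PARALLELISM = 20
TASKMANAGER_MEMORY_GB = 72  # one slot per TaskManager


def per_subtask_gb(topic_sizes_gb, parallelism):
    """Return the data share per subtask under a perfectly even split."""
    return sum(topic_sizes_gb.values()) / parallelism


even_share_gb = per_subtask_gb(TOPIC_SIZES_GB, PARALLELISM)
print(f"total input:        {sum(TOPIC_SIZES_GB.values())} GB")
print(f"even share/subtask: {even_share_gb:.0f} GB")
print(f"memory per slot:    {TASKMANAGER_MEMORY_GB} GB")
```

Under an even split, each subtask would handle about 30 GB. A single
backpressured task would be consistent with uneven key distribution, though
the figures here cannot confirm that.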

