[ https://issues.apache.org/jira/browse/FLINK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yun Tang closed FLINK-28030.
----------------------------
    Resolution: Duplicate

> Checkpoint always hangs when running some jobs
> ----------------------------------------------
>
>                 Key: FLINK-28030
>                 URL: https://issues.apache.org/jira/browse/FLINK-28030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.3
>            Reporter: Pauli Gandhi
>            Priority: Major
>
> We have noticed that a Flink job hangs every time at its first checkpoint,
> after 15 of 23 acknowledgments (65%) have completed, and eventually times
> out after 2 hours. There is no CPU activity, yet a number of tasks report
> 100% back pressure. The problem is specific to this job and slight
> variations of it. We have created many Flink jobs in the past and never
> encountered this issue.
>
> Here are the things we tried to narrow down the problem:
> * The job runs fine if checkpointing is disabled.
> * Increasing the number of task managers and the parallelism to 2 seemed
>   to help the job complete. However, it stalled again when we sent a
>   larger data set.
> * Increasing taskmanager memory from 4 GB to 16 GB and CPU from 1 to 4
>   didn't help.
> * Restarting the job manager sometimes helps, but not always.
> * Breaking up the job into smaller parts helps the job finish.
> * We analyzed the thread dump, and all threads appear to be in either a
>   sleeping or a waiting state.
>
> Here are the environment details:
> * Flink version 1.14.3
> * Running on Kubernetes
> * Using the RocksDB state backend
> * Checkpoint storage is S3, using the Presto library
> * Exactly-once semantics with unaligned checkpoints enabled
> * Checkpoint timeout: 2 hours
> * Maximum concurrent checkpoints: 1
> * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
> * Using Kafka for input and output
>
> I have attached the task manager logs, the thread dump, and screenshots of
> the job graph and the stalled checkpoint.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)