[ https://issues.apache.org/jira/browse/BEAM-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17549012#comment-17549012 ]
Danny McCormick commented on BEAM-10927:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20622

> Beam Flink Runner 1.10 checkpoint failure
> -----------------------------------------
>
>                 Key: BEAM-10927
>                 URL: https://issues.apache.org/jira/browse/BEAM-10927
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 2.23.0
>            Reporter: Omkar Deshpande
>            Priority: P3
>
> I recently upgraded to beam-runners-flink-1.10 v2.23.0 from
> beam-runners-flink-1.9 v2.23.0. I also upgraded the Flink server to 1.10.2
> from 1.9.3.
> The Beam pipeline reads from KafkaIO and writes to KafkaIO, with an
> in-memory ParDo between PBegin and PDone. The application is configured to
> use S3 for checkpointing, and the state backend is RocksDB.
> This pipeline was working as expected with beam-runners-flink-1.9, but
> after upgrading to beam-runners-flink-1.10 the checkpoints keep timing out.
> I have tried increasing the checkpoint timeout to several hours, but the
> checkpoints still time out.
> There are no exceptions in the log. Based on the logs, neither the
> synchronous nor the asynchronous phase of checkpointing is happening.
> Usually a "Trigger checkpoint" log statement is followed by "Confirm
> checkpoint" when the checkpoint succeeds. But with 1.10, I only see
> "Trigger checkpoint", with no confirmation of completion or even an
> indication of progress. There is enough CPU and memory available, and
> there is no deadlock.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)