Omkar Deshpande created BEAM-10927:
--------------------------------------
Summary: Beam Flink Runner 1.10 checkpoint failure
Key: BEAM-10927
URL: https://issues.apache.org/jira/browse/BEAM-10927
Project: Beam
Issue Type: Bug
Components: runner-flink
Affects Versions: 2.23.0
Reporter: Omkar Deshpande
I recently upgraded from beam-runners-flink-1.9 v2.23.0 to
beam-runners-flink-1.10 v2.23.0, and also upgraded the Flink server from 1.9.3
to 1.10.2.
The Beam pipeline reads from KafkaIO and writes to KafkaIO, with an in-memory
ParDo between PBegin and PDone. The application is configured to use S3 for
checkpointing, and the state backend is RocksDB.
This pipeline worked as expected with beam-runners-flink-1.9, but after
upgrading to beam-runners-flink-1.10 the checkpoints keep timing out. I have
tried increasing the checkpoint timeout to several hours, but they still time
out.
There are no exceptions in the log. Based on the logs, neither the synchronous
nor the asynchronous phase of checkpointing is happening. Usually the "Trigger
checkpoint" log statement is followed by "Confirm checkpoint" when the
checkpoint succeeds. But with 1.10, I only see "Trigger checkpoint" and no
confirmation of completion or even any indication of progress. There is enough
CPU and memory available, and there is no deadlock.
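For anyone trying to reproduce the log check described above, here is a rough
sketch of how triggered-but-unconfirmed checkpoints could be listed from a log
file. It assumes each relevant line contains "Trigger checkpoint <id>" or
"Confirm checkpoint <id>"; the exact wording and the presence of a numeric id
are assumptions, and the function name and sample lines are hypothetical, not
taken from the actual job logs.

```python
import re

# Assumed line shapes; the real Flink/Beam log messages may be worded differently.
TRIGGER = re.compile(r"Trigger checkpoint (\d+)")
CONFIRM = re.compile(r"Confirm checkpoint (\d+)")


def unconfirmed_checkpoints(lines):
    """Return ids of checkpoints that were triggered but never confirmed."""
    triggered = []      # ids in the order they were triggered
    confirmed = set()   # ids that later saw a confirmation line
    for line in lines:
        match = TRIGGER.search(line)
        if match:
            triggered.append(int(match.group(1)))
            continue
        match = CONFIRM.search(line)
        if match:
            confirmed.add(int(match.group(1)))
    return [cid for cid in triggered if cid not in confirmed]


# Hypothetical sample lines for illustration only.
sample = [
    "Trigger checkpoint 1",
    "Confirm checkpoint 1",
    "Trigger checkpoint 2",
    "Trigger checkpoint 3",
]
print(unconfirmed_checkpoints(sample))
```

With the 1.9 runner this list would be expected to stay empty, while the
behaviour reported here would show every triggered checkpoint in it.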
--
This message was sent by Atlassian Jira
(v8.3.4#803005)