Omkar Deshpande created BEAM-10927:
--------------------------------------

             Summary: Beam Flink Runner 1.10 checkpoint failure
                 Key: BEAM-10927
                 URL: https://issues.apache.org/jira/browse/BEAM-10927
             Project: Beam
          Issue Type: Bug
          Components: runner-flink
    Affects Versions: 2.23.0
            Reporter: Omkar Deshpande


I recently upgraded from beam-runners-flink-1.9 v2.23.0 to 
beam-runners-flink-1.10 v2.23.0, and also upgraded the Flink server from 1.9.3 
to 1.10.2.

The Beam pipeline reads from KafkaIO and writes to KafkaIO, and there is an 
in-memory ParDo between PBegin and PDone. The application is configured to use 
S3 for checkpointing, and the state backend is RocksDB.

This Beam pipeline worked as expected with beam-runners-flink-1.9. After 
upgrading to beam-runners-flink-1.10, the checkpoints keep timing out. I have 
tried increasing the checkpoint timeout to several hours, but the checkpoints 
still time out.

There are no exceptions in the log. Based on the logs, neither the synchronous 
nor the asynchronous phase of checkpointing appears to run. Usually the 
"Trigger checkpoint" log statement is followed by "Confirm checkpoint" when the 
checkpoint succeeds. But with 1.10, I only see "Trigger checkpoint" and no 
confirmation of completion or even an indication of progress. There is enough 
CPU and memory available, and there is no deadlock.
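
To make the log observation above easier to reproduce, here is a minimal 
sketch of a script that lists checkpoints which were triggered but never 
confirmed. It is an illustration only: the function name 
unconfirmed_checkpoints is hypothetical, and the script assumes that each 
relevant log line contains "Trigger checkpoint <id>@" or 
"Confirm checkpoint <id>@". If the log lines are shaped differently, the 
regular expression needs adjusting.

```python
import re
from typing import Iterable, List

# Assumption: the checkpoint id directly follows the phrase and is
# terminated by "@", e.g. "Trigger checkpoint 12@<timestamp> for <attempt>."
# Adjust this pattern if the actual log lines look different.
_EVENT = re.compile(r"(Trigger|Confirm) checkpoint (\d+)@")


def unconfirmed_checkpoints(lines: Iterable[str]) -> List[int]:
    """Return the ids of checkpoints that were triggered but never confirmed."""
    triggered = set()
    confirmed = set()
    for line in lines:
        match = _EVENT.search(line)
        if match is None:
            continue  # not a checkpoint trigger/confirm line
        checkpoint_id = int(match.group(2))
        if match.group(1) == "Trigger":
            triggered.add(checkpoint_id)
        else:
            confirmed.add(checkpoint_id)
    # Ids seen in a trigger line with no matching confirm line, in order.
    return sorted(triggered - confirmed)
```

The function can be fed an open task manager log file, for example 
unconfirmed_checkpoints(open(path)) with path pointing at the log. With the 
behaviour described above, every triggered checkpoint id would be expected to 
show up in the result on 1.10, and none on 1.9.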





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
