[jira] [Updated] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

Steven Zhen Wu (Jira) Tue, 27 Oct 2020 13:42:23 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steven Zhen Wu updated FLINK-19816:
-----------------------------------
    Description: 
h2. Summary

Upon failure, it seems that Flink didn't restore from the last completed 
checkpoint. Instead, it restored from a very old checkpoint. As a result, Kafka 
offsets are invalid and caused the job to replay from the beginning as Kafka 
consumer "auto.offset.reset" was set to "EARLIEST".

This is an embarrassingly parallel stateless job. Parallelism is over 1,000. I 
have the full log file from jobmanager at INFO level available upon request.

h2. Sequence of events from the logs

Just before the failure, checkpoint *210768* completed.

{code}
2020-10-25 02:35:05,970 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator 
[jobmanager-future-thread-5] - Completed checkpoint 210768 for job 
233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
{code}

During restart, somehow it decided to restore from a very old checkpoint 
*203531*.
{code:java}
2020-10-25 02:36:03,301 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-3]  - Start SessionDispatcherLeaderProcess.
2020-10-25 02:36:03,302 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-5]  - Recover all persisted job graphs.
2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem              
           [cluster-io-thread-25]  - Deleting path: 
s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
2020-10-25 02:36:03,307 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-5]  - Trying to recover job with job id 
233b4938179c06974e4535ac8a868675.

2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem              
           [cluster-io-thread-25]  - Deleting path: 
s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
...
2020-10-25 02:36:03,427 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Recovering checkpoints from 
ZooKeeper.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Found 0 checkpoints in ZooKeeper.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Trying to fetch 0 checkpoints 
from storage.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    
[flink-akka.actor.default-dispatcher-82003]  - Starting job 
233b4938179c06974e4535ac8a868675 from savepoint 
s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
 ()
{code}



  was:
h2. Summary

Upon failure, it seems that Flink didn't restore from the last completed 
checkpoint. Instead, it restored from a very old checkpoint. As a result, Kafka 
offsets are invalid and caused the job to replay from the beginning as Kafka 
consumer "auto.offset.reset" was set to "EARLIEST".

This is an embarrassingly parallel stateless job. Parallelism is over 1,000. I 
have the full log file from jobmanager at INFO level.

h2. Sequence of events from the logs

Just before the failure, checkpoint *210768* completed.

{code}
2020-10-25 02:35:05,970 INFO 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator 
[jobmanager-future-thread-5] - Completed checkpoint 210768 for job 
233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
{code}

During restart, somehow it decided to restore from a very old checkpoint 
*203531*.
{code:java}
2020-10-25 02:36:03,301 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-3]  - Start SessionDispatcherLeaderProcess.
2020-10-25 02:36:03,302 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-5]  - Recover all persisted job graphs.
2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem              
           [cluster-io-thread-25]  - Deleting path: 
s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
2020-10-25 02:36:03,307 INFO  
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
[cluster-io-thread-5]  - Trying to recover job with job id 
233b4938179c06974e4535ac8a868675.

2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem              
           [cluster-io-thread-25]  - Deleting path: 
s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
...
2020-10-25 02:36:03,427 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Recovering checkpoints from 
ZooKeeper.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Found 0 checkpoints in ZooKeeper.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
[flink-akka.actor.default-dispatcher-82003]  - Trying to fetch 0 checkpoints 
from storage.
2020-10-25 02:36:03,432 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    
[flink-akka.actor.default-dispatcher-82003]  - Starting job 
233b4938179c06974e4535ac8a868675 from savepoint 
s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
 ()
{code}




> Flink restored from a wrong checkpoint (a very old one and not the last 
> completed one)
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-19816
>                 URL: https://issues.apache.org/jira/browse/FLINK-19816
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Steven Zhen Wu
>            Priority: Major
>
> h2. Summary
> Upon failure, it seems that Flink didn't restore from the last completed 
> checkpoint. Instead, it restored from a very old checkpoint. As a result, 
> Kafka offsets are invalid and caused the job to replay from the beginning as 
> Kafka consumer "auto.offset.reset" was set to "EARLIEST".
> This is an embarrassingly parallel stateless job. Parallelism is over 1,000. 
> I have the full log file from jobmanager at INFO level available upon request.
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator 
> [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 
> 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During restart, somehow it decided to restore from a very old checkpoint 
> *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-3]  - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-5]  - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem            
>              [cluster-io-thread-25]  - Deleting path: 
> s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-5]  - Trying to recover job with job id 
> 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem            
>              [cluster-io-thread-25]  - Deleting path: 
> s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Recovering checkpoints from 
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Found 0 checkpoints in 
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Trying to fetch 0 checkpoints 
> from storage.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    
> [flink-akka.actor.default-dispatcher-82003]  - Starting job 
> 233b4938179c06974e4535ac8a868675 from savepoint 
> s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
>  ()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

Reply via email to