[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2021-09-02 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408801#comment-17408801
 ] 

Till Rohrmann commented on FLINK-19816:
---

Hi [~Paul Lin], I think you are not running into the concrete issue that was 
fixed with this ticket. Instead, I believe you are running into FLINK-11813, 
which will be fixed with the next release. 

I think the following is happening: The job reaches a globally terminal state 
(FAILED). It then tells the {{Dispatcher}}, which triggers the cleanup of the 
HA information. After the cleanup has happened, the {{Dispatcher}} process 
loses leadership and is restarted. Since you seem to use application mode/per-job 
mode, Flink will be started with the same job but without any checkpoint 
information, since that has been cleaned up. This will ultimately result in a 
restart from an empty state.

Part of the problem has been solved via FLINK-21979, but the last remaining 
piece is FLINK-11813, which will most likely introduce a {{JobResultStore}} 
that can outlive the Flink cluster. Only by persisting information about the 
job status in a way that survives a cluster failure can we properly resolve 
the situation. As a consequence, there will be some bookkeeping information 
that the user/operator of Flink needs to take care of.
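
For illustration, a purely hypothetical sketch of what such a 
{{JobResultStore}} could look like; the actual interface will be defined by 
FLINK-11813, and all method names below are invented for this example:

{code:java}
import java.io.IOException;

import org.apache.flink.api.common.JobID;

// Hypothetical sketch only -- not the interface from FLINK-11813.
public interface JobResultStoreSketch {

    // Persist the fact that a job reached a globally terminal state,
    // before any HA cleanup starts; this record must survive cluster failures.
    void createDirtyResult(JobID jobId) throws IOException;

    // Mark the entry as clean once all HA artifacts of the job
    // (checkpoints, persisted JobGraph) have been removed.
    void markResultAsClean(JobID jobId) throws IOException;

    // A recovering Dispatcher would consult this before re-submitting a
    // recovered JobGraph: if an entry exists, the job already finished
    // and must not be executed again.
    boolean hasJobResultEntry(JobID jobId) throws IOException;
}
{code}

The entries persisted by such a store are exactly the bookkeeping information 
mentioned above.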

> Flink restored from a wrong checkpoint (a very old one and not the last 
> completed one)
> --
>
> Key: FLINK-19816
> URL: https://issues.apache.org/jira/browse/FLINK-19816
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Steven Zhen Wu
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.3, 1.12.0
>
> Attachments: jm.log
>
>
> h2. Summary
> Upon failure, it seems that Flink didn't restore from the last completed 
> checkpoint. Instead, it restored from a very old checkpoint. As a result, the 
> Kafka offsets were invalid, which caused the job to replay from the beginning 
> because the Kafka consumer "auto.offset.reset" was set to "EARLIEST".
> This is an embarrassingly parallel stateless job. Parallelism is over 1,000. 
> I have the full log file from jobmanager at INFO level available upon request.
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator 
> [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 
> 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During restart, somehow it decided to restore from a very old checkpoint 
> *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-3]  - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-5]  - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem
>  [cluster-io-thread-25]  - Deleting path: 
> s3:///checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO  
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess 
> [cluster-io-thread-5]  - Trying to recover job with job id 
> 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem
>  [cluster-io-thread-25]  - Deleting path: 
> s3:///checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Recovering checkpoints from 
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Found 0 checkpoints in 
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore 
> [flink-akka.actor.default-dispatcher-82003]  - Trying to fetch 0 checkpoints 
> from storage.
> 2020-10-25 02:36:03,432 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> [flink-akka.actor.default-dispatcher-82003]  - Starting job 
> 233b4938179c06974e4535ac8a868675 from savepoint 
> s3:///checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
>  ()
> {code}
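
The "EARLIEST" reset behavior mentioned in the description corresponds to a 
consumer setup along these lines (a hedged sketch: topic name, bootstrap 
servers, and group id are illustrative placeholders, not this job's actual 
settings):

{code:java}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "example-group");       // placeholder
        // When no valid offsets are available (e.g. the offsets restored from
        // an old checkpoint no longer exist in the topic), the consumer
        // rewinds to the earliest retained records -- the replay described above.
        props.setProperty("auto.offset.reset", "earliest");

        env.addSource(new FlinkKafkaConsumer<>(
                        "example-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("consumer-sketch");
    }
}
{code}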



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2021-09-02 Thread Paul Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408693#comment-17408693
 ] 

Paul Lin commented on FLINK-19816:
--

Still getting this error with Flink 1.12.3. The jobmanager logs are attached; 
please take a look. [~trohrmann] [^jm.log]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-09 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228601#comment-17228601
 ] 

Till Rohrmann commented on FLINK-19816:
---

After fixing FLINK-20033, I've downgraded this issue to critical because the 
problem should now be very unlikely to happen. Still, we should fix this 
problem as soon as possible.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-06 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227458#comment-17227458
 ] 

Till Rohrmann commented on FLINK-19816:
---

The new ticket for tracking the second problem is FLINK-20033.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-06 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227446#comment-17227446
 ] 

Till Rohrmann commented on FLINK-19816:
---

It turns out that we actually have two problems:

1) The one we have already described: a job reaching a globally terminal state 
while being suspended, and the resulting race condition.

2) Stopping the JobMaster causes the job to fail, which can exceed the maximum 
allowed number of restarts and thereby send the job to {{FAILED}}. That's also 
what happened in your case, [~stevenz3wu]. This problem has been unintentionally 
fixed via FLINK-19237 for {{1.12}}. I will create a new ticket to also fix it 
for {{1.10}} and {{1.11}}.
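
For context, a hedged sketch of how such a restart strategy is configured (the 
values are purely illustrative, not this job's actual settings):

{code:java}
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Tolerate at most 20 task failures within a 30-minute window,
        // waiting 30 seconds between restart attempts. Once the rate is
        // exceeded, canRestart() returns false and the job transitions to
        // the globally terminal state FAILED.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                20,                // max failures per interval
                Time.minutes(30),  // measurement interval
                Time.seconds(30)   // delay between restart attempts
        ));
    }
}
{code}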




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227425#comment-17227425
 ] 

Xintong Song commented on FLINK-19816:
--

Alright, thanks for the explanation.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-06 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227412#comment-17227412
 ] 

Till Rohrmann commented on FLINK-19816:
---

This problem is not new. However, it affects all HA setups and can effectively 
cause data duplication in exactly-once sinks, because we might simply run a job 
twice due to this bug. Hence, I believe that it is a true blocker which we must 
fix before releasing.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-06 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227266#comment-17227266
 ] 

Xintong Song commented on FLINK-19816:
--

Thanks for the discussion, [~stevenz3wu] & [~trohrmann].

[~trohrmann], is this a new problem introduced after 1.11.0? If not, maybe we 
can downgrade it to a critical issue, given that the problem only occurs under 
ZooKeeper failures.

WDYT?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-05 Thread Steven Zhen Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226740#comment-17226740
 ] 

Steven Zhen Wu commented on FLINK-19816:


[~trohrmann] thanks a lot for taking a look. I forwarded the logs to you, but 
it seems that we may not need them.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226609#comment-17226609
 ] 

Till Rohrmann commented on FLINK-19816:
---

Looking at the code, I believe that the scenario you are describing can 
actually happen. If the job is about to complete successfully, it will notify 
the {{Dispatcher}} and then remove all checkpoints from the 
{{CompletedCheckpointStore}}. If, at the same time, the Dispatcher loses 
leadership, it will stop all running jobs but without cleaning up the persisted 
jobs. Depending on which action is executed first (losing the leadership and 
stopping jobs without cleaning up the persisted JobGraph, or completing 
successfully and cleaning up the persisted JobGraph), one can end up with a 
cluster that has deleted all checkpoint data but not the {{JobGraph}} itself.

I think the problem is that we don't look at the final job status when deciding 
whether to clean up the persisted {{JobGraph}}; instead, we decide on 
{{cleanupHA}} when the {{Dispatcher}} is being stopped or when the job reaches 
a globally terminal state. In the former case, this can lead to ignoring a 
successful termination of the job.
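
To make the ordering dependency concrete, a hedged, self-contained toy model of 
the two interleavings (class and method names are invented; this is 
deliberately not the actual {{Dispatcher}} code):

{code:java}
// Toy model of the race between HA cleanup on successful job completion
// and leadership loss. Illustrative only.
public class CleanupRaceSketch {

    static final class HaState {
        boolean jobGraphPersisted = true;
        boolean checkpointsPersisted = true;
    }

    // Path A: the job reaches a globally terminal state; the checkpoints
    // are removed before the persisted JobGraph.
    static void deleteCheckpoints(HaState s) { s.checkpointsPersisted = false; }
    static void deleteJobGraph(HaState s)    { s.jobGraphPersisted = false; }

    // Path B: the Dispatcher loses leadership and stops the job, but
    // deliberately keeps the persisted JobGraph for the next leader.
    static void loseLeadership(HaState s)    { /* no HA cleanup */ }

    public static void main(String[] args) {
        // Benign interleaving: the cleanup finishes before leadership is lost.
        HaState ok = new HaState();
        deleteCheckpoints(ok);
        deleteJobGraph(ok);
        loseLeadership(ok);
        System.out.println("cleanup first: jobGraph=" + ok.jobGraphPersisted
                + ", checkpoints=" + ok.checkpointsPersisted);

        // Problematic interleaving: leadership is lost between the two steps,
        // so deleteJobGraph never runs under the old leader.
        HaState bad = new HaState();
        deleteCheckpoints(bad);
        loseLeadership(bad);
        System.out.println("race:          jobGraph=" + bad.jobGraphPersisted
                + ", checkpoints=" + bad.checkpointsPersisted);
        // The next leader recovers the still-persisted JobGraph but finds no
        // checkpoints, so the job is started again from an old/empty state.
    }
}
{code}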




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-05 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226595#comment-17226595
 ] 

Till Rohrmann commented on FLINK-19816:
---

Hi [~stevenz3wu], thanks a lot for this analysis. Have you gotten hold of the 
complete logs? I will go through the code to see whether I can spot a problem, 
and the logs might speed this process up a bit. In any case, this sounds like a 
bug in Flink to me.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

2020-11-04 Thread Steven Zhen Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226460#comment-17226460
 ] 

Steven Zhen Wu commented on FLINK-19816:


This happened again for the same job in production. I noticed that both 
failures started with a ZooKeeper failure. Here are some observations on the 
sequence of events. My hypothesis is that the race condition / interaction 
between recovering from the ZooKeeper failure and the failure-rate 
restart-strategy caused this problem of restoring a wrong, very old checkpoint. 
[~trohrmann]

1. Initially, there were some problems with ZooKeeper that caused the job to 
fail:

{code}
2020-10-25 02:35:59,266 WARN  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)]  - Client session timed out, have not heard from server in 26676 ms for sessionid 0x363850d0d034d9d
2020-10-25 02:35:59,268 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)]  - Client session timed out, have not heard from server in 26676 ms for sessionid 0x363850d0d034d9d, closing socket connection and attempting reconnect
2020-10-25 02:35:59,282 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25]  - Deleting path: s3://us-west-2.spaas.prod/checkpoints/r7E1/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/16d4e138-4199-4fd2-a014-4b394189f72b
2020-10-25 02:35:59,369 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [main-EventThread]  - State change: SUSPENDED
{code}


2. This job is configured with `restart-strategy=failure-rate`, and there were 
enough task restarts to trigger the terminal condition (canRestart() returning 
false). This should eventually lead the Flink job to a halted/terminal state:

{code}
2020-10-25 02:35:59,641 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-81991]  - Job clapp-avro-nontvui (233b4938179c06974e4535ac8a868675) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=180,backoffTimeMS=3,maxFailuresPerInterval=20)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:203)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:508)
at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1725)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777)
at org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435)
at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:352)
at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:216)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:514)
at