[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323042#comment-17323042 ]

Flink Jira Bot commented on FLINK-16931:

This issue is assigned but has not received an update in 7 days, so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign it so someone else may work on it. In 7 days the issue will be automatically unassigned.

> Large _metadata file lead to JobManager not responding when restart
> --------------------------------------------------------------------
>
>                 Key: FLINK-16931
>                 URL: https://issues.apache.org/jira/browse/FLINK-16931
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0, 1.11.0, 1.12.0
>            Reporter: Lu Niu
>            Assignee: Lu Niu
>            Priority: Minor
>              Labels: stale-assigned
>
> When the _metadata file is big, the JobManager can never recover from the checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart. Here is the related log:
> {code:java}
> 2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
> 2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
> 2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, it looks like ExecutionGraph::restart runs in the JobMaster main thread and eventually calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads the file from DFS. The main thread is essentially blocked for that whole time. One possible solution is to make the downloading part asynchronous. More things might need to be considered, since the original change tried to make it single-threaded. [https://github.com/apache/flink/pull/7568]
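To illustrate the proposed direction, here is a minimal, self-contained sketch of the general pattern: perform the blocking download on an I/O executor and hop back to the single-threaded main executor only to apply the result. The executor setup and the fetchMetadata/applyRestoredState helpers are assumptions for illustration, not Flink's actual API.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncRestoreSketch {

    // Hypothetical stand-ins: a pool for blocking I/O and a single-threaded
    // "main thread" executor, mirroring the JobMaster threading model.
    private static final ExecutorService IO_EXECUTOR = Executors.newFixedThreadPool(4);
    private static final ExecutorService MAIN_THREAD = Executors.newSingleThreadExecutor();

    public static void main(String[] args) throws Exception {
        CompletableFuture
                // Blocking DFS read happens off the main thread.
                .supplyAsync(AsyncRestoreSketch::fetchMetadata, IO_EXECUTOR)
                // Only the (fast) state update runs on the main thread again.
                .thenAcceptAsync(AsyncRestoreSketch::applyRestoredState, MAIN_THREAD)
                .get(30, TimeUnit.SECONDS);

        IO_EXECUTOR.shutdown();
        MAIN_THREAD.shutdown();
    }

    // Simulates downloading a large _metadata file (assumed to take seconds).
    private static String fetchMetadata() {
        try {
            Thread.sleep(2_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "checkpoint-51-metadata";
    }

    // Simulates applying the recovered checkpoint on the main thread.
    private static void applyRestoredState(String metadata) {
        System.out.println("Restored from " + metadata + " without blocking the main thread");
    }
}
{code}

With this shape, the main thread stays responsive to heartbeats while the large _metadata file is being read.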
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248878#comment-17248878 ]

Roman Khachatryan commented on FLINK-16931:
-------------------------------------------

I'm downgrading the priority of this ticket, as the download should only happen in the "cold start" scenario after FLINK-19401. [~qqibrow], please feel free to post the results and upgrade the priority if you're still affected.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246552#comment-17246552 ]

Roman Khachatryan commented on FLINK-16931:
-------------------------------------------

I think the issue was resolved by FLINK-19401: when the JM already has the same checkpoints in memory as in ZK, it won't download them from DFS. That is, if it is not failing over or restoring from a savepoint.

I tried to verify it locally with an artificial failure on CP handle retrieval.

*Without* the fix, I see infinite load attempts failing at:
{code:java}
9642 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from ZooKeeperStateHandleStore{namespace='flink/default/checkpoints/70e47a445aa53ff9bdcba9c79f6a58fa'}.
9644 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/default/checkpoints/70e47a445aa53ff9bdcba9c79f6a58fa'}.
9644 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
9644 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 4.
9644 [flink-akka.actor.default-dispatcher-6] WARN org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Could not retrieve checkpoint, not adding to list of recovered checkpoints.
java.lang.RuntimeException: test
	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:322) ~[classes/:?]
	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:165) ~[classes/:?]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1374) ~[classes/:?]
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateToAll(CheckpointCoordinator.java:1321) ~[classes/:?]
	at org.apache.flink.runtime.scheduler.SchedulerBase.restoreState(SchedulerBase.java:380) ~[classes/:?]
	at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$restartTasks$2(DefaultScheduler.java:291) ~[classes/:?]
{code}

*With* the fix, I see:
{code:java}
5859 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from ZooKeeperStateHandleStore{namespace='flink/default/checkpoints/f6f58c166e273321b03789d1d5211855'}.
5864 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/default/checkpoints/f6f58c166e273321b03789d1d5211855'}.
5864 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All 1 checkpoints found are already downloaded.
5864 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job f6f58c166e273321b03789d1d5211855 from Checkpoint 7 @ 1607521794905 for f6f58c166e273321b03789d1d5211855 located at .
5873 [flink-akka.actor.default-dispatcher-6] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
{code}

[~qqibrow], can you confirm from your side that the fix solves the problem? (1.12.0 / 1.10.3 / 1.11.3)
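To make the described behavior concrete, here is a small, self-contained sketch of the idea behind FLINK-19401 as summarized above: checkpoints already held in memory are reused, and only the missing ones are fetched from storage. All names (inMemory, retrieveFromStorage, the checkpoint IDs) are hypothetical stand-ins, not the actual DefaultCompletedCheckpointStore implementation.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColdStartOnlyFetchSketch {

    public static void main(String[] args) {
        // Checkpoint IDs that ZooKeeper reports as complete (hypothetical).
        List<Long> idsInZooKeeper = List.of(50L, 51L);

        // Checkpoints the JobManager already holds in memory (hypothetical).
        Map<Long, String> inMemory = new HashMap<>();
        inMemory.put(50L, "checkpoint-50 (cached)");

        List<String> recovered = new ArrayList<>();
        for (long id : idsInZooKeeper) {
            String cached = inMemory.get(id);
            // Only hit DFS on a "cold start", i.e. when the checkpoint is not cached.
            recovered.add(cached != null ? cached : retrieveFromStorage(id));
        }
        recovered.forEach(System.out::println);
    }

    // Stand-in for the expensive, blocking download of _metadata from DFS.
    private static String retrieveFromStorage(long checkpointId) {
        return "checkpoint-" + checkpointId + " (downloaded from DFS)";
    }
}
{code}

On a plain restart where the JobManager survives, the cache hit avoids the long blocking read that caused the heartbeat timeouts in the original report.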
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228632#comment-17228632 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

I believe that we won't solve this issue in the {{1.12.0}} release. Moving it to {{1.13.0}}. cc [~pnowojski].
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177671#comment-17177671 ]

Matthias commented on FLINK-16931:
-----------------------------------

{quote}
I would propose to move this issue to 1.12.0. Do you agree Piotr Nowojski?
{quote}
[~pnowojski]: Could you share your opinion on that?
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110320#comment-17110320 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

I would propose to move this issue to {{1.12.0}}. Do you agree [~pnowojski]?
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088282#comment-17088282 ]

Lu Niu commented on FLINK-16931:
--------------------------------

[~trohrmann] Thanks for the advice. Certainly there is much more that needs to be considered. [~pnowojski], if you think we could collaborate on this, please let me know the action plan. Thanks!
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087849#comment-17087849 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

Sorry for my late response. After looking at this issue, I believe that the fix is far from trivial and requires a very thorough design. There are actually two components which will be affected by this change:

1. Making the restore operation asynchronous in the {{CheckpointCoordinator}}
2. Enabling the scheduler to use an asynchronous state restore operation

The latter is strictly blocked on the {{CheckpointCoordinator}} work.

I think that making the restore operation work asynchronously requires handling other {{CheckpointCoordinator}} operations accordingly. For example, what happens to pending checkpoints which are concurrently completed while a restore operation happens? There is already a discussion about exactly this problem in FLINK-16770.

Once the {{CheckpointCoordinator}} can asynchronously retrieve the state to restore, it needs to be integrated into the new scheduler. Here the challenge is to handle concurrent scheduling operations properly. For example, while one waits for the state to restore, a concurrent failover operation could be triggered. How is this handled, and how is the potential scheduling conflict resolved?

It would be awesome if [~pnowojski] could take the lead on the {{CheckpointCoordinator}} changes since he was already involved in FLINK-13698. While working on the restore-state method, it would actually be a good opportunity to change the {{CheckpointCoordinator}} so that it returns a set of state handles instead of directly working on {{Executions}}.

Once this is done, I will help to apply the scheduler changes. Fortunately, the scheduler already consists of multiple asynchronous stages which need to resolve conflicts originating from concurrent operations. Hence, I hope that adding another asynchronous stage for the state restore might not be too difficult.
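As a purely illustrative aside on the concurrency question above (a failover arriving while an asynchronous restore is still in flight), the sketch below uses a simple attempt counter on the main thread to discard stale restore results. This is only one conceivable strategy under assumed names (IO_EXECUTOR, MAIN_THREAD, downloadCheckpoint); it is not the design chosen for Flink's scheduler.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StaleRestoreGuardSketch {

    private static final ExecutorService IO_EXECUTOR = Executors.newFixedThreadPool(2);
    private static final ExecutorService MAIN_THREAD = Executors.newSingleThreadExecutor();

    // Bumped on every (re)start attempt; only read and written on the main thread.
    private static long restoreAttempt = 0;

    public static void main(String[] args) throws Exception {
        startRestore();           // first restore kicks off an async download
        Thread.sleep(100);
        startRestore();           // a concurrent failover supersedes it
        Thread.sleep(3_000);
        IO_EXECUTOR.shutdown();
        MAIN_THREAD.shutdown();
    }

    private static void startRestore() {
        MAIN_THREAD.execute(() -> {
            final long attempt = ++restoreAttempt;
            CompletableFuture
                    .supplyAsync(StaleRestoreGuardSketch::downloadCheckpoint, IO_EXECUTOR)
                    .thenAcceptAsync(state -> {
                        if (attempt != restoreAttempt) {
                            // A newer failover happened in the meantime; drop the stale result.
                            System.out.println("Discarding stale restore #" + attempt);
                        } else {
                            System.out.println("Applying restore #" + attempt + ": " + state);
                        }
                    }, MAIN_THREAD);
        });
    }

    // Stand-in for the slow, blocking _metadata download.
    private static String downloadCheckpoint() {
        try {
            Thread.sleep(1_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "checkpoint state";
    }
}
{code}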
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087284#comment-17087284 ]

Lu Niu commented on FLINK-16931:
--------------------------------

Thanks! No hurry.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085579#comment-17085579 ]

Biao Liu commented on FLINK-16931:
----------------------------------

Hi [~qqibrow], thanks for opening the PR. I'll try to find some time next week to take a look. Too many things this week, sadly :(
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084463#comment-17084463 ]

Lu Niu commented on FLINK-16931:
--------------------------------

[~SleePy] [~pnowojski] Could you help review the design [https://github.com/apache/flink/pull/11762]? The overall goal is to make the execution in ExecutionGraph::restart asynchronous. The remaining piece is to make the execution in CheckpointCoordinator::restoreLatestCheckpointedState asynchronous.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17075023#comment-17075023 ]

Lu Niu commented on FLINK-16931:
--------------------------------

Thanks in advance! Will share the plan later.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074552#comment-17074552 ]

Piotr Nowojski commented on FLINK-16931:
----------------------------------------

Hi [~qqibrow] :) I'm assigning the ticket to you. If you have some design questions, feel free to reach out to [~SleePy] or me.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073927#comment-17073927 ]

Biao Liu commented on FLINK-16931:
----------------------------------

[~trohrmann], my pleasure :) I can share some context here.

We discussed this scenario while refactoring the whole threading model of the {{CheckpointCoordinator}}; see FLINK-13497 and FLINK-13698. Although this scenario is not the cause of FLINK-13497, we think there is a risk of heartbeat timeout. At that time, we decided to treat it as a follow-up issue. However, we haven't filed any ticket for it yet.

After FLINK-13698, most of the non-IO operations of the {{CheckpointCoordinator}} are executed in the main thread executor, except for the initialization part, which causes this problem. One of the final targets is to put all IO operations of the {{CheckpointCoordinator}} into the IO thread executor and execute everything else in the main thread executor. To achieve this, some synchronous operations must be refactored into an asynchronous form. I think that's what we need to do here.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073885#comment-17073885 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

[~pnowojski] do you think you can help [~qqibrow] fix this problem? Maybe [~SleePy] could help with guidance as well.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073879#comment-17073879 ]

Lu Niu commented on FLINK-16931:
--------------------------------

Hi [~trohrmann] and [~pnowojski], could you assign this to me so I can fix it? Please share more context if there is any. Thanks!

BTW, [~pnowojski], glad to see you again, from the Presto community to the Flink community :)

[~azagrebin], the large size is caused by a combination of high parallelism and state.backend.fs.memory-threshold.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073823#comment-17073823 ]

Piotr Nowojski commented on FLINK-16931:
----------------------------------------

We didn't plan any follow-ups, except for fixing the FLINK-13497 bug ([~SleePy] is currently looking into it) that triggered the whole refactoring.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073521#comment-17073521 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

I think this issue is actually related to FLINK-13698 and is a follow-up issue. [~pnowojski], what are the plans for finishing the follow-ups of FLINK-13698? Concretely, making the state restore non-blocking?
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073514#comment-17073514 ]

Till Rohrmann commented on FLINK-16931:
---------------------------------------

Thanks for reporting this issue, [~qqibrow]. You are right that we should not run blocking operations in the RPC endpoint's main thread. We should fix this.
[jira] [Commented] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart
[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073470#comment-17073470 ]

Andrey Zagrebin commented on FLINK-16931:
-----------------------------------------

Thanks for reporting this, [~qqibrow]. Have you checked what makes the _metadata file so big?