[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lincoln lee updated FLINK-31249:
    Fix Version/s:     (was: 1.19.0)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>             Fix For: 1.20.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager (JM) has received ACKs from all tasks, it finalizes the
> pending checkpoint into a completed checkpoint. Currently, the JM finalizes
> the pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} thread cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
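The contention described in this issue can be reproduced in miniature. The sketch below is a toy model, not Flink code: the class `CoordinatorLockDemo`, its methods, and the latch that stands in for a hung DFS write are all hypothetical names chosen for illustration. One thread plays `jobmanager-future` and "finalizes" a checkpoint, a second plays `Checkpoint Timer` and tries to handle a timeout under the same lock. The second variant (blocking I/O outside the lock) is included only as a contrast and is an assumption, not a statement of how the issue was or will be resolved.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Toy model of the lock contention described in the issue; all names are hypothetical. */
final class CoordinatorLockDemo {
    // Stand-in for the checkpoint coordinator lock.
    private final Object coordinatorLock = new Object();
    private final CountDownLatch finalizeStarted = new CountDownLatch(1);
    // Counted down by the harness to simulate the DFS becoming responsive again.
    private final CountDownLatch dfsRecovered = new CountDownLatch(1);
    private final CountDownLatch timeoutHandled = new CountDownLatch(1);

    /** Simulates finalizing a pending checkpoint with a write that hangs until the DFS recovers. */
    private void finalizeCheckpoint(boolean holdLockDuringIo) {
        if (holdLockDuringIo) {
            synchronized (coordinatorLock) {
                finalizeStarted.countDown();
                awaitQuietly(dfsRecovered); // blocking "DFS write" while the lock is held
            }
        } else {
            finalizeStarted.countDown();
            awaitQuietly(dfsRecovered); // blocking "DFS write" without the lock
            synchronized (coordinatorLock) {
                // Only a short bookkeeping step would run under the lock here.
            }
        }
    }

    /** Simulates the timeout event, which needs the coordinator lock. */
    private void onCheckpointTimeout() {
        synchronized (coordinatorLock) {
            timeoutHandled.countDown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the timeout handler ran while the finalize step was still stuck. */
    static boolean timeoutHandledWhileFinalizeStuck(boolean holdLockDuringIo) {
        CoordinatorLockDemo demo = new CoordinatorLockDemo();
        Thread future = new Thread(() -> demo.finalizeCheckpoint(holdLockDuringIo), "jobmanager-future");
        Thread timer = new Thread(demo::onCheckpointTimeout, "Checkpoint Timer");
        try {
            future.start();
            demo.finalizeStarted.await();   // the finalize step is now in progress
            timer.start();                  // the timeout fires while finalize is stuck
            boolean handled = demo.timeoutHandled.await(500, TimeUnit.MILLISECONDS);
            demo.dfsRecovered.countDown();  // let the "DFS" recover so both threads can exit
            future.join();
            timer.join();
            return handled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held during I/O, timeout handled: "
                + timeoutHandledWhileFinalizeStuck(true));
        System.out.println("I/O outside the lock, timeout handled: "
                + timeoutHandledWhileFinalizeStuck(false));
    }
}
```

With the lock held during the simulated write, the timer thread should stay parked on the monitor and the first call should report `false`; with the write moved outside the lock, the second call should report `true`. The 500 ms wait is an arbitrary bound for the demo and has no relation to any Flink setting.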
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lincoln lee updated FLINK-31249:
    Fix Version/s: 1.20.0

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>             Fix For: 1.19.0, 1.20.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Ge updated FLINK-31249:
    Fix Version/s: 1.19.0
                       (was: 1.18.0)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>             Fix For: 1.19.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

renxiang zhou updated FLINK-31249:
--
    Description:
When the JobManager (JM) has received ACKs from all tasks, it finalizes the pending checkpoint into a completed checkpoint. Currently, the JM finalizes the pending checkpoint while holding the checkpoint coordinator lock.

When a DFS failure occurs, the {{jobmanager-future}} thread may block while finalizing the pending checkpoint.

!image-2023-02-28-12-17-19-607.png|width=1010,height=244!

When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread waits for the lock to be released.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!

If the previous checkpoint times out, the {{Checkpoint Timer}} thread cannot execute the timeout event because it is blocked waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

  was:
A DFS failure may block the {{jobmanager-future}} thread while it writes metadata to DFS, and this thread holds the {{CheckpointCoordinator Lock}}.

When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits for the lock to be released. If the previous checkpoint times out, the {{Checkpoint Timer}} thread cannot execute the timeout event because it is blocked waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

renxiang zhou updated FLINK-31249:
--
    Attachment: image-2023-02-28-12-17-19-607.png

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

renxiang zhou updated FLINK-31249:
--
    Attachment: image-2023-02-28-12-04-35-178.png

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png,
> image-2023-02-28-12-04-35-178.png
[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
[ https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

renxiang zhou updated FLINK-31249:
--
    Summary: Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck  (was: Checkpoint timeout mechanism fails when completePendingCheckpoint is stuck)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
>                 Key: FLINK-31249
>                 URL: https://issues.apache.org/jira/browse/FLINK-31249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.6, 1.16.0
>            Reporter: renxiang zhou
>            Priority: Major
>             Fix For: 1.18.0
>
>         Attachments: image-2023-02-28-11-25-03-637.png