[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2024-03-11 Thread lincoln lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lincoln lee updated FLINK-31249:

Fix Version/s: (was: 1.19.0)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: Renxiang Zhou
> Priority: Major
> Fix For: 1.20.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the
> pending checkpoint into a completed checkpoint. Currently the JM finalizes
> the pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.
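
The blocking pattern described in the quoted issue can be reproduced in isolation. The following is a minimal sketch using only java.util.concurrent, not Flink's actual CheckpointCoordinator code: the class name CheckpointLockSketch, the method timeoutHandledWhileFinalizeStuck, the coordinatorLock object and the latch standing in for a hung DFS write are all hypothetical names chosen for illustration. Only the two thread names are taken from the report.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CheckpointLockSketch {

    /**
     * Returns whether the timeout callback managed to run while the
     * simulated finalize step was still stuck holding the lock.
     */
    static boolean timeoutHandledWhileFinalizeStuck(long timeoutMs, long observeMs) {
        // Hypothetical stand-in for the checkpoint coordinator lock.
        final Object coordinatorLock = new Object();
        final CountDownLatch lockAcquired = new CountDownLatch(1);
        // Released only when the simulated DFS write "recovers".
        final CountDownLatch dfsRecovered = new CountDownLatch(1);
        final AtomicBoolean timeoutHandled = new AtomicBoolean(false);

        // Plays the role of the jobmanager-future thread: it takes the lock
        // and then blocks inside the critical section.
        Thread finalizer = new Thread(() -> {
            synchronized (coordinatorLock) {
                lockAcquired.countDown();
                try {
                    dfsRecovered.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jobmanager-future");

        // Plays the role of the Checkpoint Timer thread.
        ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "Checkpoint Timer"));
        try {
            finalizer.start();
            lockAcquired.await();
            // The timeout handler needs the same lock, so it cannot make
            // progress until the finalizer leaves its synchronized block.
            timer.schedule(() -> {
                synchronized (coordinatorLock) {
                    timeoutHandled.set(true);
                }
            }, timeoutMs, TimeUnit.MILLISECONDS);
            Thread.sleep(observeMs);
            return timeoutHandled.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Let the simulated DFS write finish so both threads can exit.
            dfsRecovered.countDown();
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("timeout handled while finalize was stuck: "
                + timeoutHandledWhileFinalizeStuck(50, 300));
    }
}
```

In this sketch the method returns false even though the observation window is longer than the timeout, because the timer task is parked on the monitor for as long as the finalizer holds it. One possible direction, stated here only as an assumption and not as the fix adopted for this issue, would be to perform the blocking write outside the lock so that the timeout handler can still acquire it.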



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2024-03-11 Thread lincoln lee (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lincoln lee updated FLINK-31249:

Fix Version/s: 1.20.0

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: Renxiang Zhou
> Priority: Major
> Fix For: 1.19.0, 1.20.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the
> pending checkpoint into a completed checkpoint. Currently the JM finalizes
> the pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.





[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-10-13 Thread Jing Ge (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Ge updated FLINK-31249:

Fix Version/s: 1.19.0
   (was: 1.18.0)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: Renxiang Zhou
> Priority: Major
> Fix For: 1.19.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the
> pending checkpoint into a completed checkpoint. Currently the JM finalizes
> the pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.





[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-02-27 Thread renxiang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renxiang zhou updated FLINK-31249:
--
Description: 
When the JobManager has received ACKs from all tasks, it finalizes the pending
checkpoint into a completed checkpoint. Currently the JM finalizes the pending
checkpoint while holding the checkpoint coordinator lock.

When a DFS failure occurs, the {{jobmanager-future}} thread may block while
finalizing the pending checkpoint.

!image-2023-02-28-12-17-19-607.png|width=1010,height=244!

When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
waits for the lock to be released.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!

If the previous checkpoint times out, the {{Checkpoint Timer}} cannot execute
the timeout event because it is blocked waiting for the lock. As a result, the
previous checkpoint cannot be cancelled.

  was:
The {{jobmanager-future}} thread may block while writing metadata to DFS
because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by this
thread.

When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
for the lock to be released. If the previous checkpoint times out, the
{{Checkpoint Timer}} cannot execute the timeout event because it is blocked
waiting for the lock. As a result, the previous checkpoint cannot be cancelled.

!image-2023-02-28-11-25-03-637.png|width=1144,height=248!


> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: renxiang zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> When the JobManager has received ACKs from all tasks, it finalizes the
> pending checkpoint into a completed checkpoint. Currently the JM finalizes
> the pending checkpoint while holding the checkpoint coordinator lock.
> When a DFS failure occurs, the {{jobmanager-future}} thread may block while
> finalizing the pending checkpoint.
> !image-2023-02-28-12-17-19-607.png|width=1010,height=244!
> When the next checkpoint is then triggered, the {{Checkpoint Timer}} thread
> waits for the lock to be released.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!
> If the previous checkpoint times out, the {{Checkpoint Timer}} cannot
> execute the timeout event because it is blocked waiting for the lock. As a
> result, the previous checkpoint cannot be cancelled.





[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-02-27 Thread renxiang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renxiang zhou updated FLINK-31249:
--
Attachment: image-2023-02-28-12-17-19-607.png

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: renxiang zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png, image-2023-02-28-12-17-19-607.png
>
>
> The {{jobmanager-future}} thread may block while writing metadata to DFS
> because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by
> this thread.
> When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
> for the lock to be released. If the previous checkpoint times out, the
> {{Checkpoint Timer}} cannot execute the timeout event because it is blocked
> waiting for the lock. As a result, the previous checkpoint cannot be
> cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!





[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-02-27 Thread renxiang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renxiang zhou updated FLINK-31249:
--
Attachment: image-2023-02-28-12-04-35-178.png

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.6, 1.16.0
> Reporter: renxiang zhou
> Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png, 
> image-2023-02-28-12-04-35-178.png
>
>
> The {{jobmanager-future}} thread may block while writing metadata to DFS
> because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by
> this thread.
> When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
> for the lock to be released. If the previous checkpoint times out, the
> {{Checkpoint Timer}} cannot execute the timeout event because it is blocked
> waiting for the lock. As a result, the previous checkpoint cannot be
> cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!





[jira] [Updated] (FLINK-31249) Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck

2023-02-27 Thread renxiang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renxiang zhou updated FLINK-31249:
--
Summary: Checkpoint timeout mechanism fails when finalizeCheckpoint is 
stuck  (was: Checkpoint timeout mechanism fails when completePendingCheckpoint 
is stuck)

> Checkpoint timeout mechanism fails when finalizeCheckpoint is stuck
> ---
>
> Key: FLINK-31249
> URL: https://issues.apache.org/jira/browse/FLINK-31249
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.6, 1.16.0
>Reporter: renxiang zhou
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-02-28-11-25-03-637.png
>
>
> The {{jobmanager-future}} thread may block while writing metadata to DFS
> because of a DFS failure, and the {{CheckpointCoordinator Lock}} is held by
> this thread.
> When the next checkpoint is triggered, the {{Checkpoint Timer}} thread waits
> for the lock to be released. If the previous checkpoint times out, the
> {{Checkpoint Timer}} cannot execute the timeout event because it is blocked
> waiting for the lock. As a result, the previous checkpoint cannot be
> cancelled.
> !image-2023-02-28-11-25-03-637.png|width=1144,height=248!


