[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-03-04 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501262#comment-17501262
 ] 

fanrui commented on FLINK-26049:


[~pnowojski]  [~akalashnikov]  Thank you for your advice and review. :)

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0, 1.14.4
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png, image-2022-02-22-10-27-43-731.png, 
> image-2022-02-22-10-31-05-012.png
>
>
> When a checkpoint fails after triggerCheckpoint has succeeded, Flink applies the 
> tolerable-failed-checkpoints logic. But when triggerCheckpoint itself fails, 
> Flink does not apply it.
> h1. How to reproduce this issue?
> In our production environment, an HDFS SRE deleted the Flink base directory by 
> mistake, and the Flink job did not have permission to create the checkpoint 
> directory, so triggering checkpoints failed.
> Several things did not meet our expectations:
>  * The JM only logs _"Failed to trigger checkpoint for job 
> 6f09d4a15dad42b24d52c987f5471f18 since Trigger checkpoint failure"._ It does 
> not show the root cause or the exception.
>  * The user set tolerable-failed-checkpoints=0, but when triggerCheckpoint 
> fails, Flink does not apply the tolerable-failed-checkpoints logic.
>  * When triggerCheckpoint fails, numberOfFailedCheckpoints stays at 0.
>  * When triggerCheckpoint fails, the checkpoint does not appear on the 
> checkpoint history page.
>  
> !image-2022-02-09-18-08-17-868.png!
>  
> !image-2022-02-09-18-08-34-992.png!
> !image-2022-02-09-18-08-42-920.png!
>  
> h3. *All metrics looked normal, so we only found out the next day that 
> checkpoints had been failing for a whole day. This is not acceptable for 
> Flink users.*
> I have some ideas:
>  # Should the tolerable-failed-checkpoints logic be executed when 
> triggerCheckpoint fails?
>  # When triggerCheckpoint fails, should numberOfFailedCheckpoints be increased?
>  # When triggerCheckpoint fails, should the checkpoint be shown on the 
> checkpoint history page?
>  # The JM only shows "Failed to trigger checkpoint"; should we log the detailed 
> exception to make the root cause easy to find?
>  
> Could we make these changes? Please correct me if I'm wrong.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-03-03 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500892#comment-17500892
 ] 

Piotr Nowojski commented on FLINK-26049:


merged commit ffe353a into apache:master

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png, image-2022-02-22-10-27-43-731.png, 
> image-2022-02-22-10-31-05-012.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-23 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496559#comment-17496559
 ] 

fanrui commented on FLINK-26049:


Hi [~pnowojski] [~akalashnikov], I have updated the 
[PR|https://github.com/apache/flink/pull/18852]. Could you help review it when 
you have time, please? Thanks a lot.

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png, image-2022-02-22-10-27-43-731.png, 
> image-2022-02-22-10-31-05-012.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495982#comment-17495982
 ] 

Piotr Nowojski commented on FLINK-26049:


Ok, thanks for sharing the stack trace. Indeed we can treat this as a bug and I 
agree that this should have been checked against the failure counter just as 
other IOExceptions in the CheckpointCoordinator.

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png, image-2022-02-22-10-27-43-731.png, 
> image-2022-02-22-10-31-05-012.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-21 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495826#comment-17495826
 ] 

fanrui commented on FLINK-26049:


Hi [~pnowojski], here is the exception stack. We hit the HDFS permission problem 
in initializeLocationForCheckpoint, and 
org.apache.hadoop.security.AccessControlException is a subclass of IOException. 

This exception occurs before the PendingCheckpoint is created, so 
numberOfFailedCheckpoints cannot be increased. The community version of Flink 
also doesn't print the 
exception ([code|https://github.com/apache/flink/blob/ac3ad139fbad02b2de241d5eef7b1e3ce6007b82/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L934]); 
it only shows throwable.getMessage(): "An Exception occurred while 
triggering the checkpoint. IO-problem detected."
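
The gap between logging only throwable.getMessage() and logging the root cause can be sketched in plain Java. This is a minimal illustration and not Flink code: RootCauseLoggingSketch, TriggerFailureException and the "Permission denied" text are hypothetical stand-ins for the coordinator, its wrapper exception and the HDFS AccessControlException.

```java
import java.io.IOException;

class RootCauseLoggingSketch {

    // Walks the cause chain down to the innermost throwable.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // Simulates a trigger attempt that fails while creating the checkpoint directory.
    static void triggerCheckpoint() throws TriggerFailureException {
        try {
            // Hypothetical message standing in for the real HDFS permission error.
            throw new IOException("Permission denied: cannot create checkpoint directory");
        } catch (IOException e) {
            throw new TriggerFailureException(
                    "An Exception occurred while triggering the checkpoint. IO-problem detected.", e);
        }
    }

    public static void main(String[] args) {
        try {
            triggerCheckpoint();
        } catch (TriggerFailureException e) {
            // Message only: the wrapper text, with no hint of the permission problem.
            System.out.println("message only: " + e.getMessage());
            // Root cause: the underlying IOException an operator needs to see.
            Throwable root = rootCause(e);
            System.out.println("root cause: " + root.getClass().getSimpleName() + ": " + root.getMessage());
        }
    }
}

// Hypothetical wrapper exception; not Flink's actual class.
class TriggerFailureException extends Exception {
    TriggerFailureException(String message, Throwable cause) {
        super(message, cause);
    }
}
```

Passing the throwable itself to the logger, rather than only its message, would give the same information together with the full stack trace.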

 

 

!image-2022-02-22-10-27-43-731.png!

 

!image-2022-02-22-10-31-05-012.png!

 

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png, image-2022-02-22-10-27-43-731.png, 
> image-2022-02-22-10-31-05-012.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495566#comment-17495566
 ] 

Piotr Nowojski commented on FLINK-26049:


Could you [~fanrui] post an example stack trace of an exception that caused 
this problem?

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-21 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495464#comment-17495464
 ] 

fanrui commented on FLINK-26049:


Hi [~pnowojski], thanks for your reply.

I think checking IOExceptions should be enough. In our production environment, 
we hit an HDFS permission problem. The exception is a subclass of IOException, 
so checking IOException covers our case.

To summarize, this Jira may need to do three things:
1. Improve the log and display the root cause
2. Call initializeCheckpointLocation after creating the PendingCheckpoint
3. In onTriggerFailure, if checkpoint == null and 
CheckpointFailureReason == IO_EXCEPTION, increase the numberOfFailedCheckpoints 
metric

 

What do you think?
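
Item 3 above can be sketched as follows. This is only an illustration under stated assumptions, not the actual CheckpointCoordinator or its failure manager: TriggerFailureCounterSketch, FailureReason and the way the threshold is compared are hypothetical, and a failure that already has a pending checkpoint is assumed to be counted by the existing path.

```java
class TriggerFailureCounterSketch {
    private final int tolerableFailures;
    private int continuousFailures;
    private int numberOfFailedCheckpoints;

    TriggerFailureCounterSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // pendingCheckpoint is null when the trigger failed before the pending checkpoint was created.
    // Returns true when the tolerable number of failures is exceeded and the job should fail.
    boolean onTriggerFailure(Object pendingCheckpoint, FailureReason reason) {
        boolean failedBeforeCreation = pendingCheckpoint == null;
        // Assumption: before creation, only IO problems count against the threshold.
        if (failedBeforeCreation && reason != FailureReason.IO_EXCEPTION) {
            return false;
        }
        numberOfFailedCheckpoints++;
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    // A completed checkpoint resets the run of consecutive failures.
    void onCheckpointCompleted() {
        continuousFailures = 0;
    }

    int getNumberOfFailedCheckpoints() {
        return numberOfFailedCheckpoints;
    }

    public static void main(String[] args) {
        // tolerable-failed-checkpoints = 0, as in the reported scenario.
        TriggerFailureCounterSketch sketch = new TriggerFailureCounterSketch(0);
        boolean failJob = sketch.onTriggerFailure(null, FailureReason.IO_EXCEPTION);
        System.out.println("failed checkpoints = " + sketch.getNumberOfFailedCheckpoints());
        System.out.println("fail job = " + failJob);
    }
}

// Hypothetical reduced set of failure reasons.
enum FailureReason { IO_EXCEPTION, OTHER }
```

With a threshold of 0, the first IO failure during triggering is counted and reported as exceeding the limit, which is the behaviour the issue asks for.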

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495420#comment-17495420
 ] 

Piotr Nowojski commented on FLINK-26049:


Hi [~fanrui], I would suggest slowing down a bit here and thinking more about 
how we want to treat failures on the {{CheckpointCoordinator}}. Is this really 
a bug? So far we have only committed to checking IOExceptions on the 
CheckpointCoordinator against the tolerable failed checkpoints counter. We 
have never claimed that any other types of exceptions would be treated the 
same way.

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Assignee: fanrui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-18 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494897#comment-17494897
 ] 

fanrui commented on FLINK-26049:


Hi [~akalashnikov]. Actually, I'm already working on this Jira and am happy to 
do it. Could you assign it to me, please? Thanks a lot.

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-18 Thread Anton Kalashnikov (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494689#comment-17494689
 ] 

Anton Kalashnikov commented on FLINK-26049:

Hi [~fanrui], are you working on this task right now? I am ready to take it 
if you are not working on it yet. As for the question: it seems to be 
approximately the right place, but I need to look deeper, so I will answer a 
little later.

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-18 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494555#comment-17494555
 ] 

fanrui commented on FLINK-26049:


Hi [~akalashnikov], could we increase numberOfFailedCheckpoints 
[here|https://github.com/apache/flink/blob/ac3ad139fbad02b2de241d5eef7b1e3ce6007b82/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L938]?
 It looks like the same bug as FLINK-24344.

 

 

> The tolerable-failed-checkpoints logic is invalid when checkpoint trigger 
> failed
> 
>
> Key: FLINK-26049
> URL: https://issues.apache.org/jira/browse/FLINK-26049
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.13.5, 1.14.3
>Reporter: fanrui
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: image-2022-02-09-18-08-17-868.png, 
> image-2022-02-09-18-08-34-992.png, image-2022-02-09-18-08-42-920.png, 
> image-2022-02-18-11-28-53-337.png, image-2022-02-18-11-33-28-232.png, 
> image-2022-02-18-11-44-52-745.png
>
>


[jira] [Commented] (FLINK-26049) The tolerable-failed-checkpoints logic is invalid when checkpoint trigger failed

2022-02-17 Thread fanrui (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494353#comment-17494353
 ] 

fanrui commented on FLINK-26049:


I'm sorry, our prod env uses Flink 1.13. I see that some Jira issues have 
already resolved this problem:

https://issues.apache.org/jira/browse/FLINK-23189

https://issues.apache.org/jira/browse/FLINK-24344

After cherry-picking these commits, I think there are still some improvements 
we can make to facilitate troubleshooting. 
h2. 1. When initializeLocation fails, the JM doesn't show the root cause.

The JM only shows "An Exception occurred while triggering the checkpoint. 
IO-problem detected." and doesn't show the root cause. 

We should log the throwable instead of throwable.getMessage(). My code is shown 
in the screenshots below.

 

!image-2022-02-18-11-28-53-337.png|width=2790,height=248!

!image-2022-02-18-11-33-28-232.png|width=2115,height=574!
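
As a rough illustration of why logging throwable.getMessage() hides the root cause, here is a small standalone Java sketch. It is not Flink code: the class RootCauseLoggingSketch, its methods, and the "Permission denied" message are hypothetical and only mimic the shape of the failure described above. With SLF4J, passing the throwable as the last logger argument has a similar effect to describeWithThrowable.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical demo class; not part of Flink.
class RootCauseLoggingSketch {

    /** Mimics a log line built from throwable.getMessage(): only the outermost message survives. */
    static String describeWithMessageOnly(Throwable t) {
        return "Failed to trigger checkpoint: " + t.getMessage();
    }

    /** Mimics handing the throwable itself to the logger: stack trace and "Caused by" chain are kept. */
    static String describeWithThrowable(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return "Failed to trigger checkpoint: " + buffer;
    }

    /** Builds a wrapped exception; the root-cause message is an illustrative assumption. */
    static Throwable sampleFailure() {
        IOException rootCause =
                new IOException("Permission denied: cannot create checkpoint directory");
        return new RuntimeException(
                "An Exception occurred while triggering the checkpoint. IO-problem detected.",
                rootCause);
    }

    public static void main(String[] args) {
        Throwable failure = sampleFailure();
        // Outer message only; the IOException is not mentioned.
        System.out.println(describeWithMessageOnly(failure));
        // Full trace, including the "Caused by: java.io.IOException" section.
        System.out.println(describeWithThrowable(failure));
    }
}
```

The message-only variant drops the wrapped IOException entirely, which matches the symptom reported here: the log names an IO problem but not what actually went wrong.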
h2. 2. Can we call initializeLocation after creating the PendingCheckpoint?

Once the PendingCheckpoint has been created, if an exception occurs, we can see 
the checkpoint in the History page, and the numberOfFailedCheckpoints metric 
can be increased. 

Both are useful for troubleshooting and for monitoring the job, and 
initializeLocation isn't necessary for creating the PendingCheckpoint. So I 
think we can call initializeLocation after creating the PendingCheckpoint.

 

!image-2022-02-18-11-44-52-745.png!
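
The effect of the proposed reordering can be sketched with a simplified standalone model. None of the types below are Flink classes: TriggerOrderSketch, LocationInitializer, the history list, and exceedsTolerance are hypothetical stand-ins for the coordinator, the storage-location setup, the checkpoint history page, and the tolerable-failed-checkpoints check. The tolerance rule (fail once the count exceeds the configured value) is a simplifying assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types are Flink classes.
class TriggerOrderSketch {

    /** Stand-in for storage-location setup, which can fail (e.g. missing permissions). */
    interface LocationInitializer {
        void initializeLocation(long checkpointId) throws IOException;
    }

    final List<String> history = new ArrayList<>(); // stand-in for the checkpoint history page
    int numberOfFailedCheckpoints = 0;              // stand-in for the metric
    private long nextCheckpointId = 1;

    /** Location first: a failure here leaves no pending checkpoint, so nothing is recorded. */
    boolean triggerLocationFirst(LocationInitializer init) {
        long id = nextCheckpointId++;
        try {
            init.initializeLocation(id);
        } catch (IOException e) {
            return false; // no history entry, metric unchanged
        }
        history.add("checkpoint " + id + ": IN_PROGRESS");
        return true;
    }

    /** Pending checkpoint first: a location failure is recorded and counted. */
    boolean triggerPendingFirst(LocationInitializer init) {
        long id = nextCheckpointId++;
        int slot = history.size();
        history.add("checkpoint " + id + ": IN_PROGRESS");
        try {
            init.initializeLocation(id);
        } catch (IOException e) {
            history.set(slot, "checkpoint " + id + ": FAILED (" + e.getMessage() + ")");
            numberOfFailedCheckpoints++;
            return false;
        }
        return true;
    }

    /** Assumed tolerance rule: the job should fail once failures exceed the tolerated count. */
    boolean exceedsTolerance(int tolerableFailedCheckpoints) {
        return numberOfFailedCheckpoints > tolerableFailedCheckpoints;
    }

    public static void main(String[] args) {
        LocationInitializer failing = id -> {
            throw new IOException("Permission denied");
        };

        TriggerOrderSketch locationFirst = new TriggerOrderSketch();
        locationFirst.triggerLocationFirst(failing);
        System.out.println(locationFirst.history + " failed=" + locationFirst.numberOfFailedCheckpoints);

        TriggerOrderSketch pendingFirst = new TriggerOrderSketch();
        pendingFirst.triggerPendingFirst(failing);
        System.out.println(pendingFirst.history + " failed=" + pendingFirst.numberOfFailedCheckpoints);
    }
}
```

In this model, only the pending-first order leaves a FAILED entry and a non-zero counter, so a tolerance of 0 would be exceeded after the first failed trigger instead of the failure going unnoticed.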

 

Masters [~akalashnikov] [~pnowojski], what do you think?
