[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418602#comment-17418602 ]

zlzhang0122 commented on FLINK-23189:

[~pnowojski] OK, I've looked at the fix and found that it adds handling in onTriggerFailure for the case where the checkpoint is null. I had noticed this situation, but I could not reproduce it in our production environment, so I didn't change the code here. We may indeed need that fix for some corner cases.

> Count and fail the task when the disk is error on JobManager
>
>                 Key: FLINK-23189
>                 URL: https://issues.apache.org/jira/browse/FLINK-23189
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>   Affects Versions: 1.12.2, 1.13.1
>            Reporter: zlzhang0122
>            Assignee: zlzhang0122
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.14.0
>         Attachments: exception.txt
>
> When the JobManager disk has an error, triggerCheckpoint throws an
> IOException and fails. This causes a TRIGGER_CHECKPOINT_FAILURE, but that
> failure does not fail the job. Users can hardly notice the error unless they
> check the JobManager logs. To avoid this, I propose identifying these
> IOException cases and incrementing the failureCounter, so that the job
> eventually fails.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418458#comment-17418458 ]

Piotr Nowojski commented on FLINK-23189:

[~zlzhang0122], [~akalashnikov] has already prepared [a fix|https://github.com/apache/flink/pull/17331] for that bug.
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418420#comment-17418420 ]

zlzhang0122 commented on FLINK-23189:

[~pnowojski] Sure, I will look into it.
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417981#comment-17417981 ]

Piotr Nowojski commented on FLINK-23189:

The feature implemented in this ticket doesn't work as intended. Please check FLINK-24344.

CC [~zlzhang0122] [~akalashnikov]
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377391#comment-17377391 ]

zlzhang0122 commented on FLINK-23189:

Hi [~pnowojski], I'd like to give this a try. I have already done some work on this for our production environment.
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377186#comment-17377186 ]

Piotr Nowojski commented on FLINK-23189:

Thanks for the more detailed explanation. I think your request makes sense. Currently, failures of this kind are only logged in {{CheckpointCoordinator#onTriggerFailure()}}. Instead, they should be passed to {{CheckpointFailureManager}}, which should decide whether an error is just logged or is counted against the number of tolerable failures, possibly failing the job.

So as part of this ticket, I would expect someone to go through the current exceptions (including all occurrences of {{TRIGGER_CHECKPOINT_FAILURE}}) and decide which should be ignored/logged and which can cause a job failover, potentially splitting {{TRIGGER_CHECKPOINT_FAILURE}} into new failure reasons, and to implement this accordingly in {{CheckpointFailureManager}}. Additionally, it would be good to check whether other failure reasons are treated sensibly in {{CheckpointFailureManager}}.

I'm also afraid that this change would cause quite a few test instabilities, so it might turn out to be somewhat more difficult than it looks at first glance.

[~zlzhang0122], would you be willing to work on this issue? Or did you just want to propose an idea for us to pick up at some point in the future?
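The decision described in this comment (log only, or count against the tolerable failures and possibly fail the job) can be illustrated with a minimal sketch. It is not Flink's actual {{CheckpointFailureManager}}: the class {{SketchFailureManager}}, the enum {{SketchFailureReason}}, and the {{IO_EXCEPTION}} reason as a split-off of {{TRIGGER_CHECKPOINT_FAILURE}} are hypothetical names chosen for illustration, and the rule "fail once counted failures exceed the tolerable number, reset on success" is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical failure reasons; IO_EXCEPTION is an assumed split-off of TRIGGER_CHECKPOINT_FAILURE. */
enum SketchFailureReason {
    CHECKPOINT_DECLINED,
    TRIGGER_CHECKPOINT_FAILURE,
    IO_EXCEPTION
}

/** Simplified stand-in for a failure manager; NOT Flink's CheckpointFailureManager. */
final class SketchFailureManager {
    private final int tolerableFailures;
    private final Set<SketchFailureReason> countedReasons;
    private int continuousFailures;

    SketchFailureManager(int tolerableFailures, Set<SketchFailureReason> countedReasons) {
        this.tolerableFailures = tolerableFailures;
        // Defensive copy that also works for an empty input set.
        this.countedReasons = EnumSet.noneOf(SketchFailureReason.class);
        this.countedReasons.addAll(countedReasons);
    }

    /**
     * Handles one checkpoint failure.
     *
     * @return true if the job should be failed, false if the failure is only logged or tolerated
     */
    boolean onFailure(SketchFailureReason reason) {
        if (!countedReasons.contains(reason)) {
            return false; // ignored reason: the caller would only log it
        }
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the counter of continuous failures. */
    void onSuccess() {
        continuousFailures = 0;
    }
}
```

With one tolerable failure and only {{IO_EXCEPTION}} counted, the first I/O failure is tolerated and the second one reports that the job should fail, while any number of plain {{TRIGGER_CHECKPOINT_FAILURE}} events leaves the counter untouched.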
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376181#comment-17376181 ]

zlzhang0122 commented on FLINK-23189:

Sure, [~pnowojski]. I have attached a file that records the exception thrown in Flink 1.10. CheckpointCoordinator#triggerCheckpoint() calls startTriggeringCheckpoint(), which in turn calls initializeCheckpoint(), and that function may throw an IOException (see [link|https://github.com/zlzhang0122/flink/blob/9e1cc0ac2bbf0a2e8fcf00e6730a10893d651590/flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java#L83]). The IOException produces a CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE just like any other exception. I think an IOException here is caused by a disk error or another I/O problem that can hardly be recovered from, so maybe we should treat it more seriously and let users know about it sooner rather than just logging it.
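The call chain described here can be illustrated with a small, self-contained sketch. It does not reproduce Flink's code: {{SketchCheckpointStorage}}, {{SketchTrigger}}, and {{TriggerFailureReason}} are hypothetical names, and the separate {{IO_EXCEPTION}} reason only illustrates the proposal to tell I/O problems apart from other trigger failures.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical reasons; IO_EXCEPTION illustrates the proposed distinction. */
enum TriggerFailureReason {
    TRIGGER_CHECKPOINT_FAILURE,
    IO_EXCEPTION
}

/** Hypothetical stand-in for the storage call that may throw an IOException. */
interface SketchCheckpointStorage {
    void initializeLocation(long checkpointId) throws IOException;
}

final class SketchTrigger {
    private SketchTrigger() {}

    /**
     * Tries to initialize the checkpoint location.
     *
     * @param distinguishIo false mimics the behavior described in the comment (every exception
     *     becomes TRIGGER_CHECKPOINT_FAILURE); true mimics the proposal
     * @return empty on success, otherwise the failure reason
     */
    static Optional<TriggerFailureReason> trigger(
            SketchCheckpointStorage storage, long checkpointId, boolean distinguishIo) {
        try {
            storage.initializeLocation(checkpointId);
            return Optional.empty();
        } catch (IOException e) {
            // Disk or other I/O problem: hard to recover from, so optionally report it separately.
            return Optional.of(
                    distinguishIo
                            ? TriggerFailureReason.IO_EXCEPTION
                            : TriggerFailureReason.TRIGGER_CHECKPOINT_FAILURE);
        } catch (RuntimeException e) {
            return Optional.of(TriggerFailureReason.TRIGGER_CHECKPOINT_FAILURE);
        }
    }
}
```

A storage that throws, for example {{id -> { throw new IOException("disk error"); }}}, yields {{TRIGGER_CHECKPOINT_FAILURE}} when {{distinguishIo}} is false and {{IO_EXCEPTION}} when it is true, which is the distinction a failure manager could then count.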
[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager
[ https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374788#comment-17374788 ]

Piotr Nowojski commented on FLINK-23189:

Thank you for reporting the problem, [~zlzhang0122]. Could you share an example stack trace or log entry that you are referring to, and say which types of exceptions you propose to count against the maximum tolerable checkpoint failures? At first glance I cannot see where in {{CheckpointCoordinator#triggerCheckpoint()}} an {{IOException}} could be thrown.