[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-22 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418602#comment-17418602
 ] 

zlzhang0122 commented on FLINK-23189:

[~pnowojski] OK, I've looked at the fix and see that it adds handling in 
onTriggerFailure for the case where the checkpoint is null. I had noticed this 
situation but could not reproduce it in our production environment, so I didn't 
change the code here. We may indeed need that fix for some corner cases.

> Count and fail the task when the disk is error on JobManager
> 
>
> Key: FLINK-23189
> URL: https://issues.apache.org/jira/browse/FLINK-23189
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.1
> Reporter: zlzhang0122
> Assignee: zlzhang0122
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0
>
> Attachments: exception.txt
>
>
> When the JobManager disk fails, triggerCheckpoint throws an IOException and 
> fails. This causes a TRIGGER_CHECKPOINT_FAILURE, but that failure does not 
> fail the job. Users can hardly notice the error unless they look at the 
> JobManager logs. To avoid this, I propose that we identify these IOException 
> cases and increase the failureCounter, which can eventually fail the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418458#comment-17418458
 ] 

Piotr Nowojski commented on FLINK-23189:


[~zlzhang0122], [~akalashnikov] has already prepared [a 
fix|https://github.com/apache/flink/pull/17331] for that bug.



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-22 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418420#comment-17418420
 ] 

zlzhang0122 commented on FLINK-23189:

[~pnowojski] Sure, I will look into it.



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-21 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417981#comment-17417981
 ] 

Piotr Nowojski commented on FLINK-23189:


The feature implemented in this ticket doesn't work as intended. Please check 
FLINK-24344.

CC [~zlzhang0122] [~akalashnikov]



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-07-08 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377391#comment-17377391
 ] 

zlzhang0122 commented on FLINK-23189:

Hi [~pnowojski], I'd like to give this a try. I have already done some work on 
this for our production environment.



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-07-08 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377186#comment-17377186
 ] 

Piotr Nowojski commented on FLINK-23189:


Thanks for the more detailed explanation. I think your request makes sense. I 
think currently those kinds of failures are just logged in 
{{CheckpointCoordinator#onTriggerFailure()}}, while they should be passed to 
{{CheckpointFailureManager}}, which should decide whether the error should 
just be logged or counted against the number of tolerable failures, possibly 
failing the job.

So as a part of this ticket, I would expect someone to go through the current 
exceptions (including all occurrences of {{TRIGGER_CHECKPOINT_FAILURE}}) and 
decide which should be ignored/logged and which can cause job failover, 
potentially splitting {{TRIGGER_CHECKPOINT_FAILURE}} into new failure reasons 
and implementing this accordingly in the {{CheckpointFailureManager}}.
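
The counting idea could look roughly like the sketch below. It is only an 
illustration under assumptions: FailureManagerSketch, Reason, onFailure and 
onSuccess are hypothetical names, not the actual CheckpointFailureManager API, 
and which reasons are counted is a choice made up for the example.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not Flink's real CheckpointFailureManager.
class FailureManagerSketch {

    // Illustrative reasons; IO_FAILURE and OTHER are assumed names.
    enum Reason { TRIGGER_CHECKPOINT_FAILURE, IO_FAILURE, OTHER }

    private final int tolerableFailures;
    private final Set<Reason> countedReasons;
    private int continuousFailures = 0;

    FailureManagerSketch(int tolerableFailures, Set<Reason> countedReasons) {
        this.tolerableFailures = tolerableFailures;
        this.countedReasons = countedReasons;
    }

    // Returns true when the job should be failed, false when the failure is only logged or tolerated.
    boolean onFailure(Reason reason) {
        if (!countedReasons.contains(reason)) {
            System.out.println("ignored (log only): " + reason);
            return false;
        }
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    // A completed checkpoint resets the counter of continuous failures.
    void onSuccess() {
        continuousFailures = 0;
    }

    public static void main(String[] args) {
        // Assumption for the example: only IO failures are counted, two are tolerated.
        FailureManagerSketch manager =
                new FailureManagerSketch(2, EnumSet.of(Reason.IO_FAILURE));
        System.out.println(manager.onFailure(Reason.TRIGGER_CHECKPOINT_FAILURE));
        System.out.println(manager.onFailure(Reason.IO_FAILURE));
        System.out.println(manager.onFailure(Reason.IO_FAILURE));
        System.out.println(manager.onFailure(Reason.IO_FAILURE));
    }
}
```

Keeping the set of counted reasons as a constructor argument makes the 
"ignored/logged vs. can cause failover" decision explicit per reason.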

Additionally it would be good to check if other failure reasons are treated 
sensibly in the {{CheckpointFailureManager}}.

I'm also afraid that this change would cause quite a few test instabilities, 
so it might turn out to be somewhat more difficult than it looks at first 
glance.

[~zlzhang0122] would you be willing to work on this issue? Or did you just want 
to propose an idea for us to pick up at some point in the future?



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-07-06 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376181#comment-17376181
 ] 

zlzhang0122 commented on FLINK-23189:

Sure, [~pnowojski]. I have attached a file that records the exception thrown 
in Flink 1.10. CheckpointCoordinator#triggerCheckpoint() calls 
startTriggeringCheckpoint(), which in turn calls initializeCheckpoint(), and 
that function may throw an IOException (see 
[link|https://github.com/zlzhang0122/flink/blob/9e1cc0ac2bbf0a2e8fcf00e6730a10893d651590/flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java#L83]).
The IOException produces a 
CheckpointFailureReason.TRIGGER_CHECKPOINT_FAILURE just like any other 
exception. I think such an IOException is caused by a disk error or another IO 
problem that can hardly be recovered from, so maybe we should treat it more 
seriously and let users know about it sooner rather than just logging it.
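
To make the call chain concrete, here is a rough, self-contained sketch. It 
does not use Flink's classes: TriggerFailureSketch, the Reason enum, the 
IO_FAILURE constant and both classify methods are hypothetical names, and the 
wrapping in startTriggeringCheckpoint() is an assumption for illustration.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names are Flink's real API.
class TriggerFailureSketch {

    // Illustrative failure reasons. IO_FAILURE is an assumed new reason.
    enum Reason { TRIGGER_CHECKPOINT_FAILURE, IO_FAILURE }

    // Stand-in for initializeCheckpoint(): fails as if the JobManager disk were broken.
    static void initializeCheckpoint(boolean diskBroken) throws IOException {
        if (diskBroken) {
            throw new IOException("simulated disk error");
        }
    }

    // Stand-in for startTriggeringCheckpoint(): wraps whatever went wrong (assumed behaviour).
    static void startTriggeringCheckpoint(boolean diskBroken) {
        try {
            initializeCheckpoint(diskBroken);
        } catch (IOException e) {
            throw new RuntimeException("Failed to trigger checkpoint", e);
        }
    }

    // Behaviour described above: every exception ends up with the same generic reason.
    static Reason classifyToday(Throwable t) {
        return Reason.TRIGGER_CHECKPOINT_FAILURE;
    }

    // Proposed behaviour: walk the cause chain and single out IO problems.
    static Reason classifyProposed(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof IOException) {
                return Reason.IO_FAILURE;
            }
        }
        return Reason.TRIGGER_CHECKPOINT_FAILURE;
    }

    public static void main(String[] args) {
        try {
            startTriggeringCheckpoint(true);
        } catch (RuntimeException e) {
            System.out.println("today:    " + classifyToday(e));
            System.out.println("proposed: " + classifyProposed(e));
        }
    }
}
```

Walking the cause chain matters because the IOException may be wrapped by the 
time it reaches the place where the failure reason is chosen.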



[jira] [Commented] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-07-05 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374788#comment-17374788
 ] 

Piotr Nowojski commented on FLINK-23189:


Thank you for reporting the problem, [~zlzhang0122]. Could you share an 
example stack trace/log entry that you are referring to, and say which types 
of exceptions you propose to count against the max tolerable checkpoint 
failures counter? At first glance I cannot see from where in 
{{CheckpointCoordinator#triggerCheckpoint()}} an {{IOException}} can be thrown.
