[jira] [Comment Edited] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-22 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418602#comment-17418602
 ] 

zlzhang0122 edited comment on FLINK-23189 at 9/22/21, 1:52 PM:
---

[~pnowojski] OK, I've seen the fix. It adds handling in onTriggerFailure for 
the case where the checkpoint is null. I had noticed this situation but could 
not reproduce it in our production environment, so I didn't change the code 
here. We may indeed need this fix for this case.


was (Author: zlzhang0122):
[~pnowojski] OK, I've seen the fix. It adds handling in onTriggerFailure for 
the case where the checkpoint is null. I had found this situation but could 
not reproduce it in our production environment, so I didn't change the code 
here. We may indeed need that fix for some corner cases.

> Count and fail the task when the disk is error on JobManager
> 
>
> Key: FLINK-23189
> URL: https://issues.apache.org/jira/browse/FLINK-23189
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.2, 1.13.1
> Reporter: zlzhang0122
> Assignee: zlzhang0122
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.14.0
>
> Attachments: exception.txt
>
>
> When the JobManager disk fails, triggerCheckpoint throws an IOException and 
> fails. This causes a TRIGGER_CHECKPOINT_FAILURE, but that failure does not 
> fail the job. Users can hardly notice the error unless they check the 
> JobManager logs. To avoid this, I propose detecting these IOException cases 
> and incrementing the failureCounter, so that repeated failures eventually 
> fail the job.
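
The proposal in the description can be illustrated with a minimal sketch. This is not Flink's actual checkpoint failure handling: the class `TriggerFailureCounter`, its methods, and the `tolerableFailures` threshold are hypothetical names chosen for illustration. The sketch assumes that only IOException causes are counted, and that a successful checkpoint resets the counter.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not Flink's real failure manager. */
final class TriggerFailureCounter {
    // Assumed threshold: how many consecutive counted failures are tolerated.
    private final int tolerableFailures;
    private final AtomicInteger continuousFailures = new AtomicInteger();

    TriggerFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a checkpoint trigger failure.
     *
     * @return true if the job should now be failed
     */
    boolean onTriggerFailure(Throwable cause) {
        // Assumption: only I/O errors (for example a broken disk) are counted.
        if (!(cause instanceof IOException)) {
            return false;
        }
        return continuousFailures.incrementAndGet() > tolerableFailures;
    }

    /** A successful checkpoint resets the consecutive failure count. */
    void onCheckpointSuccess() {
        continuousFailures.set(0);
    }

    int failureCount() {
        return continuousFailures.get();
    }
}
```

With a threshold of 2, the third consecutive IOException would make `onTriggerFailure` return true, at which point the caller could fail the job instead of leaving the error visible only in the JobManager logs.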



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-23189) Count and fail the task when the disk is error on JobManager

2021-09-22 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418458#comment-17418458
 ] 

Piotr Nowojski edited comment on FLINK-23189 at 9/22/21, 7:28 AM:
--

Thanks [~zlzhang0122], it looks like [~akalashnikov] has already prepared 
[a fix|https://github.com/apache/flink/pull/17331] for that bug.


was (Author: pnowojski):
[~zlzhang0122], [~akalashnikov] has already prepared [a 
fix|https://github.com/apache/flink/pull/17331] for that bug
