[ 
https://issues.apache.org/jira/browse/FLINK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091481#comment-17091481
 ] 

Roman Khachatryan commented on FLINK-17351:
-------------------------------------------

As we discussed offline, the fix would be to increment the failure counter for 
`CHECKPOINT_EXPIRED` in `CheckpointFailureManager`. The other failure reasons 
should also be checked.

However, some cases are not even reported to `CheckpointFailureManager`: 
`TOO_MANY_CONCURRENT_CHECKPOINTS` and `MINIMUM_TIME_BETWEEN_CHECKPOINTS` (and 
possibly others).
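
To make the intended counting concrete, here is a minimal sketch (not the actual 
Flink code; the class, enum and callback below are simplified stand-ins for the 
real ones) of how `CHECKPOINT_EXPIRED` could be counted against the tolerable 
failure limit:

{code:java}
// Sketch only, not the Flink implementation: counts expired/declined checkpoints
// against a tolerable-failure limit and fails the job when it is exceeded.
import java.util.concurrent.atomic.AtomicInteger;

public class SimplifiedFailureCounter {

    // simplified stand-in for the real failure-reason enum
    enum FailureReason { CHECKPOINT_DECLINED, CHECKPOINT_EXPIRED, TOO_MANY_CONCURRENT_CHECKPOINTS }

    private final int tolerableFailureNumber;          // e.g. from setTolerableCheckpointFailureNumber(...)
    private final AtomicInteger continuousFailures = new AtomicInteger(0);
    private final Runnable failJob;                    // stand-in for the fail-job callback

    SimplifiedFailureCounter(int tolerableFailureNumber, Runnable failJob) {
        this.tolerableFailureNumber = tolerableFailureNumber;
        this.failJob = failJob;
    }

    void handleCheckpointFailure(FailureReason reason) {
        switch (reason) {
            case CHECKPOINT_DECLINED:
            case CHECKPOINT_EXPIRED:                   // the case this ticket is about
                if (continuousFailures.incrementAndGet() > tolerableFailureNumber) {
                    failJob.run();
                }
                break;
            default:
                // "pre-flight" reasons such as TOO_MANY_CONCURRENT_CHECKPOINTS would not
                // count here -- but they must at least be reported to this class at all
                break;
        }
    }

    void handleCheckpointSuccess() {
        continuousFailures.set(0);                     // a completed checkpoint resets the counter
    }
}
{code}

In this sketch a completed checkpoint resets the counter, so only consecutive 
failures count against the limit; the pre-flight reasons would additionally need 
to reach the failure manager at all, which is the second part of the problem 
above.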

> CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-17351
>                 URL: https://issues.apache.org/jira/browse/FLINK-17351
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.2, 1.10.0
>            Reporter: Piotr Nowojski
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> As described in point 2: 
> https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
> (copy of description from above linked comment):
> The logic in how {{CheckpointCoordinator}} handles checkpoint timeouts is 
> broken. In your examples, [~qinjunjerry], the job should have failed after the 
> first checkpoint failure, but the checkpoints were timing out on the 
> {{CheckpointCoordinator}} after 5 seconds, before {{FlinkKafkaProducer}} 
> detected the Kafka failure after 2 minutes. Those timeouts were not checked 
> against the {{setTolerableCheckpointFailureNumber(...)}} limit, so the job 
> kept going with many timed-out checkpoints. Then the funny part happens: 
> {{FlinkKafkaProducer}} detects the Kafka failure, and what happens next 
> depends on where the failure was detected:
> a) while processing a record? No problem, the job fails over immediately 
> once the failure is detected (in this example, after 2 minutes).
> b) during a checkpoint? The failure is reported to {{CheckpointCoordinator}} 
> *and gets ignored, because the PendingCheckpoint was already discarded 2 
> minutes earlier* :) So in theory the checkpoints can keep failing forever and 
> the job will not restart automatically unless something else fails.
> Even more interesting things can happen if we mix FLINK-17350 or b) with an 
> intermittent external system failure. The sink reports an exception, the 
> transaction is lost/aborted, and the sink is in a failed state; but if, by 
> happy coincidence, it still manages to accept further records, that exception 
> can be lost and all of the records in those failed checkpoints will be lost 
> forever as well. In the examples that [~qinjunjerry] posted this has not 
> happened: {{FlinkKafkaProducer}} was not able to recover after the initial 
> failure and kept throwing exceptions until the job finally failed (but much 
> later than it should have). That outcome, however, is not guaranteed anywhere.
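
For reference, the limit discussed above is the one users set through 
{{CheckpointConfig}}. A minimal, illustrative job configuration (the values are 
examples only, not a recommended setup) that has the shape of the reported 
scenario:

{code:java}
// Illustrative configuration only: this is the limit that expired checkpoints
// are currently not checked against.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10_000L);                                  // trigger a checkpoint every 10 s
        env.getCheckpointConfig().setCheckpointTimeout(5_000L);            // expire (time out) after 5 s
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1);  // should fail the job after 1 failure

        // With the bug described above, checkpoints that expire on the
        // CheckpointCoordinator do not count towards the tolerable failure
        // number, so the job keeps running despite repeated timeouts.

        env.fromElements(1, 2, 3).print();   // placeholder pipeline
        env.execute("checkpoint-timeout-example");
    }
}
{code}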



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
