[jira] [Created] (FLINK-12858) Potentially not properly working Flink job in case of stop-with-savepoint failure

Alex (JIRA) Sat, 15 Jun 2019 05:03:49 -0700

Alex created FLINK-12858:
----------------------------

             Summary: Potentially not properly working Flink job in case of 
stop-with-savepoint failure
                 Key: FLINK-12858
                 URL: https://issues.apache.org/jira/browse/FLINK-12858
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
            Reporter: Alex



The current implementation of stop-with-savepoint (FLINK-11458) blocks the 
thread that executes {{StreamTask.performCheckpoint()}} on 
{{syncSavepointLatch}}. For non-source tasks, this thread is expected to be 
the task's main thread (stop-with-savepoint deliberately stops any activity 
in the task's main thread).

Unlocking happens either when the task is cancelled or when the corresponding 
checkpoint is acknowledged.

It is possible that other downstream tasks of the same Flink job "soft" fail 
the checkpoint/savepoint for various reasons (for example, because the max 
buffered bytes limit is exceeded in {{BarrierBuffer.checkSizeLimit()}}). In 
that case, the JM is notified that the checkpoint was aborted. However, it 
looks like the checkpoint coordinator handles such an abort as usual and 
assumes that the Flink job continues running, while the tasks already blocked 
on {{syncSavepointLatch}} would presumably stay blocked.
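
The suspected behaviour can be illustrated with a small stand-alone sketch. 
This is not Flink code: the class SavepointLatchSketch, the Event enum and the 
method taskReleasedAfter are hypothetical names, and a plain 
java.util.concurrent.CountDownLatch stands in for {{syncSavepointLatch}}. The 
sketch assumes, as described above, that only cancellation and acknowledgement 
release the latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the reported scenario; not Flink's actual classes. */
public class SavepointLatchSketch {

    /** Events that may reach a task after it has blocked on the latch. */
    public enum Event { ACKNOWLEDGED, CANCELLED, ABORTED_ELSEWHERE }

    /**
     * Parks a simulated task thread on a latch, delivers one event, and
     * reports whether the thread was released within the timeout.
     */
    public static boolean taskReleasedAfter(Event event, long timeoutMillis) {
        // Stand-in for syncSavepointLatch (assumption: a one-shot latch).
        CountDownLatch syncSavepointLatch = new CountDownLatch(1);
        // Signals that the simulated task thread got past the latch.
        CountDownLatch released = new CountDownLatch(1);

        Thread taskThread = new Thread(() -> {
            try {
                syncSavepointLatch.await(); // the task's main thread waits here
                released.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        taskThread.setDaemon(true);
        taskThread.start();

        switch (event) {
            case ACKNOWLEDGED:
            case CANCELLED:
                // The two release paths named in the report.
                syncSavepointLatch.countDown();
                break;
            case ABORTED_ELSEWHERE:
                // Assumption: an abort caused by another task has no release path.
                break;
        }

        boolean wasReleased;
        try {
            wasReleased = released.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            wasReleased = false;
        }
        taskThread.interrupt(); // clean up the simulated thread in any case
        return wasReleased;
    }

    public static void main(String[] args) {
        for (Event e : Event.values()) {
            System.out.println(e + " -> released=" + taskReleasedAfter(e, 500));
        }
    }
}
```

Under these assumptions, only the ABORTED_ELSEWHERE case leaves the simulated 
task thread waiting, which matches the situation where the coordinator treats 
the job as running although some tasks no longer make progress.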



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
