[ 
https://issues.apache.org/jira/browse/FLINK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen updated FLINK-16986:
---------------------------------
    Fix Version/s: 1.11.0

> Enhance the OperatorEvent handling guarantee during checkpointing.
> ------------------------------------------------------------------
>
>                 Key: FLINK-16986
>                 URL: https://issues.apache.org/jira/browse/FLINK-16986
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Jiangjie Qin
>            Assignee: Jiangjie Qin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> When the {{CheckpointCoordinator}} takes a checkpoint, the checkpointing 
> order is following:
>  # {{CheckpointCoordinator}} triggers checkpoint on each 
> {{OperatorCoordinator}}
>  # Each {{OperatorCoordinator}} takes a snapshot.
>  # Right after taking the snapshot, the {{CheckpointCoordinator}} sends a 
> {{CHECKPOINT_FIN}} marker through the {{OperatorContext}}.
>  # Once the {{OperatorContext}} sees {{CHECKPOINT_FIN}} marker, it will wait 
> for all the previous events are acked and suspend the event gateway to the 
> operators by buffering the future {{OperatorEvents}} sent from the 
> {{OperatorCoordinator}} to the operators without actually sending them out.
>  # The {{CheckpointCoordinator}} waits until all the {{OperatorCoordinator}}s 
> finish step 2-4 and then triggers the task snapshots.
>  # The suspension of an event gateway to an operator can be lifted after all 
> the subtasks of that operator has finished their task checkpoint.
> The mechanism above guarantees all the {{OperatorEvents}} sent before taking 
> the operator coordinator snapshot are handled by the operator before the task 
> snapshots are taken.
> An operator can use this mechanism to know whether an {{OperatorEvent}} it 
> sent to the coordinator is included in the upcoming checkpoint or not. What 
> it has to do is to ask the operator coordinator to ACK that OperatorEvent. If 
> the ACK is received before the operator takes the next snapshot, that 
> OperatorEvent must have been handled and checkpointed by the 
> OperatorCoordinator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to