[jira] [Created] (FLINK-28639) Preserve distributed consistency of OperatorEvents from subtasks to OperatorCoordinator

Yunfeng Zhou (Jira) Thu, 21 Jul 2022 22:58:09 -0700

Yunfeng Zhou created FLINK-28639:
------------------------------------

             Summary: Preserve distributed consistency of OperatorEvents from 
subtasks to OperatorCoordinator
                 Key: FLINK-28639
                 URL: https://issues.apache.org/jira/browse/FLINK-28639
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.3
            Reporter: Yunfeng Zhou
             Fix For: 1.16.0



This is the second step to solving the consistency issue of OC communications. 
In this step, we would also guarantee the consistency of operator events sent 
from subtasks to OCs. Combined with the other subtask to preserve the 
consistency of communications in the reverse direction, all communications 
between OC and subtasks would be consistent across checkpoints and global 
failovers.
To achieve the goal of this step, we need to add closing/reopening functions to 
the subtasks' gateways and make the subtasks aware of a checkpoint before they 
receive the checkpoint barriers. The general process would be as follows.
1. When the OC starts checkpoint, it notifies all subtasks about this 
information.
2. After being notified about the ongoing checkpoint in OC, a subtask sends a 
special operator event to its OC, which is the last operator event the OC could 
receive from the subtask before the subtask completes the checkpoint. Then the 
subtask closes its gateway.
3. After receiving this special event from all subtasks, the OC finishes its 
checkpoint and closes its gateway. Then the checkpoint coordinator sends 
checkpoint barriers to the sources.
4. If the subtask or the OC generate any event to send to each other, they 
buffer the events locally.
5. When a subtask starts checkpointing, it also stores the buffered events in 
the checkpoint.
6. After the subtask completes the checkpoint, communications in both 
directions are recovered and the buffered events are sent out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-28639) Preserve distributed consistency of OperatorEvents from subtasks to OperatorCoordinator

Reply via email to