The GitHub Actions job "Nightly (beta)" on flink.git has failed.
Run started by GitHub user github-actions[bot] (triggered by 
github-actions[bot]).

Head commit for run:
8dc212c32036c3a4afd3af2e50e95cb87c8cee23 / Arvid Heise 
<ahe...@users.noreply.github.com>
[FLINK-36368] Do not prematurely merge CommittableManager (#25405)

When a sink contains a shuffle between writer and committer, a committer may 
receive committables coming from multiple subtasks. So far, we immediately 
merged them on receiving. However, that makes it later impossible to trace 
whether we received all messages from an upstream task.

It also made rescaling of the committer awkward: during normal processing all 
committables of a committer have the same subtaskId as the committer. On 
downscale, these subtaskIds suddenly don't match and need to be replaced, which 
we solved by merging the SubtaskCommittableManagers.

This commit decouples the collection of committables from changing the 
subtaskId for emission. Committables retain the upstream subtask id in the 
CommittableCollection, which survives serialization and deserialization. Only 
upon emission, we substitute the subtask id with the one of the emitting 
committer.

This is, in particular, useful for a global committer, where all subtasks are 
collected. As a side fix, the new serialization also contains the 
numberOfSubtasks such that different checkpoints may have different degree of 
parallelism.

The old approach probably has edge cases where scaling a UC would result in 
stalled pipelines because certain metadata doesn't match. This would not affect 
pipelines which  chain Writer/Committer (no channel state), Writer and 
Committer have same DOP (results in a Forward channel, which doesn't use UC for 
exactly these reasons), and a non-keyed shuffles (because they don't provide 
any guarantees). Since a keyed shuffle must use the subtask id of the 
committables, the new approach should be safe. However, since we disabled UC 
entirely for sinks to adhere to the contract of notifyCheckpointComplete, this 
shouldn't matter going forward. It's still important to consider these cases 
though for restoring from Flink 1 checkpoints.

Report URL: https://github.com/apache/flink/actions/runs/11154777381

With regards,
GitHub Actions via GitBox

Reply via email to