[jira] [Commented] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

Roman Khachatryan (Jira) Mon, 28 Dec 2020 05:02:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255577#comment-17255577
 ]


Roman Khachatryan commented on FLINK-20654:
-------------------------------------------

The staus so far:

Even with high number of allowed restarts, there are still failures present. 
Which means that the issue is with persisting state, rather than reading it.

The issue doesn't manifest itself with legacy sources. Which means:
1. The issue is likely in MultipleInput task code or is related to different 
rates in such tasks
2. "git bisect" doesn't make sense (and bisecting until API changes didn't help)

Flattening the topology from
{code}
(src0, src1) -> map1
    (map1, src2) -> map2
        (map2, src3) -> map3
{code}
to
{code}
(src0, src1) -> map1
(src2, src3) -> map2
(map1, map2) -> map3
{code}
so that each task combines inputs of only a single type significantly reduces 
failure rate (which also confirms (1)):

The following ruled out:
* StreamTaskSourceInput - wrong (fake) gate/channels ids - removing didn't help
* Custom partitioner - replacing e.g. with keyby doesn't help
* Thread safety of SingleCheckpointBarrierHandler.allBarriersReceivedFuture - 
checked with synchronized, didn't help
* Gate alignment (EndOfChannelStateEvent) - doesn't fail with gate alignment 
always enabled
* Gate index stability - inspected the code, those shouldn't change

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---------------------------------------------------------------
>
>                 Key: FLINK-20654
>                 URL: https://issues.apache.org/jira/browse/FLINK-20654
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0
>            Reporter: Arvid Heise
>            Assignee: Roman Khachatryan
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.13.0, 1.12.1
>
>
> Fix of FLINK-20433 shows potential corruption after recovery for all 
> variations of UnalignedCheckpointITCase.
> To reproduce, run UCITCase a couple hundreds times. The issue showed for me 
> in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> with decreasing frequency.
> The issue manifests as one of the following issues:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because the number that should be 
> [0;100000] has been mis-deserialized)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-20654) Unaligned checkpoint recovery may lead to corrupted data stream

Reply via email to