[
https://issues.apache.org/jira/browse/FLINK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052131#comment-18052131
]
Tim Van Laer commented on FLINK-37417:
--------------------------------------
[~hong] My Flink cluster is hitting this issue once in a while. I do understand
why KinesisSource doesn't support partial recovery, but I don't understand how
I can change my code to circumvent this issue. Is this documented somewhere?
My flow puts the keyBy after 2 filters, but I suspect that's not what you mean
with "_NOT have a keyBy immediately after the KDS source_" So my question is:
what should I put between my KDS and keyBy to avoid this issue? (The partial
failure does indeed happen on the operator before the first keyBy, namely
`source -> Filter -> Filter`)
> Implement better exception message to explain why KinesisSource doesn't
> support partial recovery
> ------------------------------------------------------------------------------------------------
>
> Key: FLINK-37417
> URL: https://issues.apache.org/jira/browse/FLINK-37417
> Project: Flink
> Issue Type: Improvement
> Reporter: Hong Liang Teoh
> Priority: Major
>
> The KinesisSource doesn't currently support partial recovery because we want
> to prevent duplicate records being read into KinesisSource when restart +
> partial failure happens.
>
> Partial recovery is when Flink restarts a subset of subtasks to minimize
> downtime during a job failure. Flink determines which subtasks need to be
> restarted by checking all connected subtasks to the failed subtasks.
> However, this algorithm doesn't work for a KDS source, because the
> KinesisSource:
> # Has parent-child shard ordering within the KDS stream.
> # Does not ensure that all child shards are assigned to the same subtask (to
> prevent skew)
> These two mean that when we restart "selected" subtasks, we cannot ensure
> that the child shards have not been assigned to other subtasks "not
> connected" to the selected subtask.
>
> This is very much an edge case, where users will have to be doing the
> following:
> # NOT have a keyBy immediately after the KDS source.
> # Partial failure happens on an operator BEFORE the first keyBy in the job
> graph.
>
> This JIRA is to enrich the Exception message thrown
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)