[ 
https://issues.apache.org/jira/browse/FLINK-37417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052131#comment-18052131
 ] 

Tim Van Laer edited comment on FLINK-37417 at 1/15/26 3:09 PM:
---------------------------------------------------------------

[~hong] My Flink cluster is hitting this issue once in a while. I do understand 
why KinesisSource doesn't support partial recovery, but I don't understand how 
I can change my code to circumvent this issue. Is this documented somewhere?

My flow puts the keyBy after 2 filters, but I suspect that's not what you mean 
with "_NOT have a keyBy immediately after the KDS source_" So my question is: 
what should I put between my KDS and keyBy to avoid this issue? (The partial 
failure does indeed happen on the operator before the first keyBy, namely 
`source -> Filter -> Filter`)

Thanks for the good work!


was (Author: timvanlaer):
[~hong] My Flink cluster is hitting this issue once in a while. I do understand 
why KinesisSource doesn't support partial recovery, but I don't understand how 
I can change my code to circumvent this issue. Is this documented somewhere?

My flow puts the keyBy after 2 filters, but I suspect that's not what you mean 
with "_NOT have a keyBy immediately after the KDS source_" So my question is: 
what should I put between my KDS and keyBy to avoid this issue? (The partial 
failure does indeed happen on the operator before the first keyBy, namely 
`source -> Filter -> Filter`)

> Implement better exception message to explain why KinesisSource doesn't 
> support partial recovery
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37417
>                 URL: https://issues.apache.org/jira/browse/FLINK-37417
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Hong Liang Teoh
>            Priority: Major
>
> The KinesisSource doesn't currently support partial recovery because we want 
> to prevent duplicate records being read into KinesisSource when restart + 
> partial failure happens.
>  
> Partial recovery is when Flink restarts a subset of subtasks to minimize 
> downtime during a job failure. Flink determines which subtasks need to be 
> restarted by checking all connected subtasks to the failed subtasks. 
> However, this algorithm doesn't work for a KDS source, because the 
> KinesisSource:
>  # Has parent-child shard ordering within the KDS stream.
>  # Does not ensure that all child shards are assigned to the same subtask (to 
> prevent skew)
> These two mean that when we restart "selected" subtasks, we cannot ensure 
> that the child shards have not been assigned to other subtasks "not 
> connected" to the selected subtask.
>  
> This is very much an edge case, where users will have to be doing the 
> following:
>  # NOT have a keyBy immediately after the KDS source.
>  # Partial failure happens on an operator BEFORE the first keyBy in the job 
> graph.
>  
> This JIRA is to enrich the Exception message thrown
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to