[
https://issues.apache.org/jira/browse/FLINK-38345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18019760#comment-18019760
]
Rion Williams edited comment on FLINK-38345 at 9/12/25 4:33 AM:
----------------------------------------------------------------
I just came across this issue while perusing the Apache Flink JIRA and I'd like
to take a stab at it, if only to dig a bit deeper into some of the
internals.
I've been able to pull down the associated test that [~mcuento] initially
reported for the issue and made some progress on detecting recovery scenarios
and resolving the previous source and enumerator to ensure the test passed as
expected. I'll cobble together a PR shortly for review, hopefully by those more
familiar with this area of the codebase.
(Update: I've issued [https://github.com/apache/flink/pull/26983] as a proposed
solution and would welcome feedback from those that know better)
> HybridSource checkpoint recovery results in NPE if addSource depends on
> previousEnumerator from SourceSwitchContext
> -------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-38345
> URL: https://issues.apache.org/jira/browse/FLINK-38345
> Project: Flink
> Issue Type: Bug
> Components: Connectors / HybridSource
> Affects Versions: 2.0.0, 1.19.1
> Reporter: Matt Cuento
> Priority: Minor
> Labels: pull-request-available
>
> {{org.apache.flink.connector.base.source.hybrid.HybridSource}} upon recovery
> from a checkpoint calls {{restoreEnumerator()}} with the configured source
> factories and checkpoint information.
> The enumerator
> {{org.apache.flink.connector.base.source.hybrid.HybridSourceSplitEnumerator}}
> is then started, which invokes {{switchEnumerator()}}.
> The {{currentEnumerator}} is null on recovery, so as part of the source
> switching logic {{previousEnumerator}} becomes null as well.
> For hybrid source implementations that require the use of the
> {{org.apache.flink.connector.base.source.hybrid.HybridSource.SourceSwitchContext}}
> to derive a start position at switch time, we end up in a situation where
> {{getPreviousEnumerator()}} returns null, meaning we're unable to derive a
> start position on checkpoint recovery.
> Enumerator state is later deserialized and recovered, but that step requires
> a source in order to obtain the serializer. As long as a source depends on a
> prior enumerator to derive its start position, this bug will occur.
>
> This bug can be reproduced via a unit test in the enumerator [1].
>
> [1] - [https://github.com/apache/flink/pull/26975]
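
For readers unfamiliar with the switching logic, the failure mode described above can be sketched in plain Java without any Flink dependency. This is only an illustration under stated assumptions: {{FileEnumerator}}, {{SwitchContext}}, {{deriveStartPosition}} and the class name are hypothetical stand-ins, not the actual HybridSource classes, and the real {{switchEnumerator()}} does considerably more than shown here.

```java
public class HybridSwitchSketch {

    /** Hypothetical stand-in for an enumerator that knows where the next source should start. */
    public static final class FileEnumerator {
        private final long endTimestamp;

        public FileEnumerator(long endTimestamp) {
            this.endTimestamp = endTimestamp;
        }

        public long getEndTimestamp() {
            return endTimestamp;
        }
    }

    /** Hypothetical stand-in for SourceSwitchContext: it only exposes the previous enumerator. */
    public static final class SwitchContext {
        private final FileEnumerator previousEnumerator;

        public SwitchContext(FileEnumerator previousEnumerator) {
            this.previousEnumerator = previousEnumerator;
        }

        public FileEnumerator getPreviousEnumerator() {
            return previousEnumerator;
        }
    }

    /** Plays the role of a source factory that derives its start position at switch time. */
    public static long deriveStartPosition(SwitchContext context) {
        // Dereferences the previous enumerator without a null check.
        return context.getPreviousEnumerator().getEndTimestamp();
    }

    /** Simplified switch step: whatever enumerator is current becomes the "previous" one. */
    public static long switchEnumerator(FileEnumerator currentEnumerator) {
        FileEnumerator previousEnumerator = currentEnumerator;
        return deriveStartPosition(new SwitchContext(previousEnumerator));
    }

    public static void main(String[] args) {
        // Normal switch: the first source's enumerator is still available.
        System.out.println(switchEnumerator(new FileEnumerator(1000L))); // prints 1000

        // Recovery: no current enumerator exists yet, so the previous one is null.
        try {
            switchEnumerator(null);
        } catch (NullPointerException e) {
            System.out.println("NPE on recovery"); // prints "NPE on recovery"
        }
    }
}
```

The sketch only shows why a factory that relies on the previous enumerator fails when nothing has been restored before the switch; it does not model checkpoint state or the fix proposed in the linked pull request.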
--
This message was sent by Atlassian Jira
(v8.20.10#820010)