yzeng1618 commented on issue #10193:
URL: https://github.com/apache/seatunnel/issues/10193#issuecomment-3659158041
Based on the information provided above, we have roughly determined the
position for now.
- Root cause:The error comes from
FlinkSourceSplitEnumeratorContext#getJobIdForV15, which uses reflection to dig
from SplitEnumeratorContext into Flink’s internal SchedulerBase to read the
JobID, relying on internal field names like operatorCoordinatorContext,
globalFailureHandler, and arg$1. Under Flink 1.17/1.18 with
RecreateOnResetOperatorCoordinator, these internals/initialization order
changed, so an intermediate object becomes null, causing an NPE wrapped as
IllegalStateException("Initialize flink job-id failed") and logged as a WARN,
leaving the jobId as null.
- When is it triggered?
First run is fine; on rolling upgrade / restore from checkpoint/savepoint
(RecreateOnResetOperatorCoordinator.resetToCheckpoint), a new
FlinkSourceSplitEnumeratorContext is created, and the reflection chain fails to
obtain a valid internal object in this restore path, triggering the WARN.
- Possible fixes for discussion
1. Keep the reflection-based approach but make every step null-safe (no
.get() or throwing IllegalStateException); if anything is missing, simply
return null, logging a single concise WARN stating that only event jobId is
affected, not checkpoint/savepoint restore.
2. In the extraction logic, detect whether OperatorCoordinator.Context class
name contains RecreateOnResetOperatorCoordinator / QuiesceableContext etc.; if
it is a restore/upgrade context, immediately return null and skip deeper
reflection, while keeping the normal path unchanged. This fully avoids NPEs on
upgrade/restore.
@
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]