nsivabalan opened a new pull request, #18889: URL: https://github.com/apache/hudi/pull/18889
### Change Logs

When `HoodieStreamer` ingests with `hoodie.write.table.version=6`, the persisted commit metadata was carrying the V2 checkpoint key `streamer.checkpoint.key.v2` instead of the V1 key `deltastreamer.checkpoint.key`. `CheckpointUtils.shouldTargetCheckpointV2` returns false for `writeTableVersion < 8`, so this violated the contract.

**Root cause:** `DFSPathSelector.getNextFilePathsAndMaxModificationTime` (lines 154 & 160) and many other source helpers unconditionally construct `new StreamerCheckpointV2(...)`. `*DFSSource` implementations pipe that V2 checkpoint into `InputBatch#getCheckpointForNextBatch`, and `StreamSync.extractCheckpointMetadata` trusted the type blindly. The very next code path two lines below (the no-batch fallback) already calls `buildCheckpointFromGeneralSource` and gets the version contract right, so this was a missed branch in an existing chokepoint, not a missing primitive.

**Fix:** type-based normalization at the chokepoint in `StreamSync.extractCheckpointMetadata`. When the source returns a checkpoint, coerce its concrete type to whatever `shouldTargetCheckpointV2(versionCode, sourceClassName)` permits before calling `getCheckpointCommitMetadata`. Mirrors the inbound translation already done in `Source.translateCheckpoint`.

**Coverage audit:** all 17 `new StreamerCheckpointV2(...)` construction sites in `hudi-utilities/src/main` funnel through this chokepoint: DFS family, Kafka family, Kinesis, Pulsar, Jdbc, Sql, Hive, HoodieIncrSource, GcsEventsSource, DatePartitionPathSelector, Debezium, S3EventsMetaSelector. The `DATASOURCES_NOT_SUPPORTED_WITH_CKPT_V2` carve-out (S3EventsHoodieIncrSource / GcsEventsHoodieIncrSource) is preserved: those sources stay V1 even on v8+ writes.

### Impact

Bug fix: prevents the wrong checkpoint key from being persisted on v6 tables. Resumed ingestion against a v6 table will now correctly continue reading `deltastreamer.checkpoint.key` from prior commits.
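
For reviewers who want the shape of the fix without opening the diff, the type-based normalization described under **Fix** can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: `CheckpointNormalizationSketch`, its nested `Checkpoint` classes, `shouldTargetV2`, `normalize`, and `commitMetadata` are hypothetical stand-ins for the real Hudi types and methods. Matching sources by simple class name is an assumption, as is carrying the checkpoint string over unchanged when the type is coerced. Only the two key names, the `< 8` version threshold, and the two carve-out sources come from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the chokepoint logic; not the real Hudi classes.
class CheckpointNormalizationSketch {

  // Commit metadata keys as named in the PR description.
  static final String V1_KEY = "deltastreamer.checkpoint.key";
  static final String V2_KEY = "streamer.checkpoint.key.v2";

  // Carve-out sources that stay on V1 even for table version 8+.
  // Matching on simple class names is an assumption made for brevity.
  static final Set<String> V1_ONLY_SOURCES = new HashSet<>(Arrays.asList(
      "S3EventsHoodieIncrSource", "GcsEventsHoodieIncrSource"));

  // Minimal checkpoint hierarchy: the concrete type encodes the version.
  abstract static class Checkpoint {
    final String value;

    Checkpoint(String value) {
      this.value = value;
    }
  }

  static final class CheckpointV1 extends Checkpoint {
    CheckpointV1(String value) {
      super(value);
    }
  }

  static final class CheckpointV2 extends Checkpoint {
    CheckpointV2(String value) {
      super(value);
    }
  }

  // V2 is allowed only for write table version 8+ and non-carve-out sources.
  static boolean shouldTargetV2(int writeTableVersion, String sourceSimpleName) {
    return writeTableVersion >= 8 && !V1_ONLY_SOURCES.contains(sourceSimpleName);
  }

  // Coerce whatever the source returned to the type the table version permits.
  // Assumption: the checkpoint string is copied as-is during coercion.
  static Checkpoint normalize(Checkpoint fromSource, int writeTableVersion, String sourceSimpleName) {
    boolean targetV2 = shouldTargetV2(writeTableVersion, sourceSimpleName);
    if (targetV2 && !(fromSource instanceof CheckpointV2)) {
      return new CheckpointV2(fromSource.value);
    }
    if (!targetV2 && !(fromSource instanceof CheckpointV1)) {
      return new CheckpointV1(fromSource.value);
    }
    return fromSource; // already the permitted type
  }

  // The persisted key follows the concrete checkpoint type.
  static Map<String, String> commitMetadata(Checkpoint checkpoint) {
    String key = (checkpoint instanceof CheckpointV2) ? V2_KEY : V1_KEY;
    return Collections.singletonMap(key, checkpoint.value);
  }

  public static void main(String[] args) {
    Checkpoint fromSource = new CheckpointV2("1700000000000");
    System.out.println(commitMetadata(normalize(fromSource, 6, "ParquetDFSSource")));
    System.out.println(commitMetadata(normalize(fromSource, 8, "ParquetDFSSource")));
    System.out.println(commitMetadata(normalize(fromSource, 8, "S3EventsHoodieIncrSource")));
  }
}
```

In this sketch, a V2 checkpoint handed back by a source is persisted under the V1 key on a v6 write, stays under the V2 key on a v8 write, and falls back to the V1 key on a v8 write for a carve-out source. Doing the coercion once at the chokepoint, rather than in each of the construction sites, is what lets the individual sources keep building V2 checkpoints unconditionally.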

### Risk level

low

### Documentation Update

No user-facing config or API change.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Adequate tests added

### Test Plan

- `testExtractCheckpointMetadata_V2FromSourceDowngradedToV1OnV6Write`: 19 source classes covering DFS, Kafka, Kinesis, Pulsar, Jdbc, Sql, SqlFileBased, Hive, Hoodie/S3Events/GcsEvents IncrSources, GcsEventsSource. All pass.
- `testExtractCheckpointMetadata_V1FromSourceUpgradedToV2OnV8Write`: 6 cases including the S3/Gcs IncrSource carve-out. All pass.
- Full `TestStreamSync#testExtractCheckpointMetadata*`: 28/28 pass.
- `mvn -pl hudi-utilities checkstyle:check`: 0 violations.

Closes #18888
