nsivabalan opened a new pull request, #18889:
URL: https://github.com/apache/hudi/pull/18889

   ### Change Logs
   
   When `HoodieStreamer` ingests with `hoodie.write.table.version=6`, the 
persisted commit metadata carried the V2 checkpoint key 
`streamer.checkpoint.key.v2` instead of the V1 key 
`deltastreamer.checkpoint.key`. This violates the version contract: 
`CheckpointUtils.shouldTargetCheckpointV2` returns false for 
`writeTableVersion < 8`, so a v6 write must persist the V1 key.
   
   **Root cause:** `DFSPathSelector.getNextFilePathsAndMaxModificationTime` 
(lines 154 & 160) and many other source helpers unconditionally construct `new 
StreamerCheckpointV2(...)`. `*DFSSource` implementations pass that V2 
checkpoint into `InputBatch#getCheckpointForNextBatch`, and 
`StreamSync.extractCheckpointMetadata` used it without checking its type 
against the write table version. The no-batch fallback two lines below already 
calls `buildCheckpointFromGeneralSource` and gets the version contract right, 
so this was a missed branch in an existing chokepoint, not a missing primitive.
   
   **Fix:** type-based normalization at the chokepoint in 
`StreamSync.extractCheckpointMetadata`. When the source returns a checkpoint, 
coerce its concrete type to whatever `shouldTargetCheckpointV2(versionCode, 
sourceClassName)` permits before calling `getCheckpointCommitMetadata`. Mirrors 
the inbound translation already done in `Source.translateCheckpoint`.
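
The coercion described above can be sketched roughly as follows. This is a 
minimal, standalone illustration and not the code in this PR: the `Checkpoint`, 
`CheckpointV1` and `CheckpointV2` classes, the `V1_ONLY_SOURCES` set and the 
`normalize` helper are hypothetical stand-ins, the carve-out is matched on 
simple class names for brevity, and the sketch assumes the checkpoint value 
carries over unchanged when the type is swapped.

```java
import java.util.Set;

// Standalone sketch; every type here is a simplified, hypothetical stand-in.
class CheckpointNormalizationSketch {

  static final String V1_KEY = "deltastreamer.checkpoint.key";
  static final String V2_KEY = "streamer.checkpoint.key.v2";

  // Stand-in for a checkpoint: a value plus the commit-metadata key it persists under.
  abstract static class Checkpoint {
    private final String value;
    Checkpoint(String value) { this.value = value; }
    abstract String key();
    String value() { return value; }
  }

  static final class CheckpointV1 extends Checkpoint {
    CheckpointV1(String value) { super(value); }
    @Override String key() { return V1_KEY; }
  }

  static final class CheckpointV2 extends Checkpoint {
    CheckpointV2(String value) { super(value); }
    @Override String key() { return V2_KEY; }
  }

  // Hypothetical stand-in for DATASOURCES_NOT_SUPPORTED_WITH_CKPT_V2 (simple names only).
  static final Set<String> V1_ONLY_SOURCES =
      Set.of("S3EventsHoodieIncrSource", "GcsEventsHoodieIncrSource");

  // Assumed shape of the contract: V2 only on table version 8+ and outside the carve-out.
  static boolean shouldTargetCheckpointV2(int writeTableVersion, String sourceClassName) {
    return writeTableVersion >= 8 && !V1_ONLY_SOURCES.contains(sourceClassName);
  }

  // Coerce whatever the source returned to the type the write table version permits.
  static Checkpoint normalize(Checkpoint fromSource, int writeTableVersion, String sourceClassName) {
    boolean wantV2 = shouldTargetCheckpointV2(writeTableVersion, sourceClassName);
    if (wantV2 && !(fromSource instanceof CheckpointV2)) {
      return new CheckpointV2(fromSource.value());
    }
    if (!wantV2 && !(fromSource instanceof CheckpointV1)) {
      return new CheckpointV1(fromSource.value());
    }
    return fromSource; // already the permitted type
  }

  public static void main(String[] args) {
    // A DFS-style source hands back V2, but the write targets table version 6.
    Checkpoint c = normalize(new CheckpointV2("1700000000000"), 6, "ParquetDFSSource");
    System.out.println(c.key() + "=" + c.value());
    // prints deltastreamer.checkpoint.key=1700000000000
  }
}
```

In this sketch the decision depends only on the write table version and the 
source class name, never on the type the source happened to construct, so a 
carve-out source stays on the V1 key even for a v8 write.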
   
   **Coverage audit:** all 17 `new StreamerCheckpointV2(...)` construction 
sites in `hudi-utilities/src/main` funnel through this chokepoint — DFS family, 
Kafka family, Kinesis, Pulsar, Jdbc, Sql, Hive, HoodieIncrSource, 
GcsEventsSource, DatePartitionPathSelector, Debezium, S3EventsMetaSelector. The 
`DATASOURCES_NOT_SUPPORTED_WITH_CKPT_V2` carve-out (S3EventsHoodieIncrSource / 
GcsEventsHoodieIncrSource) is preserved: those sources stay V1 even on v8+ 
writes.
   
   ### Impact
   
   Bug fix: prevents the wrong checkpoint key from being persisted on v6 
tables. Resumed ingestion against a v6 table will now correctly continue 
reading `deltastreamer.checkpoint.key` from prior commits.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   No user-facing config or API change.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Adequate tests added
   
   ### Test Plan
   
   - `testExtractCheckpointMetadata_V2FromSourceDowngradedToV1OnV6Write` — 19 
source classes covering DFS, Kafka, Kinesis, Pulsar, Jdbc, Sql, SqlFileBased, 
Hive, Hoodie/S3Events/GcsEvents IncrSources, GcsEventsSource. All pass.
   - `testExtractCheckpointMetadata_V1FromSourceUpgradedToV2OnV8Write` — 6 
cases including the S3/Gcs IncrSource carve-out. All pass.
   - Full `TestStreamSync#testExtractCheckpointMetadata*` — 28/28 pass.
   - `mvn -pl hudi-utilities checkstyle:check` — 0 violations.
   
   Closes #18888

