nsivabalan opened a new issue, #18888:
URL: https://github.com/apache/hudi/issues/18888

   ### Tips before filing an issue
   - [x] Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   ### Describe the problem you faced
   
   When `HoodieStreamer` ingests with `hoodie.write.table.version=6` and the 
source is one that funnels a `StreamerCheckpointV2` through 
`InputBatch#getCheckpointForNextBatch` (e.g. any `*DFSSource` via 
`DFSPathSelector`, Kafka via the `InputBatch(batch, String)` constructor, 
Kinesis, Pulsar, Jdbc, Sql, Hive, HoodieIncrSource, DatePartitionPathSelector, 
Debezium, S3EventsMetaSelector, GcsEventsSource), the persisted commit metadata 
carries the V2 key `streamer.checkpoint.key.v2` instead of the V1 key 
`deltastreamer.checkpoint.key`.
   
   This is a contract violation: 
`CheckpointUtils.shouldTargetCheckpointV2(writeTableVersion, sourceClassName)` 
returns `false` for `writeTableVersion < 8`, so V1 is required for v6 tables.
   
   ### To Reproduce
   
   Steps to reproduce the behavior:
   
   1. Use HoodieStreamer with `--source-class 
org.apache.hudi.utilities.sources.ParquetDFSSource`
   2. Pass `--hoodie-conf hoodie.write.table.version=6`
   3. After a sync produces a commit, inspect `extraMetadata` in the `.commit` 
file
   4. Observe `streamer.checkpoint.key.v2=<value>` is present and 
`deltastreamer.checkpoint.key` is absent
   
   ### Expected behavior
   
   For `hoodie.write.table.version=6`, the persisted commit metadata should 
carry the V1 key (`deltastreamer.checkpoint.key`) and not the V2 key 
(`streamer.checkpoint.key.v2`).
   
   ### Environment Description
   
   - Hudi version: master (1.x line)
   - Spark version: 3.5
   - Hive version: n/a
   - Hadoop version: n/a
   - Storage (HDFS/S3/GCS..): any
   - Running on Docker?: no
   
   ### Additional context
   
   Root cause: `DFSPathSelector.getNextFilePathsAndMaxModificationTime` (lines 
154 & 160) and similar helpers in other sources unconditionally construct `new 
StreamerCheckpointV2(...)` without checking the target table version. 
`StreamSync.extractCheckpointMetadata` then trusts that type and writes V2 keys 
regardless of write table version. The branch two lines below for the no-batch 
fallback already calls `buildCheckpointFromGeneralSource` and gets the version 
contract right — so this is a missed branch in an existing chokepoint, not a 
missing primitive.
   
   PR: WIP — will link.
   
   ### Stacktrace
   
   n/a — wrong-key persistence, not a thrown exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to