rahil-c opened a new issue, #18679:
URL: https://github.com/apache/hudi/issues/18679

   ### Motivation
   
   In [PR #18678 
review](https://github.com/apache/hudi/pull/18678#discussion_r3178430212), 
@yihua noted that format-specific conditions are hardcoded throughout the 
codebase (e.g., `isSplitable`, `supportBatch`, `buildBaseFileReader` in 
`HoodieFileGroupReaderBasedFileFormat`). Each time a new file format is added 
(like Lance), every such branch must be updated, which is error-prone and 
violates the open/closed principle.
   
   ### Proposal
   
   Introduce a pluggable **file format adapter** interface so that adding a new 
base file format only requires implementing the adapter rather than modifying 
every conditional in the read/write path.
   
   Hardcoded conditions to consolidate (non-exhaustive, scoped to 
`HoodieFileGroupReaderBasedFileFormat`):
   
   | Location | Current pattern |
   |---|---|
   | `isSplitable` (line 222) | `!isLance && superSplitable` |
   | `supportBatch` (lines 161-170) | `if PARQUET/ORC ... else if LANCE ...` |
   | `buildBaseFileReader` (lines 336-353) | `if PARQUET ... else if LANCE ...` |
   | `withVectorRewrite` (line 447) | `if (hoodieFileFormat != HoodieFileFormat.PARQUET)` |
   
   Similar format-branching exists in:
   - `HoodieSparkFileReaderFactory`
   - `HoodieSparkFileWriterFactory`
   - `HoodieInternalRowFileWriterFactory`
   
   ### Suggested approach
   
   Define a trait/interface (e.g., `HoodieBaseFileFormatAdapter`) with methods 
like:
   - `isSplitable(): Boolean`
   - `supportsBatchRead(): Boolean`
   - `createReader(...): SparkColumnarFileReader`
   - `needsVectorRewrite(): Boolean`
   
   Each format (Parquet, ORC, Lance) implements the adapter. The adapter is 
resolved once from `HoodieFileFormat` and passed to the code that needs it, 
replacing the `if/else` chains on the enum.
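
A minimal sketch of the idea in Java, not the actual Hudi code. Every name here (`SketchFileFormat`, `BaseFileFormatAdapter`, `FormatAdapters`, `FileGroupReaderSketch`) is hypothetical, the enum is a stand-in for `HoodieFileFormat`, and `createReader` returns a `String` as a placeholder for the real reader type. Only `isSplitable` and `createReader` are shown; the other methods would follow the same shape. The `isSplitable` values follow the `!isLance && superSplitable` pattern from the table above.

```java
import java.util.EnumMap;
import java.util.Map;

// Stand-in for Hudi's HoodieFileFormat enum, reduced to the formats named in this issue.
enum SketchFileFormat { PARQUET, ORC, LANCE }

// Hypothetical adapter: one method per decision that is currently branched on the enum.
interface BaseFileFormatAdapter {
    boolean isSplitable();
    String createReader(String path); // String is a placeholder for the real reader type
}

final class ParquetAdapter implements BaseFileFormatAdapter {
    public boolean isSplitable() { return true; }
    public String createReader(String path) { return "parquet-reader:" + path; }
}

final class OrcAdapter implements BaseFileFormatAdapter {
    public boolean isSplitable() { return true; }
    public String createReader(String path) { return "orc-reader:" + path; }
}

final class LanceAdapter implements BaseFileFormatAdapter {
    // Lance files must not be split (mirrors the existing `!isLance` check).
    public boolean isSplitable() { return false; }
    public String createReader(String path) { return "lance-reader:" + path; }
}

// Single place where the enum is mapped to behavior.
final class FormatAdapters {
    private static final Map<SketchFileFormat, BaseFileFormatAdapter> ADAPTERS =
        new EnumMap<>(SketchFileFormat.class);

    static {
        ADAPTERS.put(SketchFileFormat.PARQUET, new ParquetAdapter());
        ADAPTERS.put(SketchFileFormat.ORC, new OrcAdapter());
        ADAPTERS.put(SketchFileFormat.LANCE, new LanceAdapter());
    }

    private FormatAdapters() { }

    static BaseFileFormatAdapter forFormat(SketchFileFormat format) {
        BaseFileFormatAdapter adapter = ADAPTERS.get(format);
        if (adapter == null) {
            // A format without an adapter fails loudly instead of inheriting a default.
            throw new IllegalArgumentException("No adapter registered for " + format);
        }
        return adapter;
    }
}

// Call site: resolves the adapter once and never branches on the enum again.
final class FileGroupReaderSketch {
    private final BaseFileFormatAdapter adapter;

    FileGroupReaderSketch(SketchFileFormat format) {
        this.adapter = FormatAdapters.forFormat(format);
    }

    boolean isSplitable(boolean superSplitable) {
        return adapter.isSplitable() && superSplitable;
    }

    String openBaseFile(String path) {
        return adapter.createReader(path);
    }
}
```

With this shape, supporting another format would mean adding one adapter class and one registry entry, and a format without a registered adapter raises an error at construction time rather than silently picking up another format's behavior.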
   
   ### Context
   
   This was identified during the Lance duplicate-read fix (#18677 / PR 
#18678), where `isSplitable` inherited Parquet's `true` because no 
Lance-specific branch existed.

