rahil-c opened a new issue, #18679: URL: https://github.com/apache/hudi/issues/18679
### Motivation

In [PR #18678 review](https://github.com/apache/hudi/pull/18678#discussion_r3178430212), @yihua noted that format-specific conditions are hardcoded throughout the codebase (e.g., `isSplitable`, `supportBatch`, `buildBaseFileReader` in `HoodieFileGroupReaderBasedFileFormat`). Each time a new file format is added (like Lance), every such branch must be updated. This is error-prone and violates the open/closed principle.

### Proposal

Introduce a pluggable **file format adapter** interface so that adding a new base file format only requires implementing the adapter rather than modifying every conditional in the read/write path.

Hardcoded conditions to consolidate (non-exhaustive, scoped to `HoodieFileGroupReaderBasedFileFormat`):

| Location | Current pattern |
|---|---|
| `isSplitable` (line 222) | `!isLance && superSplitable` |
| `supportBatch` (lines 161-170) | `if PARQUET/ORC ... else if LANCE ...` |
| `buildBaseFileReader` (lines 336-353) | `if PARQUET ... else if LANCE ...` |
| `withVectorRewrite` (line 447) | `if (hoodieFileFormat != HoodieFileFormat.PARQUET)` |

Similar format-branching exists in:

- `HoodieSparkFileReaderFactory`
- `HoodieSparkFileWriterFactory`
- `HoodieInternalRowFileWriterFactory`

### Suggested approach

Define a trait/interface (e.g., `HoodieBaseFileFormatAdapter`) with methods like:

- `isSplitable(): Boolean`
- `supportsBatchRead(): Boolean`
- `createReader(...): SparkColumnarFileReader`
- `needsVectorRewrite(): Boolean`

Each format (Parquet, ORC, Lance) implements the adapter. The format object is resolved once from `HoodieFileFormat` and threaded through, with no more `if/else` chains on the enum.

### Context

This was identified during the Lance duplicate-read fix (#18677 / PR #18678), where `isSplitable` inherited Parquet's `true` because no Lance-specific branch existed.
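To make the adapter proposal more concrete, here is a minimal, self-contained Java sketch of the "resolve once, then thread through" idea. It is not Hudi code. `FormatAdapterSketch`, `FileFormat` (a stand-in for `HoodieFileFormat`), `BaseFileFormatAdapter`, `SimpleAdapter`, and `resolve` are hypothetical names. `createReader` returns a plain string in place of a `SparkColumnarFileReader`. Only two flag values come from the issue text (Parquet's `isSplitable` is `true`, Lance's is `false`); every other flag value is an illustrative placeholder.

```java
import java.util.EnumMap;
import java.util.Map;

final class FormatAdapterSketch {

    /** Stand-in for Hudi's HoodieFileFormat enum, reduced to the formats named in the issue. */
    enum FileFormat { PARQUET, ORC, LANCE }

    /** Hypothetical adapter; method names follow the list in the suggested approach. */
    interface BaseFileFormatAdapter {
        boolean isSplitable();
        boolean supportsBatchRead();
        boolean needsVectorRewrite();
        /** Placeholder for createReader(...): returns a label, not a real reader. */
        String createReader(String path);
    }

    /** Immutable adapter whose answers are fixed when it is registered. */
    static final class SimpleAdapter implements BaseFileFormatAdapter {
        private final String name;
        private final boolean splitable;
        private final boolean batchRead;
        private final boolean vectorRewrite;

        SimpleAdapter(String name, boolean splitable, boolean batchRead, boolean vectorRewrite) {
            this.name = name;
            this.splitable = splitable;
            this.batchRead = batchRead;
            this.vectorRewrite = vectorRewrite;
        }

        @Override public boolean isSplitable() { return splitable; }
        @Override public boolean supportsBatchRead() { return batchRead; }
        @Override public boolean needsVectorRewrite() { return vectorRewrite; }
        @Override public String createReader(String path) { return name + ":" + path; }
    }

    private static final Map<FileFormat, BaseFileFormatAdapter> ADAPTERS =
            new EnumMap<>(FileFormat.class);

    static {
        // From the issue: Parquet is splitable, Lance is not.
        // All remaining flag values are placeholders, not Hudi's actual behavior.
        ADAPTERS.put(FileFormat.PARQUET, new SimpleAdapter("parquet", true, true, true));
        ADAPTERS.put(FileFormat.ORC, new SimpleAdapter("orc", true, true, false));
        ADAPTERS.put(FileFormat.LANCE, new SimpleAdapter("lance", false, false, false));
    }

    /** Single lookup point that replaces per-call-site if/else chains on the enum. */
    static BaseFileFormatAdapter resolve(FileFormat format) {
        BaseFileFormatAdapter adapter = ADAPTERS.get(format);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter registered for " + format);
        }
        return adapter;
    }

    public static void main(String[] args) {
        for (FileFormat format : FileFormat.values()) {
            BaseFileFormatAdapter adapter = resolve(format);
            System.out.println(format + " splitable=" + adapter.isSplitable()
                    + " reader=" + adapter.createReader("/tmp/example"));
        }
    }
}
```

With a registry like this, a format that has no adapter fails loudly in `resolve` instead of silently inheriting another format's behavior, which is the failure mode described in the context of #18677.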
