prashantwason opened a new issue, #18383:
URL: https://github.com/apache/hudi/issues/18383

   ## Context
   
   Reference: https://github.com/apache/hudi/discussions/17959
   
   Hudi currently enforces an all-or-nothing approach for meta field population 
via `hoodie.populate.meta.fields`. Users must either populate all 5 meta 
columns or none:
   
   - **All enabled** (`hoodie.populate.meta.fields=true`): Populates 
`_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, 
`_hoodie_partition_path`, `_hoodie_file_name`
   - **All disabled** (`hoodie.populate.meta.fields=false`): All 5 columns are 
written as empty strings, losing incremental query capability
   
   ## Problem
   
   Users face an unnecessary trade-off between storage efficiency and 
incremental read capability:
   
   - `_hoodie_commit_time` is essential for incremental queries and **cannot be 
virtualized**
   - `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name` 
**can be virtualized** (generated on-the-fly during reads)
   - There is no way to keep only the essential fields while nullifying the rest
   
   This results in unnecessary storage overhead for users who need incremental 
reads but don't need the virtualizable fields physically stored.
   
   ## Proposed Solution
   
   Introduce a new config `hoodie.meta.fields.to.exclude` — a comma-separated 
list of meta field names to exclude from population. Excluded fields are 
written as **null** (not empty string) for optimal Parquet storage savings 
(nulls take zero data bytes via definition levels).
   
   **Valid values:** `_hoodie_commit_time`, `_hoodie_commit_seqno`, 
`_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`
   
   **Example:** To keep only commit time for incremental reads:
   ```
   hoodie.populate.meta.fields=true
   hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,_hoodie_commit_seqno
   ```
   
   ### Behavior
   - Only effective when `hoodie.populate.meta.fields=true`
   - When all 5 fields are excluded, behavior matches 
`hoodie.populate.meta.fields=false`
   - When the exclude list is empty (default), all fields are populated — 
identical to current behavior
   - The bloom filter is disabled when `_hoodie_record_key` is excluded (since 
it indexes record keys)
   
   ## Design
   
   ### Config
   
   - **`HoodieTableConfig`**: Add `META_FIELDS_TO_EXCLUDE` config property
   - **`HoodieWriteConfig`**: Add `getMetaFieldsToExclude()` (returns 
`Set<String>`) and `getMetaFieldPopulationFlags()` (returns pre-computed 
`boolean[5]` indexed by meta field ordinal)
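
   The two proposed getters can be sketched as follows. This is a minimal, 
self-contained illustration, not Hudi's actual `HoodieWriteConfig`: the raw 
config value is stubbed as a constructor argument, and the meta-column 
ordering follows the ordinal order listed above.

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Sketch of the proposed HoodieWriteConfig additions.
   public class MetaFieldConfigSketch {
     // Meta columns in ordinal order (0 = _hoodie_commit_time ... 4 = _hoodie_file_name).
     static final String[] META_COLUMNS = {
         "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
         "_hoodie_partition_path", "_hoodie_file_name"};

     private final String rawExcludeValue; // value of hoodie.meta.fields.to.exclude

     public MetaFieldConfigSketch(String rawExcludeValue) {
       this.rawExcludeValue = rawExcludeValue;
     }

     // Comma-separated config value -> set of excluded field names.
     public Set<String> getMetaFieldsToExclude() {
       Set<String> excluded = new HashSet<>();
       if (rawExcludeValue != null && !rawExcludeValue.trim().isEmpty()) {
         for (String field : rawExcludeValue.split(",")) {
           excluded.add(field.trim());
         }
       }
       return excluded;
     }

     // Pre-computed boolean[5]: true = populate the field, false = write null.
     public boolean[] getMetaFieldPopulationFlags() {
       Set<String> excluded = getMetaFieldsToExclude();
       boolean[] populate = new boolean[META_COLUMNS.length];
       for (int i = 0; i < META_COLUMNS.length; i++) {
         populate[i] = !excluded.contains(META_COLUMNS[i]);
       }
       return populate;
     }
   }
   ```

   An empty or missing config value yields an empty exclude set and an 
all-`true` flag array, preserving current behavior.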
   
   ### Performance
   
   The `boolean[5]` array is computed once in writer constructors from the 
config. Per-row checks are a single array access (`if 
(populateField[ordinal])`) — zero allocation, branch-predictor friendly. No 
Set/Map lookups in the hot path.
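
   The hot-path shape described above can be sketched like this (class and 
method names are illustrative only):

   ```java
   // Flags are computed once in the writer constructor; each row then does
   // a plain array read per meta field -- no Set/Map lookup, no allocation.
   public class HotPathSketch {
     private final boolean[] populateField; // computed once from config

     public HotPathSketch(boolean[] populateField) {
       this.populateField = populateField;
     }

     // Value to physically write for meta field `ordinal`:
     // the computed value when populated, null when excluded.
     public String metaValueFor(int ordinal, String computedValue) {
       return populateField[ordinal] ? computedValue : null; // single array access
     }
   }
   ```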
   
   ### Writer paths modified
   
   There are 4 distinct code paths that populate meta fields, all modified to 
conditionally write null for excluded fields:
   
   | Path | Engine | Key File | Change |
   |------|--------|----------|--------|
   | 1 | Avro writers (Spark RDD) | `HoodieAvroFileWriter` | New 
`prepRecordWithMetadata(..., boolean[])` overload; skip `rec.put()` for 
excluded fields (defaults to null) |
   | 2 | Spark InternalRow writers | `HoodieSparkParquetWriter` | 
`updateRecordMetadata()` conditionally sets each field via 
`populateField[ordinal]` |
   | 3 | Spark SQL row-writer | `HoodieRowCreateHandle` | `writeRow()` sets 
excluded entries in `UTF8String[5]` to null instead of computed values |
   | 4 | Flink writer | `HoodieRowDataCreateHandle` | `write()` passes null for 
excluded fields to `HoodieRowDataCreation.create()` |
   
   ### Bloom filter
   
   The bloom filter indexes record keys, so it is disabled when 
`_hoodie_record_key` is excluded:
   - `HoodieInternalRowFileWriterFactory.tryInstantiateBloomFilter()` — checks 
`populateField[2]`
   - `HoodieAvroParquetWriter.writeAvro/writeAvroWithMetadata()` — guards 
`writeSupport.add()` with `populateField[2]`
   - `HoodieSparkParquetWriter.writeRow/writeRowWithMetadata()` — same guard
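
   The guard in all three places has the same shape, sketched below. The 
`BloomFilter` interface here is a stand-in for Hudi's, and ordinal 2 is 
`_hoodie_record_key` per the ordering above:

   ```java
   // Sketch of the bloom-filter guard: keys are only added when
   // _hoodie_record_key (ordinal 2) is populated.
   public class BloomGuardSketch {
     interface BloomFilter { void add(String key); }

     private final boolean[] populateField;
     private final BloomFilter bloomFilter;

     public BloomGuardSketch(boolean[] populateField, BloomFilter bloomFilter) {
       this.populateField = populateField;
       // Don't even instantiate/retain the filter when keys are excluded.
       this.bloomFilter = populateField[2] ? bloomFilter : null;
     }

     public void writeRow(String recordKey) {
       // ... write the row itself ...
       if (populateField[2] && bloomFilter != null) {
         bloomFilter.add(recordKey); // guard: no keys indexed when excluded
       }
     }
   }
   ```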
   
   ### Null safety (Flink)
   
   `AbstractHoodieRowData.getString()` updated to handle null meta columns 
without NPE — returns null instead of calling `StringData.fromString(null)`.
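
   A sketch of the null-safety fix, with Flink's `StringData` stubbed out 
(the stub deliberately throws on null input to mimic the failure mode being 
avoided; it is not Flink's actual class):

   ```java
   // Sketch of the AbstractHoodieRowData.getString() fix: a null meta
   // column returns null instead of being passed to StringData.fromString.
   public class NullSafeRowSketch {
     static final class StringData {
       final String value;
       private StringData(String value) { this.value = value; }
       static StringData fromString(String s) {
         if (s == null) throw new NullPointerException("null string"); // stubbed failure mode
         return new StringData(s);
       }
     }

     private final String[] metaColumns;

     public NullSafeRowSketch(String[] metaColumns) {
       this.metaColumns = metaColumns;
     }

     public StringData getString(int pos) {
       String value = metaColumns[pos];
       // Excluded meta fields are stored as null; short-circuit to null
       // rather than NPE-ing inside fromString.
       return value == null ? null : StringData.fromString(value);
     }
   }
   ```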
   
   ### Files modified
   
   1. `hudi-common/.../HoodieTableConfig.java` — config property
   2. `hudi-client/hudi-client-common/.../HoodieWriteConfig.java` — getter + 
boolean[] helper
   3. `hudi-common/.../HoodieAvroFileWriter.java` — new overload
   4. `hudi-hadoop-common/.../HoodieAvroParquetWriter.java` — selective 
population + bloom filter guard
   5. `hudi-hadoop-common/.../HoodieAvroOrcWriter.java` — selective population
   6. `hudi-hadoop-common/.../HoodieAvroHFileWriter.java` — selective population
   7. `hudi-client/hudi-spark-client/.../HoodieSparkParquetWriter.java` — 
selective population + bloom filter guard
   8. `hudi-client/hudi-spark-client/.../HoodieRowCreateHandle.java` — null for 
excluded fields
   9. `hudi-client/hudi-spark-client/.../HoodieDatasetBulkInsertHelper.scala` — 
null for excluded fields
   10. `hudi-client/hudi-flink-client/.../HoodieRowDataCreateHandle.java` — 
null for excluded fields
   11. `hudi-client/hudi-flink-client/.../AbstractHoodieRowData.java` — null 
safety fix
   12. 
`hudi-client/hudi-spark-client/.../HoodieInternalRowFileWriterFactory.java` — 
bloom filter guard

