prashantwason opened a new pull request, #18384:
URL: https://github.com/apache/hudi/pull/18384

   ## Summary
   
   - Adds `hoodie.meta.fields.to.exclude` config for selective meta field 
population
   - Excluded meta fields are written as **null** (not empty string) for 
optimal Parquet storage savings
   - Covers all 4 write paths: Avro file writers, Spark InternalRow, Spark SQL 
row-writer, Flink
   - Uses a pre-computed `boolean[5]` indexed by meta field ordinal, so each 
per-row check is a single array lookup
   - Disables bloom filter when `_hoodie_record_key` is excluded
   - Fixes null safety in Flink `AbstractHoodieRowData.getString()`
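
The pre-computed mask described above might look roughly like the sketch below. It is illustrative only and is not the code in this PR: the class and method names (`MetaFieldMaskSketch`, `buildExcludeMask`, `applyMask`) are hypothetical, and the ordinal order of the five meta fields is assumed.

```java
import java.util.Arrays;

public class MetaFieldMaskSketch {
    // Assumed ordinal order of the five meta fields (an assumption of this sketch).
    static final String[] META_FIELDS = {
        "_hoodie_commit_time",
        "_hoodie_commit_seqno",
        "_hoodie_record_key",
        "_hoodie_partition_path",
        "_hoodie_file_name"
    };

    /** Parses the comma-separated config value once, at writer setup time. */
    static boolean[] buildExcludeMask(String configValue) {
        boolean[] excluded = new boolean[META_FIELDS.length];
        if (configValue == null || configValue.trim().isEmpty()) {
            return excluded; // nothing excluded
        }
        for (String name : configValue.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int ordinal = Arrays.asList(META_FIELDS).indexOf(trimmed);
            if (ordinal < 0) {
                throw new IllegalArgumentException("Unknown meta field: " + trimmed);
            }
            excluded[ordinal] = true;
        }
        return excluded;
    }

    /** Per-row path: one array lookup per field; excluded fields become null. */
    static String[] applyMask(boolean[] excluded, String[] metaValues) {
        String[] out = new String[metaValues.length];
        for (int i = 0; i < metaValues.length; i++) {
            out[i] = excluded[i] ? null : metaValues[i];
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] mask = buildExcludeMask("_hoodie_record_key,_hoodie_file_name");
        String[] row = applyMask(mask,
            new String[] {"20240101", "20240101_0_1", "key1", "p=1", "f1.parquet"});
        System.out.println(Arrays.toString(row));
    }
}
```

The string parsing happens once per writer, so the per-row cost is limited to indexing into the array. Writing `null` rather than an empty string for excluded fields follows the bullet above.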
   
   ## Motivation
   
   Closes https://github.com/apache/hudi/issues/18383
   Discussion: https://github.com/apache/hudi/discussions/17959
   
   Users currently face a trade-off because `hoodie.populate.meta.fields` is 
all-or-nothing. Disabling it saves storage but breaks incremental queries, 
which require `_hoodie_commit_time`. Other fields, such as `_hoodie_record_key`, 
`_hoodie_partition_path`, and `_hoodie_file_name`, can be virtualized and don't 
need physical storage.
   
   This PR adds a middle ground: selectively exclude virtualizable meta fields 
while keeping essential ones like `_hoodie_commit_time`.
   
   **Example config:**
   ```
   hoodie.populate.meta.fields=true
   hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,_hoodie_commit_seqno
   ```
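
To make the consequences of such a config concrete, here is a small standalone sketch that reads the two properties and derives two of the effects described in this PR: the bloom filter is disabled when `_hoodie_record_key` is excluded, and incremental queries need `_hoodie_commit_time`. The class `ExcludeConfigSketch` and its methods are hypothetical and use only `java.util.Properties`, not Hudi's config classes. The `true` default for `hoodie.populate.meta.fields` is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class ExcludeConfigSketch {
    static final String POPULATE_KEY = "hoodie.populate.meta.fields";
    static final String EXCLUDE_KEY = "hoodie.meta.fields.to.exclude";

    /** Loads key=value lines into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits the comma-separated exclude list into a set of field names. */
    static Set<String> excludedFields(Properties props) {
        Set<String> result = new HashSet<>();
        for (String name : props.getProperty(EXCLUDE_KEY, "").split(",")) {
            if (!name.trim().isEmpty()) {
                result.add(name.trim());
            }
        }
        return result;
    }

    /** Per this PR: excluding the record key disables the bloom filter. */
    static boolean bloomFilterEnabled(Properties props) {
        return !excludedFields(props).contains("_hoodie_record_key");
    }

    /** Incremental queries need a physically stored _hoodie_commit_time. */
    static boolean commitTimeStored(Properties props) {
        // Assumed default of "true" when the populate flag is absent.
        boolean populate = Boolean.parseBoolean(props.getProperty(POPULATE_KEY, "true"));
        return populate && !excludedFields(props).contains("_hoodie_commit_time");
    }

    public static void main(String[] args) {
        Properties props = parse(
            "hoodie.populate.meta.fields=true\n"
            + "hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,"
            + "_hoodie_file_name,_hoodie_commit_seqno\n");
        System.out.println("bloom filter enabled: " + bloomFilterEnabled(props));
        System.out.println("commit time stored:   " + commitTimeStored(props));
    }
}
```

With the example config above, the sketch reports the bloom filter as disabled and the commit time as still stored, which is the middle ground this PR targets.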
   
   ## Test plan
   
   - [ ] Verify compilation across all modules (Spark, Flink, Avro)
   - [ ] Run existing `populateMetaFields` tests for regression 
(`TestHoodieRowCreateHandle`, `TestHoodieDatasetBulkInsertHelper`)
   - [ ] Add test with selective exclusion verifying excluded fields are null 
in written Parquet files
   - [ ] Verify non-excluded fields have correct values
   - [ ] Verify all-excluded behavior matches `populateMetaFields=false`
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.