prashantwason opened a new issue, #18383: URL: https://github.com/apache/hudi/issues/18383
## Context

Reference: https://github.com/apache/hudi/discussions/17959

Hudi currently enforces an all-or-nothing approach to meta field population via `hoodie.populate.meta.fields`. Users must either populate all 5 meta columns or none:

- **All enabled** (`hoodie.populate.meta.fields=true`): populates `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`
- **All disabled** (`hoodie.populate.meta.fields=false`): all 5 columns are written as empty strings, losing incremental query capability

## Problem

Users face an unnecessary trade-off between storage efficiency and incremental read capability:

- `_hoodie_commit_time` is essential for incremental queries and **cannot be virtualized**
- `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name` **can be virtualized** (generated on the fly during reads)
- There is no way to keep only the essential fields while nullifying the rest

This results in unnecessary storage overhead for users who need incremental reads but do not need the virtualizable fields physically stored.

## Proposed Solution

Introduce a new config, `hoodie.meta.fields.to.exclude`: a comma-separated list of meta field names to exclude from population. Excluded fields are written as **null** (not empty string) for optimal Parquet storage savings, since nulls take zero data bytes via definition levels.
**Valid values:** `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`

**Example:** to keep only commit time for incremental reads:

```
hoodie.populate.meta.fields=true
hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,_hoodie_commit_seqno
```

### Behavior

- Only effective when `hoodie.populate.meta.fields=true`
- When all 5 fields are excluded, behavior matches `hoodie.populate.meta.fields=false`
- When the exclude list is empty (the default), all fields are populated, identical to current behavior
- The bloom filter is disabled when `_hoodie_record_key` is excluded (since it indexes record keys)

## Design

### Config

- **`HoodieTableConfig`**: add a `META_FIELDS_TO_EXCLUDE` config property
- **`HoodieWriteConfig`**: add `getMetaFieldsToExclude()` (returns `Set<String>`) and `getMetaFieldPopulationFlags()` (returns a pre-computed `boolean[5]` indexed by meta field ordinal)

### Performance

The `boolean[5]` array is computed once in the writer constructors from the config. Per-row checks are a single array access (`if (populateField[ordinal])`): zero allocation, branch-predictor friendly, and no Set/Map lookups in the hot path.
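The flag precomputation described above can be sketched as follows. This is a minimal, self-contained illustration with a hypothetical `computeFlags` helper; the real `getMetaFieldPopulationFlags()` would live in `HoodieWriteConfig` and read the validated config value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (not the actual Hudi implementation) of how
// getMetaFieldPopulationFlags() could precompute the boolean[5] from the
// comma-separated exclude list. Class and method names are hypothetical.
public class MetaFieldFlags {

  // Meta fields indexed by ordinal, in the order listed in this issue.
  static final List<String> META_FIELDS = Arrays.asList(
      "_hoodie_commit_time",
      "_hoodie_commit_seqno",
      "_hoodie_record_key",
      "_hoodie_partition_path",
      "_hoodie_file_name");

  // Parse hoodie.meta.fields.to.exclude once; per-row code then only does
  // a single array access, never a Set lookup.
  static boolean[] computeFlags(String excludeConfig) {
    Set<String> excluded = new HashSet<>();
    if (excludeConfig != null && !excludeConfig.trim().isEmpty()) {
      for (String field : excludeConfig.split(",")) {
        excluded.add(field.trim());
      }
    }
    boolean[] populate = new boolean[META_FIELDS.size()];
    for (int ordinal = 0; ordinal < META_FIELDS.size(); ordinal++) {
      populate[ordinal] = !excluded.contains(META_FIELDS.get(ordinal));
    }
    return populate;
  }

  public static void main(String[] args) {
    // The example config above: keep only _hoodie_commit_time.
    boolean[] flags = computeFlags(
        "_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,_hoodie_commit_seqno");
    System.out.println(Arrays.toString(flags));
    // prints [true, false, false, false, false]
  }
}
```

With the example config, only ordinal 0 (`_hoodie_commit_time`) stays populated; in particular `populate[2]` is false, which is the condition the bloom filter guards check.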
### Writer paths modified

There are 4 distinct code paths that populate meta fields; all are modified to conditionally write null for excluded fields:

| Path | Engine | Key File | Change |
|------|--------|----------|--------|
| 1 | Avro writers (Spark RDD) | `HoodieAvroFileWriter` | New `prepRecordWithMetadata(..., boolean[])` overload; skip `rec.put()` for excluded fields (defaults to null) |
| 2 | Spark InternalRow writers | `HoodieSparkParquetWriter` | `updateRecordMetadata()` conditionally sets each field via `populateField[ordinal]` |
| 3 | Spark SQL row-writer | `HoodieRowCreateHandle` | `writeRow()` sets excluded entries in `UTF8String[5]` to null instead of computed values |
| 4 | Flink writer | `HoodieRowDataCreateHandle` | `write()` passes null for excluded fields to `HoodieRowDataCreation.create()` |

### Bloom filter

The bloom filter indexes record keys, so it is disabled when `_hoodie_record_key` is excluded:

- `HoodieInternalRowFileWriterFactory.tryInstantiateBloomFilter()` checks `populateField[2]`
- `HoodieAvroParquetWriter.writeAvro/writeAvroWithMetadata()` guards `writeSupport.add()` with `populateField[2]`
- `HoodieSparkParquetWriter.writeRow/writeRowWithMetadata()` applies the same guard

### Null safety (Flink)

`AbstractHoodieRowData.getString()` is updated to handle null meta columns without an NPE: it returns null instead of calling `StringData.fromString(null)`.

### Files modified

1. `hudi-common/.../HoodieTableConfig.java`: config property
2. `hudi-client/hudi-client-common/.../HoodieWriteConfig.java`: getter + `boolean[]` helper
3. `hudi-common/.../HoodieAvroFileWriter.java`: new overload
4. `hudi-hadoop-common/.../HoodieAvroParquetWriter.java`: selective population + bloom filter guard
5. `hudi-hadoop-common/.../HoodieAvroOrcWriter.java`: selective population
6. `hudi-hadoop-common/.../HoodieAvroHFileWriter.java`: selective population
7. `hudi-client/hudi-spark-client/.../HoodieSparkParquetWriter.java`: selective population + bloom filter guard
8. `hudi-client/hudi-spark-client/.../HoodieRowCreateHandle.java`: null for excluded fields
9. `hudi-client/hudi-spark-client/.../HoodieDatasetBulkInsertHelper.scala`: null for excluded fields
10. `hudi-client/hudi-flink-client/.../HoodieRowDataCreateHandle.java`: null for excluded fields
11. `hudi-client/hudi-flink-client/.../AbstractHoodieRowData.java`: null safety fix
12. `hudi-client/hudi-spark-client/.../HoodieInternalRowFileWriterFactory.java`: bloom filter guard
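To make the per-row pattern shared by the four writer paths concrete, here is a minimal sketch in plain Java. The class and method are hypothetical stand-ins for the engine-specific writers, which operate on `GenericRecord`, `InternalRow`, `UTF8String[]`, or `RowData` rather than `String[]`, but the conditional-null logic is the same.

```java
import java.util.Arrays;

// Illustrative sketch (not the actual Hudi classes) of the hot-path pattern
// used by the modified writer paths: one precomputed boolean per meta field
// decides whether the computed value or null is written.
public class MetaColumnWriter {

  // Fills the five meta slots of a row, substituting null for excluded
  // fields so Parquet stores them via definition levels alone.
  static String[] fillMetaColumns(boolean[] populateField,
                                  String commitTime, String seqNo, String recordKey,
                                  String partitionPath, String fileName) {
    String[] computed = {commitTime, seqNo, recordKey, partitionPath, fileName};
    String[] metaCols = new String[5];
    for (int ordinal = 0; ordinal < 5; ordinal++) {
      // single array access per field: zero allocation, branch-predictor friendly
      metaCols[ordinal] = populateField[ordinal] ? computed[ordinal] : null;
    }
    return metaCols;
  }

  public static void main(String[] args) {
    // Flags corresponding to excluding everything except _hoodie_commit_time.
    boolean[] populateField = {true, false, false, false, false};
    String[] row = fillMetaColumns(populateField,
        "20240101000000", "seq-1", "key-1", "2024/01/01", "file-1.parquet");
    System.out.println(Arrays.toString(row));
    // prints [20240101000000, null, null, null, null]
  }
}
```

The downstream null-safety fix in `AbstractHoodieRowData.getString()` then only needs to tolerate a null slot instead of assuming every meta column holds a string.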
