GitHub user suryaprasanna edited a discussion: Support Selective Metafield 
Population: Enable Only _hoodie_commit_time for Incremental Reads

**Context**

Hudi currently supports an all-or-nothing approach for metafield population. 
Users can either:
  - Populate all metafields (**_hoodie_record_key, _hoodie_commit_time, 
_hoodie_partition_path, _hoodie_file_name**, etc.)
  - Disable all metafields to save on storage costs

When metafields are disabled, Hudi operations still function correctly because 
certain fields like _hoodie_record_key are virtualized: they are generated on 
the fly during operations such as upserts, so the field does not need to be 
physically stored.
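
To make the virtualization idea concrete, here is a minimal conceptual sketch in Python. It is not Hudi's key generator code: the function name `virtual_record_key`, the sample columns, and the `field:value` format for composite keys are assumptions chosen for illustration. The point is only that the key can be recomputed from data columns each time it is needed.

```python
# Conceptual sketch only; names and key format are illustrative assumptions,
# not Hudi's actual implementation.
def virtual_record_key(record: dict, key_fields: list) -> str:
    """Derive a record key from data columns instead of a stored metafield."""
    if len(key_fields) == 1:
        # Single key field: use its value directly.
        return str(record[key_fields[0]])
    # Several key fields: join them as "field:value" pairs.
    return ",".join(f"{name}:{record[name]}" for name in key_fields)


# Hypothetical row with no physically stored _hoodie_record_key.
row = {"order_id": 42, "region": "eu", "amount": 9.5}
single_key = virtual_record_key(row, ["order_id"])
composite_key = virtual_record_key(row, ["region", "order_id"])
```

Because the key is a pure function of the row's own columns, nothing is lost by not storing it. The same does not hold for the commit time, which is not derivable from the row's data.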

However, incremental reads rely on the **_hoodie_commit_time** metafield to 
identify and read delta changes at the record level. This field is essential 
for incremental query patterns, enabling Hudi to provide efficient record-level 
change tracking.
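
As a rough illustration of why this field matters, the sketch below models an incremental read as a filter on **_hoodie_commit_time**. It is a plain-Python model, not Hudi's reader: the function name `incremental_read`, the sample records, and the boundary handling (begin exclusive, end inclusive) are assumptions of this sketch.

```python
# Conceptual model of a record-level incremental read; not Hudi's reader code.
def incremental_read(records, begin_instant, end_instant=None):
    """Return records committed after begin_instant, up to end_instant if given."""
    selected = []
    for rec in records:
        commit_time = rec.get("_hoodie_commit_time")
        if commit_time is None:
            # Without the stored commit time, the record cannot be placed
            # on the timeline, so record-level filtering is impossible.
            raise ValueError("record has no _hoodie_commit_time")
        # Instant times are fixed-width timestamp strings here, so plain
        # string comparison orders them chronologically.
        if commit_time > begin_instant and (
            end_instant is None or commit_time <= end_instant
        ):
            selected.append(rec)
    return selected


# Hypothetical records written by three different commits.
records = [
    {"id": 1, "_hoodie_commit_time": "20240101000000000"},
    {"id": 2, "_hoodie_commit_time": "20240102000000000"},
    {"id": 3, "_hoodie_commit_time": "20240103000000000"},
]
changed = incremental_read(records, begin_instant="20240101000000000")
```

In this model, a record whose commit time is missing fails the read outright, which mirrors the limitation described above for tables written with metafields disabled.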

**Problem Statement**

Users face a trade-off between storage efficiency and incremental read 
capabilities:

  - Scenario 1: Users who disable metafields save storage but lose the ability 
to perform incremental reads because **_hoodie_commit_time** is not available
  - Scenario 2: Users who enable metafields can perform incremental reads but 
pay for storing all metafields, even though most of them (like 
**_hoodie_record_key, _hoodie_partition_path**) can be virtualized and aren't 
strictly necessary

The gap: There's no option to selectively enable only the metafields that 
cannot be virtualized or are critical for certain query patterns (like 
**_hoodie_commit_time** for incremental reads) while excluding others that can 
be generated on the fly.

This results in unnecessary storage overhead for users who need incremental 
reads but don't require the other metafields to be physically stored.

**Proposed Solution**

Introduce selective metafield population that allows users to:
  - Enable only **_hoodie_commit_time** for incremental read support
  - Nullify/disable other metafields (**_hoodie_record_key, 
_hoodie_partition_path, _hoodie_file_name**) that can be virtualized
  - Achieve storage savings while maintaining incremental query capabilities

This would provide a middle ground between the two current all-or-nothing 
options, balancing storage efficiency against functional requirements.
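
One possible shape for this behavior, sketched in plain Python under stated assumptions: the function `populate_meta_fields` and the `enabled` set are hypothetical and do not correspond to any existing Hudi config or API. The sketch only shows the intended outcome, where the selected metafield is written and the others are nullified.

```python
# Hypothetical sketch of selective metafield population; the function and the
# "enabled" set are illustrative and not an existing Hudi option.
# Only the metafields named in this discussion are listed here.
META_FIELDS = (
    "_hoodie_commit_time",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def populate_meta_fields(record: dict, meta_values: dict, enabled: set) -> dict:
    """Write enabled metafields and nullify the rest."""
    out = dict(record)
    for name in META_FIELDS:
        # Disabled metafields stay in the schema but carry no value.
        out[name] = meta_values.get(name) if name in enabled else None
    return out


# Hypothetical values the writer would otherwise store for one record.
meta = {
    "_hoodie_commit_time": "20240102000000000",
    "_hoodie_record_key": "42",
    "_hoodie_partition_path": "region=eu",
    "_hoodie_file_name": "example-file.parquet",
}
written = populate_meta_fields(
    {"order_id": 42, "region": "eu"},
    meta,
    enabled={"_hoodie_commit_time"},
)
```

Keeping the nullified columns in the schema is one design choice among several; all-null columns typically compress to very little in columnar formats, but dropping them from the schema entirely would be another option to discuss.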

GitHub link: https://github.com/apache/hudi/discussions/17959
