nsivabalan opened a new pull request, #19047:
URL: https://github.com/apache/hudi/pull/19047

   ### Describe the issue this Pull Request addresses
   
   Today the only way to opt out of populating Hudi's five meta columns is the 
all-or-nothing `hoodie.populate.meta.fields=false`. That saves storage but 
disables incremental queries (which require `_hoodie_commit_time`).
   
   A community user surfaced this trade-off (#18383, also discussed at #17959). 
The concrete ask was: "give me the storage saving without giving up incremental 
queries." A separate exploratory PR (#18384) attempted a fully orthogonal 
exclude-list with per-field branching across the writer/reader paths; that 
surface ended up being ~2300 lines across 87 files. This PR proposes a simpler, 
scoped alternative: three named modes instead of the full 2^5 matrix.
   
   Closes #18383.
   
   ### Summary and Changelog
   
   Adds an additive opt-in flag, `hoodie.meta.fields.commit.time.enabled`, that 
— when set together with `hoodie.populate.meta.fields=false` — additionally 
populates `_hoodie_commit_time` so incremental queries remain functional. The 
remaining four meta columns stay null on disk, preserving the storage saving.
   
   The three resulting modes, plus the one rejected combination:
   
   | `populate.meta.fields` | `meta.fields.commit.time.enabled` | Effective mode |
   |---|---|---|
   | `true` (default) | ignored | **ALL**: today's default |
   | `false` | `false` (default) | **NONE**: today's `populate.meta.fields=false` |
   | `false` | `true` | **COMMIT_TIME_ONLY**: new |
   | `true` | `true` | rejected at writer init (ambiguous) |
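
   As a rough illustration of the table above, the mapping from the two booleans to an effective mode could be sketched as follows. `MetaFieldsMode` and `resolve()` are hypothetical names chosen for this example, not classes or methods from this PR, and the sketch follows the writer-side validation row (it rejects `true` + `true`).

```java
// Hypothetical sketch: MetaFieldsMode and resolve() are illustrative names,
// not classes or methods from this PR or from Hudi.
enum MetaFieldsMode {
    ALL,
    NONE,
    COMMIT_TIME_ONLY;

    /**
     * Maps the two flags to an effective mode.
     *
     * @param populateMetaFields value of hoodie.populate.meta.fields
     * @param commitTimeEnabled  value of hoodie.meta.fields.commit.time.enabled
     */
    static MetaFieldsMode resolve(boolean populateMetaFields, boolean commitTimeEnabled) {
        // The ambiguous combination is rejected loudly, as at writer init.
        if (populateMetaFields && commitTimeEnabled) {
            throw new IllegalArgumentException(
                "hoodie.meta.fields.commit.time.enabled=true is only valid together with "
                    + "hoodie.populate.meta.fields=false");
        }
        if (populateMetaFields) {
            return ALL;
        }
        return commitTimeEnabled ? COMMIT_TIME_ONLY : NONE;
    }
}
```

   Under these assumptions, `MetaFieldsMode.resolve(false, true)` yields `COMMIT_TIME_ONLY`, while `MetaFieldsMode.resolve(true, true)` throws.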
   
   #### Why a separate boolean instead of a single enum
   
   - **Bit-identical backward compatibility.** Every existing table on disk 
resolves to ALL or NONE without any new property being read. No reader-side 
migration. No precedence rules.
   - **Pre-1.3.0 readers behave correctly.** They don't know the new property 
exists. They open a COMMIT_TIME_ONLY table, see `populate.meta.fields=false`, 
and behave as a NONE reader — they cannot do incremental queries on the table, 
but they don't produce silent wrong results either.
   - **Encodes "additive" structurally.** The new flag only modifies a NONE 
table — it's literally a NONE table plus one populated column. Most code paths 
that branch on `populate.meta.fields` keep working unchanged; only paths that 
specifically need commit_time consult the new accessor.
   
   #### Plug points
   
   **Config + accessors (`hudi-common` / `hudi-client-common`):**
   - New `HoodieTableConfig.META_FIELDS_COMMIT_TIME_ENABLED` property.
   - New accessors: `isCommitTimeOnlyMetaFieldsMode()`, 
`isCommitTimePopulated()`, `isRecordKeyPopulated()` — three named predicates.
   - `HoodieWriteConfig` pass-throughs + 
`Builder.withMetaFieldsCommitTimeEnabled()`.
   - `HoodieWriteConfig.validate()` rejects the `populate=true` + 
`commit.time=true` combination at build time.
   - `HoodieTableMetaClient.TableBuilder.setMetaFieldsCommitTimeEnabled()` 
persists the flag on `hoodie.properties` at table init.
   - `HoodieSparkSqlWriter` wires both fresh-table and bootstrap creation paths.
   
   **Writer engines:**
   - `HoodieAvroParquetWriter`, `HoodieSparkParquetWriter`, 
`HoodieRowCreateHandle` each gain a `commitTimeOnly` constructor overload. When 
`commitTimeOnly && !populateMetaFields`, they populate `_hoodie_commit_time` 
and the derived seq id; the other four columns stay null. Bloom-filter / 
record-key index registration is intentionally skipped (the record-key column 
is not populated).
   
   **Read path (incremental query rejection):**
   - `IncrementalRelationV1/V2`, `MergeOnReadIncrementalRelationV1/V2` now 
check `isCommitTimePopulated()` rather than `populateMetaFields()` — 
COMMIT_TIME_ONLY tables are accepted, NONE tables remain rejected with a 
clearer message.
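
   A minimal sketch of how the read-path check might look, assuming the three predicates derive from the two flags as shown. Only the predicate names come from this description; the `MetaFieldsView` class, the `checkIncrementalSupported()` method, the predicate bodies, and the error message are assumptions made for illustration.

```java
// Hypothetical sketch: MetaFieldsView and checkIncrementalSupported() are
// illustrative. Only the three predicate names appear in the PR description;
// their bodies here are assumed semantics.
final class MetaFieldsView {
    private final boolean populateMetaFields;
    private final boolean commitTimeEnabled;

    MetaFieldsView(boolean populateMetaFields, boolean commitTimeEnabled) {
        this.populateMetaFields = populateMetaFields;
        this.commitTimeEnabled = commitTimeEnabled;
    }

    /** True only for a NONE table with the commit-time flag switched on. */
    boolean isCommitTimeOnlyMetaFieldsMode() {
        return !populateMetaFields && commitTimeEnabled;
    }

    /** True in ALL and COMMIT_TIME_ONLY modes, false in NONE. */
    boolean isCommitTimePopulated() {
        return populateMetaFields || commitTimeEnabled;
    }

    /** True only in ALL mode; the record-key column is null otherwise. */
    boolean isRecordKeyPopulated() {
        return populateMetaFields;
    }

    /** Guard an incremental relation could call before planning the query. */
    void checkIncrementalSupported() {
        if (!isCommitTimePopulated()) {
            throw new IllegalStateException(
                "Incremental queries require _hoodie_commit_time; set "
                    + "hoodie.meta.fields.commit.time.enabled=true or "
                    + "hoodie.populate.meta.fields=true");
        }
    }
}
```

   With this shape, a NONE table (`new MetaFieldsView(false, false)`) fails the guard, while ALL and COMMIT_TIME_ONLY tables pass it.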
   
   #### Scope
   
   - ✅ Spark Avro / Spark Row writer paths.
   - ✅ Spark Parquet bulk-insert.
   - ✅ Incremental query rejection logic across V1 / V2 / CoW / MoR.
   - ❌ Flink RowData writer — out of scope for this patch; behaves as NONE 
under COMMIT_TIME_ONLY (no commit_time populated). Tracked as a follow-up.
   - ❌ ORC / HFile writers — ORC continues to populate all meta fields 
unconditionally (legacy behavior); HFile is used only by MDT which is always 
ALL.
   
   ### Impact
   
   - **Storage layout**: no change for tables that don't opt in. New optional 
mode for tables that do. Default behavior unchanged.
   - **API**: no public API breakage. New table property, new accessors, new 
builder method — all additive.
   - **Configuration**:
     - `hoodie.meta.fields.commit.time.enabled` (default `false`). Only 
meaningful when `hoodie.populate.meta.fields=false`. Persisted on 
`hoodie.properties` at table init.
   - **Performance**: writer hot path adds one boolean check per row when in 
the new mode; the bool is final and cached in the writer constructor.
   - **Forward-compat**: pre-1.3.0 readers ignore the new flag and treat the 
table as NONE — no silent wrong results.
   
   ### Risk Level
   
   low
   
   Additive change with a narrow scope. The default path is untouched. The 
validation guard rejects the ambiguous combination loudly at writer init. 
Existing `TestHoodieTableConfig` regression coverage (93 tests) passes 
unchanged.
   
   ### Documentation Update
   
   - New config `hoodie.meta.fields.commit.time.enabled` documented via 
`@ConfigProperty` annotation on `HoodieTableConfig`.
   - No public-facing docs update is needed in this patch. If the website has a page on meta fields, a separate docs PR will add the three-mode table.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   

