linliu-code opened a new pull request, #18807:
URL: https://github.com/apache/hudi/pull/18807

   ### Describe the issue this Pull Request addresses
   
   When a new column is added to a Hudi table via schema evolution, the
metadata table (MDT) `column_stats` index behaves differently depending on
whether the user has set an explicit
`hoodie.metadata.index.column.stats.column.list`:
   
   | Mode | New column auto-indexed? |
   |---|---|
   | Default (no explicit list) | ✅ Yes |
   | Explicit list (without the new column name) | ❌ No |
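
   As a rough illustration of the two modes in the table above, the writer
options might differ only in whether the column list key is present. This is a
sketch, not code from the PR: the object and value names are hypothetical, and
it assumes the metadata table and the column_stats index are switched on
through `hoodie.metadata.enable` and
`hoodie.metadata.index.column.stats.enable`.

```scala
object ColStatsWriteOptions {
  // Config key quoted in this PR description.
  val ColumnListKey: String = "hoodie.metadata.index.column.stats.column.list"

  // Options shared by both modes (assumption: metadata table and
  // column_stats index are enabled explicitly).
  val common: Map[String, String] = Map(
    "hoodie.metadata.enable" -> "true",
    "hoodie.metadata.index.column.stats.enable" -> "true"
  )

  // Default mode: the column list key is simply absent.
  val defaultMode: Map[String, String] = common

  // Explicit-list mode: only the named columns are requested.
  def explicitMode(columns: Seq[String]): Map[String, String] =
    common + (ColumnListKey -> columns.mkString(","))
}
```

   Either map could then be merged into the options passed to a Hudi write,
for example through `df.write.format("hudi").options(...)`.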
   
   This PR contains only tests that codify the two behaviors. No production 
code change.
   
   **The questions this PR is opening for community confirmation:**
   
   **Q1.** Is the **default-mode auto-extend** behavior intentional and safe to
rely on? Empirically, files written after an `ADD COLUMN` evolution have
populated `col_stats` records for the new column; files written before the
evolution have null stats for the new column (since the column didn't exist in
those files). That looks like the right behavior, but it is worth confirming
that it is by design.
   
   **Q2.** Is the **explicit-list "no auto-extend"** behavior the intended 
strict opt-in design? Users with an explicit `column.list` who add a column at 
the source will silently lose data skipping on the new column — queries that 
filter on it will fall back to full-file scans without any warning. Should 
there be a way to opt into auto-extend with an explicit list (e.g., a column 
pattern like `"user_*"`, or an `auto-extend-on-schema-evolution` flag)?
   
   ### Summary and Changelog
   
   Adds one test file:
   - `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColStatsAutoExtendOnAddColumn.scala`
   
   Two test methods, each parameterized across `COPY_ON_WRITE` and 
`MERGE_ON_READ`:
   
   | Test | Scenario | Expected (and observed) |
   |---|---|---|
   | `testNewColumnInDefaultModeIsAutoIndexed` | No `column.list` set, add column via schema evolution | New column gets col_stats records in post-evolution files |
   | `testNewColumnWithExplicitListIsNotAutoIndexed` | Explicit `column.list = "col_a,col_b"`, add `col_c` via schema evolution | `col_c` gets NO col_stats records |
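
   The expected outcomes in the table can be phrased as two small predicates
over col_stats entries. The sketch below is not the code in the test file: the
`ColStatsEntry` type and the function names are hypothetical, each entry is
reduced to a (file, column) pair, and requiring stats in every post-evolution
file is an assumption about how strictly the default-mode check is read.

```scala
// Simplified stand-in for one col_stats entry: which file, which column.
final case class ColStatsEntry(fileName: String, columnName: String)

object ColStatsChecks {
  // Files that have a col_stats entry for the given column.
  def filesWithStats(entries: Seq[ColStatsEntry], column: String): Set[String] =
    entries.collect { case e if e.columnName == column => e.fileName }.toSet

  // Default mode: every post-evolution file carries stats for the new column.
  def autoIndexed(entries: Seq[ColStatsEntry],
                  newColumn: String,
                  postEvolutionFiles: Set[String]): Boolean =
    postEvolutionFiles.nonEmpty &&
      postEvolutionFiles.subsetOf(filesWithStats(entries, newColumn))

  // Explicit-list mode: no file carries stats for the new column.
  def notIndexed(entries: Seq[ColStatsEntry], newColumn: String): Boolean =
    filesWithStats(entries, newColumn).isEmpty
}
```

   In a real test the entries would come from the table's column_stats
records rather than from hand-built values.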
   
   All four test and table-type combinations (two tests, each run on `COPY_ON_WRITE` and `MERGE_ON_READ`) pass on current master. The file exists to:
   - Make the actual behavior visible (which was non-obvious until we ran the 
probe).
   - Guard against future regression of the default-mode auto-extend.
   - Surface the explicit-list ergonomics question for community discussion.
   
   ### Impact
   
   No source code change. New tests only.
   
   User-facing implication (if confirmed):
   - Default mode: no action needed when adding columns at the source.
   - Explicit-list mode: users must remember to extend 
`hoodie.metadata.index.column.stats.column.list` when adding columns, or data 
skipping silently regresses.
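
   For explicit-list mode, the manual step amounts to appending the new column
names to the existing comma-separated value before the next write. A minimal
sketch, assuming the list is a plain comma-separated string; `ColumnListHelper`
and `extend` are hypothetical names, not Hudi APIs:

```scala
object ColumnListHelper {
  // Returns the comma-separated list with any missing columns appended,
  // keeping the existing order and dropping duplicates and blank entries.
  def extend(currentList: String, newColumns: Seq[String]): String = {
    val existing = currentList.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    val added = newColumns.map(_.trim).filter(_.nonEmpty)
    (existing ++ added).distinct.mkString(",")
  }
}
```

   For example, `ColumnListHelper.extend("col_a,col_b", Seq("col_c"))` yields
`"col_a,col_b,col_c"`, which would then be set as the new value of
`hoodie.metadata.index.column.stats.column.list`.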
   
   ### Risk Level
   
   None — test-only.
   
   ### Documentation Update
   
   If Q2 resolves as "current behavior is intentional," the Hudi 
schema-evolution and metadata-table docs should clarify that an explicit 
`column.list` is strict opt-in and the operator must extend it on schema 
evolution.
   
   If Q2 resolves as "should auto-extend," a separate PR can add the opt-in 
mechanism.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable (this PR IS the tests; no 
production code change)
   - [ ] CI passed (expected to pass; all four test and table-type combinations pass on current master)

