linliu-code opened a new pull request, #18807: URL: https://github.com/apache/hudi/pull/18807
### Describe the issue this Pull Request addresses

When a new column is added to a Hudi table via schema evolution, the MDT column_stats index behaves differently based on whether the user has set an explicit `hoodie.metadata.index.column.stats.column.list`:

| Mode | New column auto-indexed? |
|---|---|
| Default (no explicit list) | ✅ Yes |
| Explicit list (without the new column name) | ❌ No |

This PR contains only tests that codify the two behaviors. No production code change.

**The questions this PR is opening for community confirmation:**

**Q1.** Is the **default-mode auto-extend** behavior intentional and safe to rely on? Empirically, files written after an `ADD COLUMN` evolution have populated `col_stats` records for the new column; files written before the evolution have null stats for the new column (since the column didn't exist in those files). That's the right behavior, but worth confirming it's by-design.

**Q2.** Is the **explicit-list "no auto-extend"** behavior the intended strict opt-in design? Users with an explicit `column.list` who add a column at the source will silently lose data skipping on the new column: queries that filter on it will fall back to full-file scans without any warning. Should there be a way to opt into auto-extend with an explicit list (e.g., a column pattern like `"user_*"`, or an `auto-extend-on-schema-evolution` flag)?
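For readers who want the two modes in concrete terms, here is a minimal sketch of the selection rule the tests observe. It is a toy model in plain Python, not Hudi's implementation: the function `indexed_columns` and the sample schemas are hypothetical, and it ignores any type or column-count restrictions Hudi may apply when choosing columns to index. Only the config key is taken from the description above.

```python
# Toy model of the observed behavior; NOT Hudi code.
# The config key comes from the PR description; everything else is illustrative.
COLUMN_LIST_KEY = "hoodie.metadata.index.column.stats.column.list"


def indexed_columns(schema_columns: list[str], options: dict[str, str]) -> list[str]:
    """Return the columns that would get col_stats records under this model."""
    explicit = options.get(COLUMN_LIST_KEY)
    if not explicit:
        # Default mode: no explicit list, so every column in the
        # current (possibly evolved) schema is treated as indexed.
        return list(schema_columns)
    # Explicit-list mode: only the listed columns are indexed, even if
    # the schema has since gained new columns.
    wanted = {name.strip() for name in explicit.split(",") if name.strip()}
    return [col for col in schema_columns if col in wanted]


# Schema after adding col_c via schema evolution (hypothetical column names).
schema_after = ["col_a", "col_b", "col_c"]

print(indexed_columns(schema_after, {}))
# ['col_a', 'col_b', 'col_c']
print(indexed_columns(schema_after, {COLUMN_LIST_KEY: "col_a,col_b"}))
# ['col_a', 'col_b']
```

Under this model, the Q2 concern is the second call: `col_c` is dropped from the indexed set with no error or warning, so nothing tells the operator that the list needs extending.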
### Summary and Changelog

Adds one test file:

- `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColStatsAutoExtendOnAddColumn.scala`

Two test methods, each parameterized across `COPY_ON_WRITE` and `MERGE_ON_READ`:

| Test | Scenario | Expected (and observed) |
|---|---|---|
| `testNewColumnInDefaultModeIsAutoIndexed` | No `column.list` set, add column via schema evolution | New column gets col_stats records in post-evolution files |
| `testNewColumnWithExplicitListIsNotAutoIndexed` | Explicit `column.list = "col_a,col_b"`, add `col_c` via schema evolution | `col_c` gets NO col_stats records |

All 4 cells pass on current master. The file exists to:

- Make the actual behavior visible (which was non-obvious until we ran the probe).
- Guard against future regression of the default-mode auto-extend.
- Surface the explicit-list ergonomics question for community discussion.

### Impact

No source code change. New tests only. User-facing implication (if confirmed):

- Default mode: no action needed when adding columns at the source.
- Explicit-list mode: users must remember to extend `hoodie.metadata.index.column.stats.column.list` when adding columns, or data skipping silently regresses.

### Risk Level

None (test-only).

### Documentation Update

If Q2 resolves as "current behavior is intentional," the Hudi schema-evolution and metadata-table docs should clarify that an explicit `column.list` is strict opt-in and the operator must extend it on schema evolution. If Q2 resolves as "should auto-extend," a separate PR can add the opt-in mechanism.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable (this PR IS the tests; no production code change)
- [ ] CI passed (expected to pass; all 4 cells pass on current master)

--
This is an automated message from the Apache Git Service.
