linliu-code opened a new pull request, #18810: URL: https://github.com/apache/hudi/pull/18810
### Describe the issue this Pull Request addresses

The Hudi schema-evolution docs at <https://hudi.apache.org/docs/schema_evolution/> list `bytes → string` as a supported type promotion. Empirically on Hudi 1.x master:

- ✅ The WRITE succeeds (initial bytes batch + evolved string batch both commit).
- ✅ Reading WITHOUT data skipping succeeds and returns the matching row.
- ❌ Reading WITH data skipping (the default) throws:

```
java.lang.ClassCastException: class java.nio.HeapByteBuffer cannot be cast to class [B (java.nio.HeapByteBuffer and [B are in module java.base of loader 'bootstrap')
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getBinary(rows.scala:46)
    at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
    ...
```

**This PR contains only a test** that documents the observed behavior. It does NOT include a fix. The intent is to ask reviewers to confirm:

- **(a)** the test correctly demonstrates a bug: `bytes → string` is documented as supported, so data-skipping queries should not crash after the promotion, OR
- **(b)** the test setup is missing a config or usage detail that the empirical crash depends on (in which case the docs probably need a clarification on this corner of the promotion matrix).

### Summary and Changelog

Adds one new test file: `hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBytesToStringPromotionDataSkipping.scala`.

The test is `testBytesToStringPromotionReadAfterEvolution`, parameterized via `@CsvSource` across:

| Dimension | Values |
|---|---|
| `tableType` | `COPY_ON_WRITE`, `MERGE_ON_READ` |
| `dataSkippingEnabled` | `true`, `false` |

→ 4 cells total. Per the docs, all 4 should PASS.

**Observed on this branch (off latest master `facb517ef957`):**

```
Tests run: 4, Failures: 0, Errors: 2, Skipped: 0
```

| Cell | Result |
|---|---|
| (COW, dataSkipping=true) | ❌ ERROR: `ClassCastException` |
| (COW, dataSkipping=false) | ✅ PASS |
| (MOR, dataSkipping=true) | ❌ ERROR: `ClassCastException` |
| (MOR, dataSkipping=false) | ✅ PASS |

The crash reproduces consistently on both COW and MOR with data skipping enabled.

### Test details

1. Write a 3-row initial batch with `col_promote` as `BinaryType` (arrays like `Array[Byte](0x01, 0x02)`).
2. Write a 2-row evolved batch with `col_promote` as `StringType` (values `"zz_alpha"`, `"zz_beta"`).
3. Read with a string predicate: `SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'`.
4. Expected: 1 matching row returned.

MDT col_stats is explicitly enabled and `col_promote` is included in the indexed-columns list, so the read path consults col_stats. The col_stats records for the pre-evolution files carry stats in the bytes union-member; the post-evolution file carries stats in the string union-member. The crash appears to happen when the comparator/projection path retrieves a `HeapByteBuffer` (Avro's bytes representation) where it expects a Java `byte[]`.

### Impact

No source code change. New test only. CI will show 2 failing cells (the data-skipping=true cells) until either:

- the production code is fixed to handle the bytes→string promotion in the data-skipping path, OR
- the test is removed because the documented expectation was misread.

### Risk Level

None (test-only).

### Documentation Update

If the resolution is **(b)** (expected behavior), the schema-evolution docs should note that the promotion matrix's `bytes → string` row has a data-skipping limitation, or that queries on a column that has gone through this promotion must set `hoodie.enable.data.skipping=false`.

If the resolution is **(a)** (bug), no docs change is needed; the fix belongs in the comparator path, which must handle the bytes-to-string union-member transition.

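The cast failure described under "Test details" can be illustrated in isolation, without Spark or Hudi. The following is a minimal sketch, not the Hudi code path: the class `BytesPromotionSketch` and its helpers `toByteArray`, `directCastFails`, and `promoteToString` are hypothetical names introduced here, and decoding the bytes as UTF-8 is an assumption about how the promoted value would be interpreted. It shows that a value held as a `ByteBuffer` cannot be cast to `byte[]`, and one way a reader could normalize both representations before comparing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BytesPromotionSketch {

    // Hypothetical helper (not a Hudi API): accepts either a byte[] or a
    // ByteBuffer and returns the content as a byte[].
    static byte[] toByteArray(Object value) {
        if (value instanceof byte[]) {
            return (byte[]) value;
        }
        if (value instanceof ByteBuffer) {
            // duplicate() keeps the caller's buffer position untouched.
            ByteBuffer copy = ((ByteBuffer) value).duplicate();
            byte[] out = new byte[copy.remaining()];
            copy.get(out);
            return out;
        }
        throw new IllegalArgumentException("Unsupported value: " + value);
    }

    // Returns true when a blind (byte[]) cast throws ClassCastException,
    // which is the same kind of failure the stack trace reports.
    static boolean directCastFails(Object value) {
        try {
            byte[] ignored = (byte[]) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Assumption: the promoted string is the UTF-8 decoding of the bytes.
    static String promoteToString(Object value) {
        return new String(toByteArray(value), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Object fromAvro = ByteBuffer.wrap("zz_alpha".getBytes(StandardCharsets.UTF_8));
        System.out.println("direct cast fails: " + directCastFails(fromAvro));
        System.out.println("normalized value:  " + promoteToString(fromAvro));
    }
}
```

Whether the real fix should normalize at the point where col_stats values are read, or later in the comparator, is a question for reviewers; the sketch only demonstrates the type mismatch itself.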
### Related PRs in this series

This is the third "repro for community confirmation" PR on schema-evolution behavior:

- apache/hudi#18806: `reconcile.schema=true` blocks documented type promotions (int→long, int→double)
- apache/hudi#18807: codifies MDT col_stats auto-extend behavior on `ADD COLUMN`
- **this PR**: `bytes → string` promotion crashes data-skipping read

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable (this PR IS the test; no production code change)
- [ ] CI passed: **EXPECTED to FAIL on the 2 dataSkipping=true cells; that's the repro this PR is opening for discussion**
