linliu-code opened a new pull request, #18810:
URL: https://github.com/apache/hudi/pull/18810

   ### Describe the issue this Pull Request addresses
   
   The Hudi schema-evolution docs at 
<https://hudi.apache.org/docs/schema_evolution/> list `bytes → string` as a 
supported type promotion. Empirically on Hudi 1.x master:
   
   - ✅ The WRITE succeeds (initial bytes batch + evolved string batch both 
commit).
   - ✅ Reading WITHOUT data skipping succeeds and returns the matching row.
   - ❌ Reading WITH data skipping (the default) throws:
     ```
     java.lang.ClassCastException:
       class java.nio.HeapByteBuffer cannot be cast to class [B
       (java.nio.HeapByteBuffer and [B are in module java.base of loader 'bootstrap')
       at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getBinary(rows.scala:46)
       at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
       ...
     ```
   
   **This PR contains only a test** that documents the observed behavior. It 
does NOT include a fix. The intent is to ask reviewers to confirm:
   
   - **(a)** the test correctly demonstrates a bug: `bytes → string` is documented as supported, so data-skipping queries should not crash after the promotion, OR
   - **(b)** the test setup omits a config or usage detail whose absence causes the crash (in which case the docs probably need to clarify this corner of the promotion matrix).
   
   ### Summary and Changelog
   
   Adds one new test file: 
`hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBytesToStringPromotionDataSkipping.scala`.
   
   The test is `testBytesToStringPromotionReadAfterEvolution`, parameterized 
via `@CsvSource` across:
   
   | Dimension | Values |
   |---|---|
   | `tableType` | `COPY_ON_WRITE`, `MERGE_ON_READ` |
   | `dataSkippingEnabled` | `true`, `false` |
   
   → 4 cells total. Per the docs, all 4 should PASS.
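
For illustration only, the four cells are the Cartesian product of the two dimensions in the table above. The sketch below enumerates them in plain Java; the class name and the comma-separated row format are assumptions, not the actual `@CsvSource` rows used in the test file.

```java
import java.util.ArrayList;
import java.util.List;

final class PromotionTestMatrixSketch {
    // Hypothetical helper: builds one "tableType,dataSkippingEnabled" row per cell.
    static List<String> cells() {
        String[] tableTypes = {"COPY_ON_WRITE", "MERGE_ON_READ"};
        boolean[] dataSkippingEnabled = {true, false};
        List<String> rows = new ArrayList<>();
        for (String tableType : tableTypes) {
            for (boolean skipping : dataSkippingEnabled) {
                rows.add(tableType + "," + skipping);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints the 2 x 2 = 4 parameter combinations, one per line.
        cells().forEach(System.out::println);
    }
}
```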
   
   **Observed on this branch (off latest master `facb517ef957`):**
   
   ```
   Tests run: 4, Failures: 0, Errors: 2, Skipped: 0
   ```
   
   | Cell | Result |
   |---|---|
   | (COW, dataSkipping=true) | ❌ ERROR — `ClassCastException` |
   | (COW, dataSkipping=false) | ✅ PASS |
   | (MOR, dataSkipping=true) | ❌ ERROR — `ClassCastException` |
   | (MOR, dataSkipping=false) | ✅ PASS |
   
   The crash reproduces consistently on both COW and MOR with data skipping 
enabled.
   
   ### Test details
   
   1. Write a 3-row initial batch with `col_promote` as `BinaryType` (arrays 
like `Array[Byte](0x01, 0x02)`).
   2. Write a 2-row evolved batch with `col_promote` as `StringType` (values 
`"zz_alpha"`, `"zz_beta"`).
   3. Read with a string predicate: `SELECT _row_key FROM t WHERE col_promote = 'zz_alpha'`.
   4. Expected: 1 matching row returned.
   
   MDT col_stats is explicitly enabled and `col_promote` is included in the 
indexed-columns list, so the read path consults col_stats. The col_stats 
records for the pre-evolution files carry stats in the bytes union-member; the 
post-evolution file carries stats in the string union-member. The crash appears 
to happen when the comparator/projection path retrieves a `HeapByteBuffer` 
(Avro's bytes representation) where it expects a Java `byte[]`.
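
The suspected mismatch can be reproduced without Spark or Hudi. The sketch below is a plain-JVM illustration, not the Hudi or Spark code path: all class and method names are hypothetical, and decoding promoted bytes as UTF-8 is an assumption. It shows that a blind cast of a `ByteBuffer` to `byte[]` throws `ClassCastException`, and how a value could be normalized before it is handed to code that expects `byte[]`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BytesStatSketch {
    // Copy the buffer's remaining bytes without moving the caller's position.
    static byte[] toByteArray(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    // Accept either in-memory representation of a binary value.
    static byte[] normalize(Object value) {
        if (value instanceof byte[]) return (byte[]) value;
        if (value instanceof ByteBuffer) return toByteArray((ByteBuffer) value);
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    // Treat a stat from either union member as a string (UTF-8 is an assumption).
    static String asPromotedString(Object value) {
        if (value instanceof CharSequence) return value.toString();
        return new String(normalize(value), StandardCharsets.UTF_8);
    }

    // Mimics the failing pattern: an unchecked cast of the value to byte[].
    static boolean blindCastFails(Object value) {
        try {
            byte[] ignored = (byte[]) value;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object bytesMember = ByteBuffer.wrap(new byte[] {0x01, 0x02});
        System.out.println(blindCastFails(bytesMember)); // prints true
        System.out.println(normalize(bytesMember).length); // prints 2
    }
}
```

Whether the real fix belongs in the col_stats reader, the comparator, or the row projection is for reviewers to decide; the sketch only isolates the type mismatch reported in the stack trace.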
   
   ### Impact
   
   No source code change. New test only. CI will show 2 failing cells (the 
data-skipping=true cells) until either:
   - the production code is fixed to handle the bytes→string promotion in the 
data-skipping path, OR
   - the test is removed because the documented expectation was misread.
   
   ### Risk Level
   
   None — test-only.
   
   ### Documentation Update
   
   If the resolution is **(b)** (expected behavior), the schema-evolution docs 
should note that the promotion matrix's `bytes → string` row has a 
data-skipping limitation, or that queries on a column that has gone through 
this promotion must set `hoodie.enable.data.skipping=false`.
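
As a rough sketch of that workaround, the read options could be built as below. Only the option key comes from the text above; the class and helper names are hypothetical, and the Spark call mentioned in the comment assumes Spark and Hudi are on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DataSkippingWorkaroundSketch {
    // Option key as named in this PR description.
    static final String DATA_SKIPPING_KEY = "hoodie.enable.data.skipping";

    // Hypothetical helper: per-read options that turn data skipping off.
    static Map<String, String> readOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(DATA_SKIPPING_KEY, "false");
        return opts;
    }

    public static void main(String[] args) {
        // In a Spark job this map would typically be passed as
        // spark.read().format("hudi").options(readOptions()).load(basePath),
        // where basePath is a placeholder for the table location.
        System.out.println(readOptions());
    }
}
```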
   
   If the resolution is **(a)** (bug), no docs change is needed; the fix belongs in the comparator path, which must handle the bytes-to-string union-member transition.
   
   ### Related PRs in this series
   
   This is the third "repro for community confirmation" PR on schema-evolution 
behavior:
   
   - apache/hudi#18806: `reconcile.schema=true` blocks documented type promotions (int→long, int→double)
   - apache/hudi#18807: codifies MDT col_stats auto-extend behavior on `ADD COLUMN`
   - **this PR**: `bytes → string` promotion crashes data-skipping read
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable (this PR IS the test; no 
production code change)
   - [ ] CI passed — **EXPECTED to FAIL on the 2 dataSkipping=true cells; 
that's the repro this PR is opening for discussion**


