rahil-c opened a new issue, #18681: URL: https://github.com/apache/hudi/issues/18681
## Summary

When using Lance as the Hudi base file format and reading a nested struct in a way that prunes some of its children (e.g. selecting only `image_bytes.reference.offset` and `image_bytes.reference.length` from a struct that also has `external_path` and `managed`), the Lance vectorized reader throws `UnsupportedOperationException` from `ArrowVectorAccessor.getLong`.

This is **upstream**, in [`lance-format/lance-spark`'s `LanceArrowColumnVector`](https://github.com/lance-format/lance-spark/issues/499); Hudi's `LanceRecordIterator` is just the caller. Filing this issue to track the impact on the Hudi side and decide whether we want any temporary mitigation until upstream lands a fix.

## Upstream issue

https://github.com/lance-format/lance-spark/issues/499

## Repro from a Hudi context

Hudi 1.2.0-rc1 + `lance-spark-bundle-3.5_2.12` 0.4.0, Hudi table with `'hoodie.table.base.file.format' = 'lance'` and a `BLOB` column. The descriptor read from `BatchedBlobReader` works fine because it projects the full struct; a user-written query that prunes nested children fails.

Failing:

```sql
SELECT image_bytes.reference.offset, image_bytes.reference.length
FROM hudi_lance_table
```

Working:

```sql
SELECT image_bytes.reference.external_path,
       image_bytes.reference.offset,
       image_bytes.reference.length,
       image_bytes.reference.managed
FROM hudi_lance_table
```

## Stack trace (relevant frames)

```
java.lang.UnsupportedOperationException
  at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getLong(ArrowColumnVector.java:238)
  at org.apache.spark.sql.vectorized.ArrowColumnVector.getLong(ArrowColumnVector.java:90)
  at org.lance.spark.vectorized.LanceArrowColumnVector.getLong(LanceArrowColumnVector.java:310)
  at org.apache.spark.sql.vectorized.ColumnarRow.getLong(ColumnarRow.java:116)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.hudi.io.storage.LanceRecordIterator.next(LanceRecordIterator.java:162)
```

## Possible Hudi-side actions (to discuss)

1. **Wait for upstream**: bump `lance-spark` once [lance-format/lance-spark#499](https://github.com/lance-format/lance-spark/issues/499) is fixed and released. Lowest effort, no Hudi changes.
2. **Documentation**: add a "Known issues" note in the Lance integration docs so users hit it less.
3. **Workaround in `LanceRecordIterator`**: force full nested-struct projection when binding Lance vectors so partial pruning never reaches `LanceArrowColumnVector`. Higher effort and may regress read perf on wide structs; only worth it if upstream stalls.

## Environment

- Apache Hudi 1.2.0-rc1
- Spark 3.5, Scala 2.12
- `lance-spark-bundle-3.5_2.12` 0.4.0
- macOS, JDK 11

## Notes

Discovered while building demo assertions in `hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py`. The demo's `assert_descriptors()` step now projects all four reference children as a workaround.
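
For illustration, the user-side workaround from the Notes (project every child of the struct, then discard the ones you do not need) could be sketched roughly as below. This is not the demo's actual code: `full_struct_projection` and `keep_wanted` are hypothetical helper names, and the child list is copied from the issue text. The sketch only builds the query string and trims already-fetched rows; it does not talk to Spark.

```python
# Children of image_bytes.reference as listed in this issue.
REFERENCE_CHILDREN = ["external_path", "offset", "length", "managed"]


def full_struct_projection(table, struct_path, children):
    """Build a SELECT naming every child of struct_path (hypothetical helper).

    Naming all children matches the "Working" query in this issue, where
    no nested child is pruned from the read.
    """
    cols = ", ".join(f"{struct_path}.{child}" for child in children)
    return f"SELECT {cols} FROM {table}"


def keep_wanted(row, wanted):
    """Drop unneeded children from a fetched row (a dict keyed by child name)."""
    return {name: row[name] for name in wanted}


query = full_struct_projection(
    "hudi_lance_table", "image_bytes.reference", REFERENCE_CHILDREN
)
print(query)
```

The trimming is done on rows that have already been collected, on the assumption (not verified here) that a further `select` on the DataFrame could be folded back into a pruned read by the optimizer and hit the same exception.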
