rahil-c opened a new issue, #18681:
URL: https://github.com/apache/hudi/issues/18681

   ## Summary
   
   When using Lance as the Hudi base file format and reading a nested struct in 
a way that prunes some of its children (e.g. selecting only 
`image_bytes.reference.offset` and `image_bytes.reference.length` from a struct 
that also has `external_path` and `managed`), the Lance vectorized reader 
throws `UnsupportedOperationException` from `ArrowVectorAccessor.getLong`.
   
   The bug is **upstream**, in [`lance-format/lance-spark`'s `LanceArrowColumnVector`](https://github.com/lance-format/lance-spark/issues/499); Hudi's `LanceRecordIterator` is just the caller. I am filing this issue to track the impact on the Hudi side and to decide whether we want a temporary mitigation until upstream lands a fix.
   
   ## Upstream issue
   
   https://github.com/lance-format/lance-spark/issues/499
   
   ## Repro from a Hudi context
   
   Setup: Hudi 1.2.0-rc1 with `lance-spark-bundle-3.5_2.12` 0.4.0, and a Hudi table with `'hoodie.table.base.file.format' = 'lance'` and a `BLOB` column. The descriptor read from `BatchedBlobReader` works because it projects the full struct; a user-written query that prunes nested children fails.
   
   Failing:
   ```sql
   SELECT image_bytes.reference.offset,
          image_bytes.reference.length
   FROM hudi_lance_table
   ```
   
   Working:
   ```sql
   SELECT image_bytes.reference.external_path,
          image_bytes.reference.offset,
          image_bytes.reference.length,
          image_bytes.reference.managed
   FROM hudi_lance_table
   ```
   
   ## Stack trace (relevant frames)
   
   ```
   java.lang.UnsupportedOperationException
       at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getLong(ArrowColumnVector.java:238)
       at org.apache.spark.sql.vectorized.ArrowColumnVector.getLong(ArrowColumnVector.java:90)
       at org.lance.spark.vectorized.LanceArrowColumnVector.getLong(LanceArrowColumnVector.java:310)
       at org.apache.spark.sql.vectorized.ColumnarRow.getLong(ColumnarRow.java:116)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
       at org.apache.hudi.io.storage.LanceRecordIterator.next(LanceRecordIterator.java:162)
   ```
   
   ## Possible Hudi-side actions (to discuss)
   
   1. **Wait for upstream**: bump `lance-spark` once [lance-format/lance-spark#499](https://github.com/lance-format/lance-spark/issues/499) is fixed and released. Lowest effort, no Hudi changes.
   2. **Documentation**: add a "Known issues" note to the Lance integration docs so users can recognize the failure and know to project the full struct.
   3. **Workaround in `LanceRecordIterator`**: force full nested-struct projection when binding Lance vectors so partial pruning never reaches `LanceArrowColumnVector`. Higher effort and may regress read performance on wide structs; only worth it if upstream stalls.
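
To make the idea behind option 3 concrete, here is a minimal, pure-Python sketch of the widening step: any struct that is only partially selected gets expanded to all of its children. This is an illustration only, not Hudi or lance-spark code. The names `widen_nested_projection` and `REFERENCE_CHILDREN` are hypothetical, and the child list is taken from the working query above; an actual change would be written in Java inside `LanceRecordIterator`.

```python
from typing import Dict, List


def widen_nested_projection(
    requested: List[str], struct_children: Dict[str, List[str]]
) -> List[str]:
    """Expand any partially selected struct to all of its children.

    `requested` holds dotted column paths; `struct_children` maps a struct
    path to its full, ordered child list (hypothetical input for this sketch).
    """
    widened: List[str] = []
    seen = set()
    for path in requested:
        parent, _, _leaf = path.rpartition(".")
        if parent in struct_children:
            # Replace the single child with every child of its parent struct.
            expanded = [f"{parent}.{child}" for child in struct_children[parent]]
        else:
            # Top-level columns and unknown structs pass through unchanged.
            expanded = [path]
        for candidate in expanded:
            if candidate not in seen:
                seen.add(candidate)
                widened.append(candidate)
    return widened


# Children of the struct from the repro, in the order of the working query.
REFERENCE_CHILDREN = {
    "image_bytes.reference": ["external_path", "offset", "length", "managed"],
}

failing = ["image_bytes.reference.offset", "image_bytes.reference.length"]
widened = widen_nested_projection(failing, REFERENCE_CHILDREN)
query = "SELECT " + ", ".join(widened) + " FROM hudi_lance_table"
```

The same helper can be used on the user side today to turn the failing projection into the working one, at the cost of reading the two extra children.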
   
   ## Environment
   
   - Apache Hudi 1.2.0-rc1
   - Spark 3.5, Scala 2.12
   - `lance-spark-bundle-3.5_2.12` 0.4.0
   - macOS, JDK 11
   
   ## Notes
   
   Discovered while building demo assertions in 
`hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py`.
 The demo's `assert_descriptors()` step now projects all four reference 
children as a workaround.

