rahil-c opened a new issue, #18727:
URL: https://github.com/apache/hudi/issues/18727

   ## TL;DR
   
   `SELECT COUNT(*) FROM <lance-backed hudi table>` fails with:
   
   ```
   Lance batch column count 14 does not match expected Spark schema size 0
     for file: .../category=Abyssinian/....lance
     at org.apache.hudi.io.storage.LanceRecordIterator.hasNext(LanceRecordIterator.java:124)
   ```
   
   Any query shape that triggers Spark's "no columns needed, just count rows" optimization (`COUNT(*)`, `EXISTS`, `CREATE TABLE AS SELECT 1 FROM ...`) fails on a Lance-backed Hudi table. Parquet-backed tables work fine.
   
   ## Why it happens
   
   
   [`LanceRecordIterator.java:122-127`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java#L122-L127) has a strict equality check when building `ColumnVector[]`:
   
   ```java
   StructField[] sparkFields = sparkSchema.fields();
   if (sparkFields.length != fieldVectors.size()) {
     throw new HoodieException("Lance batch column count " + fieldVectors.size()
          + " does not match expected Spark schema size " + sparkFields.length + ...);
   }
   ```
   
   When Spark's optimizer prunes all columns for a read that only needs row counts (`COUNT(*)`, `EXISTS`), the request arrives with `sparkSchema.fields().length == 0`, but the Lance file's batch always has the full column set. The reader sees `0 != 14` and throws.
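   
   The failure mode can be reproduced in isolation. What follows is a minimal sketch, not the Hudi code: `StrictColumnCheckDemo` and `checkColumnCount` are hypothetical names, plain `int` counts stand in for `sparkSchema.fields().length` and `fieldVectors.size()`, and `IllegalStateException` stands in for `HoodieException` so the snippet has no dependencies.
   
   ```java
   public class StrictColumnCheckDemo {

     // Hypothetical helper mirroring the shape of the strict equality check:
     // the requested field count must equal the batch's column count.
     public static void checkColumnCount(int requestedFields, int batchColumns) {
       if (requestedFields != batchColumns) {
         // IllegalStateException is a stand-in for HoodieException.
         throw new IllegalStateException("Lance batch column count " + batchColumns
             + " does not match expected Spark schema size " + requestedFields);
       }
     }

     public static void main(String[] args) {
       checkColumnCount(14, 14); // counts agree: passes
       try {
         checkColumnCount(0, 14); // all columns pruned, as for COUNT(*): throws
       } catch (IllegalStateException e) {
         System.out.println(e.getMessage());
       }
     }
   }
   ```
   
   The second call fails with the same message text as the report above (minus the file suffix), which shows that the check has no way to express "zero columns requested on purpose".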
   
   The Parquet reader handles this naturally: `ParquetFileFormat` has a zero-column fast path where it just yields N empty rows (where N is the row count) so the aggregate can count them without reading any data. Lance needs the equivalent.
   
   ## Workaround
   
   Use `COUNT(<named_col>)` instead of `COUNT(*)`. On a non-null primary key the two are semantically equivalent (`COUNT(<named_col>)` only skips NULLs), but the former forces Spark to request one column, which satisfies the check.
   
   ## Proposed fix
   
   In `LanceRecordIterator.hasNext()`:
   - If `sparkSchema.fields().length == 0`, skip the `ColumnVector[]` build entirely.
   - Still call `arrowReader.loadNextBatch()` to advance, and yield empty rows matching the Arrow `VectorSchemaRoot.getRowCount()` so downstream count aggregators work.
   - Add a test in [`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala) exercising `spark.sql("SELECT COUNT(*) FROM …")` over a Lance-backed table and `df.count()` on the same.
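   
   The first two bullets can be sketched without Spark or Arrow on the classpath. This is an illustration under stated assumptions, not a patch: `ZeroColumnFastPath`, `BatchSource`, `EmptyRowIterator`, and `fixedBatches` are hypothetical names, `BatchSource` stands in for the Arrow reader and its `VectorSchemaRoot`, and `Object[0]` stands in for whatever empty row type the real iterator would return.
   
   ```java
   import java.util.Iterator;
   import java.util.NoSuchElementException;

   public class ZeroColumnFastPath {

     /** Hypothetical stand-in for the Arrow reader plus its VectorSchemaRoot. */
     public interface BatchSource {
       boolean loadNextBatch(); // advance to the next batch; false once exhausted
       int getRowCount();       // row count of the batch loaded most recently
     }

     /** Yields one empty row per stored row without touching any column data. */
     public static final class EmptyRowIterator implements Iterator<Object[]> {
       private static final Object[] EMPTY_ROW = new Object[0];
       private final BatchSource source;
       private int remainingInBatch = 0;

       public EmptyRowIterator(BatchSource source) {
         this.source = source;
       }

       @Override
       public boolean hasNext() {
         // Loop rather than branch once so that zero-row batches are skipped.
         while (remainingInBatch == 0) {
           if (!source.loadNextBatch()) {
             return false;
           }
           remainingInBatch = source.getRowCount();
         }
         return true;
       }

       @Override
       public Object[] next() {
         if (!hasNext()) {
           throw new NoSuchElementException();
         }
         remainingInBatch--;
         return EMPTY_ROW;
       }
     }

     /** Test double (hypothetical): replays a fixed sequence of batch sizes. */
     public static BatchSource fixedBatches(int... rowCounts) {
       return new BatchSource() {
         private int index = -1;

         @Override
         public boolean loadNextBatch() {
           if (index + 1 >= rowCounts.length) {
             return false;
           }
           index++;
           return true;
         }

         @Override
         public int getRowCount() {
           return rowCounts[index];
         }
       };
     }

     public static void main(String[] args) {
       Iterator<Object[]> rows = new EmptyRowIterator(fixedBatches(3, 0, 2));
       long count = 0;
       while (rows.hasNext()) {
         rows.next();
         count++;
       }
       System.out.println(count); // 5 = 3 + 0 + 2
     }
   }
   ```
   
   The sketch counts rows per batch rather than summing row counts up front, so it keeps the one-row-at-a-time iterator contract that a downstream count aggregator would consume.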
   
   ## Related code paths
   
   - [`LanceRecordIterator.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java)
   - [`HoodieSparkLanceReader.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java)
   - [`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala)
   
   ## Environment
   
   - Hudi `master` @ commit `4d0e9cd47f9e`
   - Spark datasource path with Lance-backed base files

