Re: [PR] feat(lance): Implement columnar batch reading for Lance (COW only) [hudi]

via GitHub Mon, 04 May 2026 01:33:09 -0700


hudi-agent commented on code in PR #18403:
URL: https://github.com/apache/hudi/pull/18403#discussion_r3180280150



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceBatchIterator.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.lance.file.LanceFileReader;
+import org.lance.spark.vectorized.LanceArrowColumnVector;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.Objects;
+
+/**
+ * Iterator that returns {@link ColumnarBatch} directly from Lance files without
+ * decomposing to individual rows. Used for vectorized/columnar batch reading
+ * in Spark's COW base-file-only read path.
+ *
+ * <p>Unlike {@link LanceRecordIterator} which extracts rows one by one,
+ * this iterator preserves the columnar format for zero-copy batch processing.
+ *
+ * <p>Manages the lifecycle of:
+ * <ul>
+ *   <li>BufferAllocator - Arrow memory management</li>
+ *   <li>LanceFileReader - Lance file handle</li>
+ *   <li>ArrowReader - Arrow batch reader</li>
+ * </ul>
+ */
+public class LanceBatchIterator implements Iterator<ColumnarBatch>, Closeable {
+  private final BufferAllocator allocator;
+  private final LanceFileReader lanceReader;
+  private final ArrowReader arrowReader;
+  private final String path;
+
+  private ColumnVector[] columnVectors;
+  private ColumnarBatch currentBatch;
+  private boolean nextBatchLoaded = false;
+  private boolean finished = false;
+  private boolean closed = false;
+
+  /**
+   * Creates a new Lance batch iterator.
+   *
+   * @param allocator   Arrow buffer allocator for memory management
+   * @param lanceReader Lance file reader
+   * @param arrowReader Arrow reader for batch reading
+   * @param path        File path (for error messages)
+   */
+  public LanceBatchIterator(BufferAllocator allocator,
+                            LanceFileReader lanceReader,
+                            ArrowReader arrowReader,
+                            String path) {
+    this.allocator = Objects.requireNonNull(allocator, "allocator must not be null");
+    this.lanceReader = Objects.requireNonNull(lanceReader, "lanceReader must not be null");
+    this.arrowReader = Objects.requireNonNull(arrowReader, "arrowReader must not be null");
+    this.path = path;
+  }
+
+  @Override
+  public boolean hasNext() {
+    if (finished) {
+      return false;
+    }
+    if (nextBatchLoaded) {
+      return true;
+    }
+
+    try {
+      if (arrowReader.loadNextBatch()) {
+        VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
+
+        // Create column vector wrappers once and reuse across batches
+        // (ArrowReader reuses the same VectorSchemaRoot)
+        if (columnVectors == null) {
+          columnVectors = root.getFieldVectors().stream()
+              .map(LanceArrowColumnVector::new)
+              .toArray(ColumnVector[]::new);
+        }
+
+        currentBatch = new ColumnarBatch(columnVectors, root.getRowCount());
+        nextBatchLoaded = true;
+        return true;
+      }
+    } catch (IOException e) {
+      throw new HoodieException("Failed to read next batch from Lance file: " + path, e);
+    }
+
+    finished = true;
+    return false;
+  }
+
+  @Override
+  public ColumnarBatch next() {
+    if (!hasNext()) {
+      throw new NoSuchElementException("No more batches available");
+    }
+    nextBatchLoaded = false;
+    return currentBatch;
+  }
+
+  @Override
+  public void close() {
+    if (closed) {
+      return;
+    }
+    closed = true;
+
+    IOException arrowException = null;
+    Exception lanceException = null;
+    Exception allocatorException = null;
+
+    // Don't close currentBatch here: ColumnarBatch.close() would close the
+    // underlying Arrow FieldVectors through LanceArrowColumnVector, but they
+    // are owned by the ArrowReader (via VectorSchemaRoot) and will be closed
+    // when arrowReader.close() is called below.
+    currentBatch = null;
+
+    try {
+      arrowReader.close();
+    } catch (IOException e) {

Review Comment:
   🤖 This block catches only `IOException`, while the `lanceReader` and `allocator` 
blocks below catch the broader `Exception`. If `arrowReader.close()` throws a 
`RuntimeException` (e.g., from Arrow allocator state such as unclosed children), 
it propagates out of `close()` and skips both `lanceReader.close()` and 
`allocator.close()`, leaking them. Was the narrower catch intentional, or is it 
worth widening to `Exception` for symmetry?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
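
A minimal sketch of the widened catch suggested above, written against plain `AutoCloseable` rather than the real Arrow and Lance classes. The class name `CloseAllSketch` and the helper `closeAll` are hypothetical and not part of the PR; the point is only that catching `Exception` around every `close()` lets the later resources close even when the first one throws an unchecked exception.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseAllSketch {

  /**
   * Closes every resource in the order given. A failure in one close()
   * never skips the remaining ones; the first failure is rethrown at the
   * end with any later failures attached as suppressed exceptions.
   */
  static void closeAll(AutoCloseable... resources) {
    RuntimeException failure = null;
    for (AutoCloseable resource : resources) {
      try {
        resource.close();
      } catch (Exception e) { // broad on purpose: also covers RuntimeException
        if (failure == null) {
          failure = new RuntimeException("Failed to close resources", e);
        } else {
          failure.addSuppressed(e);
        }
      }
    }
    if (failure != null) {
      throw failure;
    }
  }

  public static void main(String[] args) {
    List<String> closed = new ArrayList<>();
    // Stand-ins for arrowReader, lanceReader and allocator (assumed names).
    AutoCloseable arrowReader = () -> {
      closed.add("arrowReader");
      throw new IllegalStateException("unclosed children");
    };
    AutoCloseable lanceReader = () -> closed.add("lanceReader");
    AutoCloseable allocator = () -> closed.add("allocator");

    try {
      closeAll(arrowReader, lanceReader, allocator);
    } catch (RuntimeException e) {
      // All three close() calls ran even though the first one threw.
      System.out.println(closed + " after: " + e.getCause().getMessage());
    }
  }
}
```

In the PR's `close()` the same effect would likely come from changing the first `catch (IOException e)` to `catch (Exception e)`; whether the existing exception-aggregation code after the three blocks needs adjusting is not visible in this excerpt.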



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########


Review Comment:
   🤖 **Line 556:** For partition types not in the match (complex types like 
Struct/Array/Map, plus atomic types like CharType/VarcharType/interval types), 
this silently substitutes null. The row path preserves the value via 
`JoinedRow`, so this is a silent regression in batch mode for those types. 
Worth either throwing on unsupported types or filtering them in `supportBatch` 
(cf. Parquet's `schema.forall(_.dataType.isInstanceOf[AtomicType])` gate)?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
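
The two options raised above (gate in `supportBatch`, or throw on an unsupported type) can be sketched as follows. This is an illustration only: `TypeKind` is a hypothetical stand-in for Spark's `DataType` hierarchy, the set of supported kinds is an assumption rather than the PR's actual match list, and none of the names below exist in Hudi or Spark.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class BatchGateSketch {

  // Hypothetical stand-in for Spark's DataType hierarchy; not a real Spark API.
  enum TypeKind { BOOLEAN, INT, LONG, DOUBLE, STRING, CHAR, VARCHAR, INTERVAL, STRUCT, ARRAY, MAP }

  // Assumed set of kinds the batch path can materialize as a constant partition column.
  static final Set<TypeKind> BATCH_SUPPORTED =
      EnumSet.of(TypeKind.BOOLEAN, TypeKind.INT, TypeKind.LONG, TypeKind.DOUBLE, TypeKind.STRING);

  /** Option 1: only enable batch mode when every partition type is supported. */
  static boolean supportsBatch(List<TypeKind> partitionTypes) {
    return BATCH_SUPPORTED.containsAll(partitionTypes);
  }

  /** Option 2: fail loudly instead of silently substituting null. */
  static void requireSupported(TypeKind kind) {
    if (!BATCH_SUPPORTED.contains(kind)) {
      throw new UnsupportedOperationException(
          "Partition type not supported in batch mode: " + kind);
    }
  }

  public static void main(String[] args) {
    List<TypeKind> simple = Arrays.asList(TypeKind.INT, TypeKind.STRING);
    List<TypeKind> complex = Arrays.asList(TypeKind.INT, TypeKind.STRUCT);
    System.out.println("simple partitions use batch mode: " + supportsBatch(simple));
    System.out.println("complex partitions use batch mode: " + supportsBatch(complex));
  }
}
```

With the gate in place, a table partitioned by an unsupported type would fall back to the row path, which the comment above notes already preserves the value via `JoinedRow`.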



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceBatchIterator.java:
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+import org.lance.file.LanceFileReader;
+import org.lance.spark.vectorized.LanceArrowColumnVector;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.Objects;
+
+/**
+ * Iterator that returns {@link ColumnarBatch} directly from Lance files without
+ * decomposing to individual rows. Used for vectorized/columnar batch reading
+ * in Spark's COW base-file-only read path.
+ *
+ * <p>Unlike {@link LanceRecordIterator} which extracts rows one by one,
+ * this iterator preserves the columnar format for zero-copy batch processing.
+ *
+ * <p>Manages the lifecycle of:
+ * <ul>
+ *   <li>BufferAllocator - Arrow memory management</li>
+ *   <li>LanceFileReader - Lance file handle</li>
+ *   <li>ArrowReader - Arrow batch reader</li>
+ * </ul>
+ */
+public class LanceBatchIterator implements Iterator<ColumnarBatch>, Closeable {
+  private final BufferAllocator allocator;
+  private final LanceFileReader lanceReader;
+  private final ArrowReader arrowReader;
+  private final String path;
+
+  private ColumnVector[] columnVectors;
+  private ColumnarBatch currentBatch;
+  private boolean nextBatchLoaded = false;

Review Comment:
   🤖 nit: could you rename `nextBatchLoaded` to `hasCachedBatch`? The current 
name describes the mechanism (a batch was loaded from the reader), but what 
readers of `hasNext()` really need to know is whether a batch is already 
available to return — `if (hasCachedBatch) return true` reads much more 
naturally than `if (nextBatchLoaded) return true`.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -23,37 +23,47 @@ import org.apache.hudi.SparkAdapterSupport.sparkAdapter
 import org.apache.hudi.common.config.{HoodieReaderConfig, HoodieStorageConfig}
 import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaType}
 import org.apache.hudi.common.util
-import org.apache.hudi.common.util.collection.ClosableIterator
+import org.apache.hudi.common.util.collection.{ClosableIterator, Pair => HoodiePair}
 import org.apache.hudi.internal.schema.InternalSchema
 import org.apache.hudi.io.memory.HoodieArrowAllocator
-import org.apache.hudi.io.storage.{BlobDescriptorTransform, LanceRecordIterator, VectorConversionUtils}
+import org.apache.hudi.io.storage.{BlobDescriptorTransform, LanceBatchIterator, LanceRecordIterator, VectorConversionUtils}
 import org.apache.hudi.storage.StorageConfiguration
 
+import org.apache.arrow.memory.BufferAllocator
+import org.apache.arrow.vector.FieldVector
+import org.apache.arrow.vector.ipc.ArrowReader
 import org.apache.hadoop.conf.Configuration
 import org.apache.parquet.schema.MessageType
 import org.apache.spark.TaskContext
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, JoinedRow, UnsafeProjection, UnsafeRow}
 import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
 import org.apache.spark.sql.execution.datasources.{PartitionedFile, SparkColumnarFileReader, SparkSchemaTransformUtils}
+import org.apache.spark.sql.execution.vectorized.{OnHeapColumnVector, WritableColumnVector}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.Filter
-import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType}
+import org.apache.spark.sql.types._
 import org.apache.spark.sql.util.LanceArrowUtils
+import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
 import org.lance.file.{BlobReadMode, FileReadOptions, LanceFileReader}
+import org.lance.spark.vectorized.LanceArrowColumnVector
 
-import java.io.IOException
+import java.io.{Closeable, IOException}
 
 import scala.collection.JavaConverters._
 
 /**
  * Reader for Lance files in Spark datasource.
- * Implements vectorized reading using LanceArrowColumnVector.
+ * Supports both row-based and columnar batch reading modes.
  *
- * @param enableVectorizedReader whether to use vectorized reading (currently always true for Lance)
+ * @param enableVectorizedReader when true, returns ColumnarBatch for vectorized processing;
+ *                               when false, returns InternalRow one by one
  */
 class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumnarFileReader {
 
+  /** Holds a pre-created all-null Arrow vector for a column missing from the file (schema evolution). */
+  private case class NullColumnEntry(colIndex: Int, columnVector: LanceArrowColumnVector, arrowVector: FieldVector)
+
   // Batch size for reading Lance files (number of rows per batch)

Review Comment:
   🤖 nit: the `columnVector` field name is a bit generic alongside 
`arrowVector` — both are "vectors". Something like `sparkVector` (or 
`lanceColumnVector`) would make the Spark-facing wrapper vs. the raw Arrow 
backing vector distinction explicit without having to look at the types.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]