Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

via GitHub Fri, 03 Jul 2026 19:42:06 -0700


Copilot commented on code in PR #18918:
URL: https://github.com/apache/pinot/pull/18918#discussion_r3522434688



##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToPinotTypeConverter.java:
##########
@@ -59,6 +60,53 @@ public final class ArrowToPinotTypeConverter {
   private ArrowToPinotTypeConverter() {
   }
 
+  /// Returns the [PinotDataType] that the given Arrow `elementType` (a scalar 
element type, already unwrapped from any
+  /// list) can be read as through a type-specific `ColumnReader` accessor, or 
`null` when it has no dedicated accessor
+  /// and must be read through `ColumnReader.getValue()`. Decided from the 
logical Arrow type, not the physical vector:
+  /// logical `Bool`, temporal types (unless `extractRawTimeValues` surfaces a 
`Timestamp`'s raw epoch `long`), and
+  /// complex types have no dedicated accessor.
+  ///
+  /// @param elementType scalar Arrow type of the column (or of a list's 
elements)
+  /// @param singleValue `true` for a single-value column (scalar type), 
`false` for the `_ARRAY` variant
+  /// @param extractRawTimeValues when `true`, a `Timestamp` column is 
readable as `LONG` (its raw epoch value);
+  ///                             otherwise it has no dedicated accessor
+  @Nullable
+  public static PinotDataType toValueType(ArrowType elementType, boolean 
singleValue, boolean extractRawTimeValues) {
+    switch (elementType.getTypeID()) {
+      case Int:
+        switch (((ArrowType.Int) elementType).getBitWidth()) {
+          case 32:
+            return singleValue ? PinotDataType.INT : PinotDataType.INT_ARRAY;
+          case 64:
+            return singleValue ? PinotDataType.LONG : PinotDataType.LONG_ARRAY;
+          default:
+            return null;
+        }
+      case FloatingPoint:
+        switch (((ArrowType.FloatingPoint) elementType).getPrecision()) {
+          case SINGLE:
+            return singleValue ? PinotDataType.FLOAT : 
PinotDataType.FLOAT_ARRAY;
+          case DOUBLE:
+            return singleValue ? PinotDataType.DOUBLE : 
PinotDataType.DOUBLE_ARRAY;
+          default:
+            return null;
+        }
+      case Decimal:
+        return singleValue ? PinotDataType.BIG_DECIMAL : 
PinotDataType.BIG_DECIMAL_ARRAY;
+      case Utf8:
+      case LargeUtf8:
+        return singleValue ? PinotDataType.STRING : PinotDataType.STRING_ARRAY;
+      case Binary:
+      case LargeBinary:
+      case FixedSizeBinary:
+        return singleValue ? PinotDataType.BYTES : PinotDataType.BYTES_ARRAY;

Review Comment:
   `toValueType()` reports `STRING`/`BYTES` for 
`LargeUtf8`/`LargeBinary`/`FixedSizeBinary`, but the Arrow `ColumnReader` typed 
accessors in this module only handle `VarCharVector`/`VarBinaryVector` (not the 
Large*/FixedSize vector classes). This can violate the `getValueType()` 
contract and cause callers to pick `getString()`/`getBytes()` and then fail 
with a type mismatch. Consider returning `null` for these large/fixed logical 
types unless/until the typed accessors are extended to support them.



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/ColumnReader.java:
##########
@@ -22,301 +22,180 @@
 import java.io.IOException;
 import java.io.Serializable;
 import java.math.BigDecimal;
+import java.sql.Timestamp;
 import javax.annotation.Nullable;
-
-
-/**
- * The <code>ColumnReader</code> interface is used to read column data from 
various data sources
- * for columnar segment building. Unlike RecordReader which reads row-by-row, 
ColumnReader provides
- * column-wise access to data, enabling efficient columnar segment creation.
- *
- * <p>This interface provides 3 patterns optimised for different use cases
- * (Some implementations may not support all patterns):
- * <ul>
- *   <li>Sequential iteration over all values in a column using hasNext(), 
next() and rewind()</li>
- *   <li>Sequential iteration with type-specific methods (isInt(), isLong(), 
nextInt(), nextLong(), etc.)
- *        and null handling (isNextNull(), skipNext()) and supports hasNext() 
and rewind()</li>
- *   <li>Random access by document ID using getInt(docId), getLong(docId), 
etc. and isNull(docId) for null checks</li>
- * </ul>
- *
- * <p>Implementations should handle data type conversions and efficient 
column-wise data access patterns.
- *
- * <h2>Usage Patterns</h2>
- *
- * <p>There are three primary patterns for reading data from a ColumnReader:
- *
- * <h3>Pattern 1: Sequential Iteration with Generic next() and Null Checks</h3>
- * <p>This pattern uses the generic {@link #next()} method which returns 
Object and may return null.
- * Suitable when you need to handle arbitrary data types or when null handling 
is done on the return value.
- *
- * <pre>{@code
- * // Read all values in the column
- * while (columnReader.hasNext()) {
- *   Object value;
- *   try {
- *    value = columnReader.next();
- *   } catch (Exception e) {
- *    // Handle exception / log
- *    continue;
- *    }
- *   if (value != null) {
- *     // Process non-null value
- *     processValue(value);
- *   } else {
- *     // Handle null value
- *     handleNullValue();
- *   }
- * }
- *
- * // Rewind to read the column again
- * columnReader.rewind();
- *
- * // Second pass through the data
- * while (columnReader.hasNext()) {
- *   Object value = columnReader.next();
- *   if (value != null) {
- *     processValueAgain(value);
- *   }
- * }
- * }</pre>
- *
- * <h3>Pattern 2: Sequential Iteration with Type-Specific Methods and Explicit 
Null Checks</h3>
- * <p>This pattern uses {@link #isNextNull()} to check for nulls before 
calling type-specific methods
- * like {@link #nextInt()}, {@link #nextLong()}, etc. Use {@link #skipNext()} 
to advance past null values.
- * This is the preferred pattern when you know the column data type and want 
to avoid boxing overhead.
- * Before using this pattern, check the column data type using methods like 
{@link #isInt()}, {@link #isLong()}, etc.
- * If the data type does not match, fall back to Pattern 1 with {@link 
#next()}.
- * <pre>{@code
- * // Read all int values in the column, handling nulls
- * if (columnReader.isInt()) {
- *  while (columnReader.hasNext()) {
- *     if (columnReader.isNextNull()) {
- *       // Skip the null value
- *      columnReader.skipNext();
- *      handleNullValue();
- *     } else {
- *      // Read the primitive int value (no boxing)
- *      int value = columnReader.nextInt();
- *      processIntValue(value);
- *    }
- *  }
- * } else {
- *  // Fallback to Pattern 1 if not INT type
- * }
- *
- * }</pre>
- *
- * <h3>Pattern 3: Random Access by Document ID</h3>
- * <p>This pattern uses {@link #getTotalDocs()} to get the total number of 
documents, then uses
- * document ID-based accessors like {@link #getInt(int)}, {@link 
#getLong(int)}, etc. to read
- * specific values. Use {@link #isNull(int)} to check if a value is null 
before reading.
- * This pattern is useful when you need random access or want to process 
documents in a specific order.
- *
- * <pre>{@code
- *
- * // Random access example - read specific document IDs
- * int[] docIdsToRead = {5, 10, 15, 20};
- * for (int docId : docIdsToRead) {
- *   if (!columnReader.isNull(docId)) {
- *     int value = columnReader.getInt(docId);
- *     processSpecificDoc(docId, value);
- *   }
- * }
- *
- * // Read in reverse order
- * // Get the total number of documents
- * int totalDocs = columnReader.getTotalDocs();
- * for (int docId = totalDocs - 1; docId >= 0; docId--) {
- *   if (!columnReader.isNull(docId)) {
- *     int value = columnReader.getInt(docId);
- *     processReverseOrder(docId, value);
- *   }
- * }
- * }</pre>
- *
- * <h3>Choosing the Right Pattern</h3>
- * <ul>
- *   <li><b>Pattern 1</b>: Use when dealing with generic Object types or when 
you don't know
- *       the column type at compile time. Less efficient due to boxing.</li>
- *   <li><b>Pattern 2</b>: Use when you know the column type and want 
efficient sequential iteration
- *       with primitive types. Preferred for most columnar segment building 
scenarios.</li>
- *   <li><b>Pattern 3</b>: Use when you need random access, want to process 
documents in a specific
- *       order, or need to access the same document multiple times.</li>
- * </ul>
- *
- */
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.BooleanUtils;
+import org.apache.pinot.spi.utils.PinotDataType;
+
+
+/// Reads the values of a single column by document ID, for columnar segment 
building. Where a `RecordReader` exposes a
+/// data source row-by-row, a `ColumnReader` exposes one column at a time: a 
caller reads every value of a column before
+/// moving to the next one. Building a segment one fully-materialized column 
at a time is cheaper than transposing row
+/// records when the source is already columnar (e.g. Arrow, Parquet).
+///
+/// Documents are addressed by a 0-based id, from 0 (inclusive) to 
[#getTotalDocs()] (exclusive). Two access patterns
+/// are supported:
+/// - **Random access** — [#getValue(int)] and the typed accessors may be 
called for any document id, in any order.
+/// - **Repeated sequential traversal** — the full range may be traversed by 
ascending document id more than once. The
+///   reader starts positioned at document 0; call [#rewind()] to reset it 
before each subsequent traversal. The
+///   segment build uses this: a statistics pass, then an index-writing pass 
over the same reader.
+///
+/// To read a column:
+/// 1. Call [#getValueType()] once. A non-`null` result names the type that 
can be read directly through the matching
+///    type-specific accessor; `null` means the column must be read through 
the boxed [#getValue(int)].
+/// 2. For each document id, call [#isNull(int)] first — type-specific 
accessors cannot represent null and must not be
+///    called for a null value.
+/// 3. Read each non-null value with the accessor chosen in step 1, e.g. 
[#getInt(int)] / [#getIntMV(int)] when
+///    [#getValueType()] is [PinotDataType#INT] / [PinotDataType#INT_ARRAY], 
or [#getValue(int)] otherwise.
+///
+/// Implementations perform any conversion between the source encoding and the 
returned Pinot type, are not required to
+/// be thread-safe, and must be released with [#close()] when no longer needed.
+///
+/// ```
+/// PinotDataType valueType = columnReader.getValueType();
+/// for (int docId = 0; docId < columnReader.getTotalDocs(); docId++) {
+///   if (columnReader.isNull(docId)) {
+///     continue;
+///   }
+///   if (valueType == PinotDataType.INT) {
+///     consumeInt(columnReader.getInt(docId));
+///   } else {
+///     consumeObject(columnReader.getValue(docId));
+///   }
+/// }
+/// ```
 public interface ColumnReader extends Closeable, Serializable {
 
-  /**
-   * Return <code>true</code> if more values remain to be read in this column.
-   * <p>This method should not throw exception. Caller is not responsible for 
handling exceptions from this method.
-   */
-  boolean hasNext();
+  /// Returns the name of the column read by this reader.
+  String getColumnName();
 
-  /**
-   * Get the next value in the column.
-   * <p>This method should be called only if {@link #hasNext()} returns 
<code>true</code>. Caller is responsible for
-   * handling exceptions from this method and skip the value if user wants to 
continue reading the remaining values.
-   *
-   * @return Next column value, or null if the value is null
-   * @throws IOException If an I/O error occurs while reading
-   */
+  /// Returns the [PinotDataType] that [#getValue(int)] produces for this 
column, when that type has a matching typed
+  /// accessor; otherwise `null`.
+  ///
+  /// A non-`null` result `T` therefore names an accessor that returns exactly 
what [#getValue(int)] returns, only
+  /// unboxed: for every document id where [#isNull(int)] is `false`, the 
accessor matching `T` — e.g. [#getInt(int)]
+  /// for [PinotDataType#INT], [#getIntMV(int)] for [PinotDataType#INT_ARRAY] 
— returns the same value as
+  /// [#getValue(int)]. `null` means [#getValue(int)]'s type has no typed 
accessor; read the column through
+  /// [#getValue(int)].
+  ///
+  /// The result is a constant property of the column (independent of the 
document id) and, when non-`null`, is one of
+  /// the accessor-backed types — [PinotDataType#INT], [PinotDataType#LONG], 
[PinotDataType#FLOAT],
+  /// [PinotDataType#DOUBLE], [PinotDataType#BIG_DECIMAL], 
[PinotDataType#STRING], [PinotDataType#BYTES] — or their
+  /// `_ARRAY` variants for a multi-value column.
+  ///
+  /// Because [#getValue(int)] returns the column's logical type, a logical 
type whose object is not itself an
+  /// accessor-backed type returns `null`: `BOOLEAN` (a `Boolean`), 
`TIMESTAMP` (a `Timestamp`), and the complex types.
+  /// A `JSON` column, stored and read as its text `String`, reports 
[PinotDataType#STRING]. Implementations keyed on
+  /// the stored [DataType] can use [#toValueType(DataType, boolean)].

Review Comment:
   The `getValueType()` contract doc currently states that `TIMESTAMP` always 
returns `null` because `getValue()` returns a `Timestamp`, but Arrow 
implementations can surface a TIMESTAMP column as raw epoch `LONG`/`LONG_ARRAY` 
when `extractRawTimeValues` is enabled (see ArrowToPinotTypeConverter + 
ArrowColumnReaderTest). This is a documentation/contract mismatch that can 
confuse implementers and callers.



##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToPinotTypeConverter.java:
##########
@@ -59,6 +60,53 @@ public final class ArrowToPinotTypeConverter {
   private ArrowToPinotTypeConverter() {
   }
 
+  /// Returns the [PinotDataType] that the given Arrow `elementType` (a scalar 
element type, already unwrapped from any
+  /// list) can be read as through a type-specific `ColumnReader` accessor, or 
`null` when it has no dedicated accessor
+  /// and must be read through `ColumnReader.getValue()`. Decided from the 
logical Arrow type, not the physical vector:
+  /// logical `Bool`, temporal types (unless `extractRawTimeValues` surfaces a 
`Timestamp`'s raw epoch `long`), and
+  /// complex types have no dedicated accessor.
+  ///
+  /// @param elementType scalar Arrow type of the column (or of a list's 
elements)
+  /// @param singleValue `true` for a single-value column (scalar type), 
`false` for the `_ARRAY` variant
+  /// @param extractRawTimeValues when `true`, a `Timestamp` column is 
readable as `LONG` (its raw epoch value);
+  ///                             otherwise it has no dedicated accessor
+  @Nullable
+  public static PinotDataType toValueType(ArrowType elementType, boolean 
singleValue, boolean extractRawTimeValues) {
+    switch (elementType.getTypeID()) {
+      case Int:
+        switch (((ArrowType.Int) elementType).getBitWidth()) {
+          case 32:
+            return singleValue ? PinotDataType.INT : PinotDataType.INT_ARRAY;
+          case 64:
+            return singleValue ? PinotDataType.LONG : PinotDataType.LONG_ARRAY;

Review Comment:
   `toValueType()` currently returns `INT`/`LONG` for any Arrow `Int(32|64)` 
regardless of signedness. For unsigned int vectors (e.g. `UInt4Vector`), Arrow 
often surfaces values as a wider boxed type (e.g. `Long`) via `getObject()`, 
and the typed accessors here don’t support unsigned vectors. Reporting 
`INT`/`LONG` can therefore violate the ColumnReader contract (typed accessor 
must match `getValue()`), and can lead callers to invoke `getInt()`/`getLong()` 
and hit a type-mismatch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

Reply via email to