Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

via GitHub Fri, 03 Jul 2026 13:25:22 -0700


Jackie-Jiang commented on code in PR #18918:
URL: https://github.com/apache/pinot/pull/18918#discussion_r3521745901



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderImplTest.java:
##########
@@ -185,30 +167,30 @@ private Schema getSchema(IndexType indexType) {
     }
   }
 
-  /**
-   * Data provider for single-value accessor methods.
-   * @return test parameters: column name, random access getter, sequential 
getter,
-   * value converter function (required for float), index type
-   */
+  /// Data provider for single-value accessor methods.
+  ///
+  /// @return test parameters: column name, random access getter, value 
converter function (required for float), index
+  ///     type
   @DataProvider(name = "singleValueAccessorProvider")
   public Object[][] singleValueAccessorProvider() {
     // Base column configurations
-    Object[][] baseConfigs = new Object[][] {
-        {INT_COL_1, (SingleValueGetter) PinotSegmentColumnReaderImpl::getInt,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextInt, null},
-        {LONG_COL, (SingleValueGetter) PinotSegmentColumnReaderImpl::getLong,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextLong, null},
-        {FLOAT_COL, (SingleValueGetter) PinotSegmentColumnReaderImpl::getFloat,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextFloat,
-            (Function<Object, Object>) val -> val instanceof Double ? 
((Double) val).floatValue() : val},
-        {DOUBLE_COL, (SingleValueGetter) 
PinotSegmentColumnReaderImpl::getDouble,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextDouble, null},
-        {BIG_DECIMAL_COL, (SingleValueGetter) 
PinotSegmentColumnReaderImpl::getBigDecimal,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextBigDecimal, null},
-        {STRING_COL_1, (SingleValueGetter) 
PinotSegmentColumnReaderImpl::getString,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextString, null},
-        {BYTES_COL, (SingleValueGetter) PinotSegmentColumnReaderImpl::getBytes,
-            (SingleValueSequentialGetter) 
PinotSegmentColumnReaderImpl::nextBytes, null},
+    Object[][] baseConfigs = new Object[][]{
+        {INT_COL_1, (SingleValueGetter) PinotSegmentColumnReaderImpl::getInt, 
null}, {
+        LONG_COL,
+        (SingleValueGetter) PinotSegmentColumnReaderImpl::getLong, null
+    }, {
+        FLOAT_COL, (SingleValueGetter) PinotSegmentColumnReaderImpl::getFloat, 
(Function<Object, Object>) val ->
+        val instanceof Double ? ((Double) val).floatValue() : val
+    }, {DOUBLE_COL, (SingleValueGetter) 
PinotSegmentColumnReaderImpl::getDouble, null}, {
+        BIG_DECIMAL_COL,
+        (SingleValueGetter) PinotSegmentColumnReaderImpl::getBigDecimal, null
+    }, {
+        STRING_COL_1,
+        (SingleValueGetter) PinotSegmentColumnReaderImpl::getString, null
+    }, {
+        BYTES_COL,
+        (SingleValueGetter) PinotSegmentColumnReaderImpl::getBytes, null
+    },
     };

Review Comment:
   Reformatting deferred to `spotless:apply` (pre-commit). The multi-line rows 
only carry per-row converter/getter lambdas; the simple rows stay one-line, at 
consistent 12-space indentation. Not hand-editing to avoid diverging from 
spotless's canonical output.



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/readers/PinotSegmentColumnReaderImplTest.java:
##########
@@ -273,166 +242,87 @@ public void testSingleValueAccessors(String columnName, 
SingleValueGetter random
         if (expectedValue == null) {
           // Null values are replaced with default during segment creation 
(non-nullable)
           if (actualValue instanceof byte[]) {
-            Assert.assertEquals(actualValue, defaultValue, "Null should be 
replaced with default at docId " + docId);
+            assertEquals(actualValue, defaultValue, "Null should be replaced 
with default at docId " + docId);
           } else if (actualValue instanceof Double) {
-            Assert.assertEquals((Double) actualValue, (Double) defaultValue, 
0.0001,
+            assertEquals((Double) actualValue, (Double) defaultValue, 0.0001,
                 "Null should be replaced with default at docId " + docId);
           } else if (actualValue instanceof Float) {
-            Assert.assertEquals((Float) actualValue, (Float) defaultValue, 
0.0001f,
+            assertEquals((Float) actualValue, (Float) defaultValue, 0.0001f,
                 "Null should be replaced with default at docId " + docId);
           } else if (actualValue instanceof BigDecimal) {
-            Assert.assertEquals(((BigDecimal) 
actualValue).compareTo((BigDecimal) defaultValue), 0,
+            assertEquals(((BigDecimal) actualValue).compareTo((BigDecimal) 
defaultValue), 0,
                 "Null should be replaced with default at docId " + docId);
           } else {
-            Assert.assertEquals(actualValue, defaultValue, "Null should be 
replaced with default at docId " + docId);
+            assertEquals(actualValue, defaultValue, "Null should be replaced 
with default at docId " + docId);
           }
         } else {
           // Convert expected value if converter is provided
           Object convertedExpectedValue = valueConverter != null ? 
valueConverter.apply(expectedValue) : expectedValue;
           if (actualValue instanceof Double) {
-            Assert.assertEquals((Double) actualValue, (Double) 
convertedExpectedValue, 0.0001,
+            assertEquals((Double) actualValue, (Double) 
convertedExpectedValue, 0.0001,
                 "Value mismatch at docId " + docId);
           } else if (actualValue instanceof Float) {
-            Assert.assertEquals((Float) actualValue, (Float) 
convertedExpectedValue, 0.0001f,
+            assertEquals((Float) actualValue, (Float) convertedExpectedValue, 
0.0001f,
                 "Value mismatch at docId " + docId);
           } else if (actualValue instanceof BigDecimal) {
-            Assert.assertEquals(((BigDecimal) 
actualValue).compareTo((BigDecimal) convertedExpectedValue), 0,
+            assertEquals(((BigDecimal) actualValue).compareTo((BigDecimal) 
convertedExpectedValue), 0,
                 "Value mismatch at docId " + docId);
           } else {
-            Assert.assertEquals(actualValue, convertedExpectedValue, "Value 
mismatch at docId " + docId);
-          }
-        }
-      }
-
-      // Test sequential iteration with type-specific methods (Pattern 2) - 
only for columns with sequential getters
-      if (sequentialGetter != null) {
-        reader.rewind();
-        int docId = 0;
-        while (reader.hasNext()) {
-          GenericRow expectedRow = _testData.get(docId);
-          Object expectedValue = expectedRow.getValue(columnName);
-
-          if (indexType == IndexType.NULLABLE && expectedValue == null) {
-            Assert.assertTrue(reader.isNextNull(),
-                "isNextNull() should return true for null value at docId " + 
docId);
-            reader.skipNext();
-          } else if (reader.isNextNull()) {
-            reader.skipNext();
-            // For non-nullable segments, null is replaced with default
-            Assert.assertNull(expectedValue, "Expected null value at docId " + 
docId);
-          } else {
-            Object actualValue = sequentialGetter.get(reader);
-
-            if (expectedValue == null) {
-              // Null values are replaced with default during segment creation 
(non-nullable)
-              if (actualValue instanceof byte[]) {
-                Assert.assertEquals(actualValue, defaultValue,
-                    "Null should be replaced with default at docId " + docId + 
" (sequential)");
-              } else if (actualValue instanceof Double) {
-                Assert.assertEquals((Double) actualValue, (Double) 
defaultValue, 0.0001,
-                    "Null should be replaced with default at docId " + docId + 
" (sequential)");
-              } else if (actualValue instanceof Float) {
-                Assert.assertEquals((Float) actualValue, (Float) defaultValue, 
0.0001f,
-                    "Null should be replaced with default at docId " + docId + 
" (sequential)");
-              } else if (actualValue instanceof BigDecimal) {
-                Assert.assertEquals(((BigDecimal) 
actualValue).compareTo((BigDecimal) defaultValue), 0,
-                    "Null should be replaced with default at docId " + docId + 
" (sequential)");
-              } else {
-                Assert.assertEquals(actualValue, defaultValue,
-                    "Null should be replaced with default at docId " + docId + 
" (sequential)");
-              }
-            } else {
-              // Convert expected value if converter is provided
-              Object convertedValue = valueConverter != null ? 
valueConverter.apply(expectedValue) : expectedValue;
-              if (actualValue instanceof Double) {
-                Assert.assertEquals((Double) actualValue, (Double) 
convertedValue, 0.0001,
-                    "Value mismatch at docId " + docId + " (sequential)");
-              } else if (actualValue instanceof Float) {
-                Assert.assertEquals((Float) actualValue, (Float) 
convertedValue, 0.0001f,
-                    "Value mismatch at docId " + docId + " (sequential)");
-              } else if (actualValue instanceof BigDecimal) {
-                Assert.assertEquals(((BigDecimal) 
actualValue).compareTo((BigDecimal) convertedValue), 0,
-                    "Value mismatch at docId " + docId + " (sequential)");
-              } else {
-                Assert.assertEquals(actualValue, convertedValue,
-                    "Value mismatch at docId " + docId + " (sequential)");
-              }
-            }
+            assertEquals(actualValue, convertedExpectedValue, "Value mismatch 
at docId " + docId);
           }
-          docId++;
         }
-        Assert.assertEquals(docId, totalDocs, "Should have iterated through 
all documents");
       }
     } finally {
       reader.close();
     }
   }
 
-
-  /**
-   * Data provider for multi-value accessor methods.
-   * @return test parameters: column name, random access getter function,
-   * sequential getter function, array converter function, index type
-   */
+  /// Data provider for multi-value accessor methods.
+  ///
+  /// @return test parameters: column name, random access getter function, 
array converter function, index type
   @DataProvider(name = "multiValueAccessorProvider")
   public Object[][] multiValueAccessorProvider() {
     // Base column configurations
     // For primitive MV types, we need to extract .getValues() from 
MultiValueResult
     // and verify that hasNulls() is false (nulls are removed by 
NullValueTransformer for MV primitive types)
-    Object[][] baseConfigs = new Object[][] {
-        {MV_INT_COL, (MultiValueGetter) (reader, docId) -> {
+    Object[][] baseConfigs = new Object[][]{
+        {
+            MV_INT_COL, (MultiValueGetter) (reader, docId) -> {
           MultiValueResult<int[]> result = reader.getIntMV(docId);
-          Assert.assertFalse(result.hasNulls(), "Multi-value primitive types 
should not have nulls");
+          assertFalse(result.hasNulls(), "Multi-value primitive types should 
not have nulls");

Review Comment:
   Reformatting deferred to `spotless:apply` (pre-commit). The multi-line rows 
only carry per-row converter/getter lambdas; the simple rows stay one-line, at 
consistent 12-space indentation. Not hand-editing to avoid diverging from 
spotless's canonical output.



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/readers/DefaultValueColumnReaderTest.java:
##########
@@ -19,24 +19,26 @@
 package org.apache.pinot.segment.local.segment.readers;
 
 import java.math.BigDecimal;
-import java.util.Arrays;
 import org.apache.pinot.spi.data.DimensionFieldSpec;
 import org.apache.pinot.spi.data.FieldSpec;
 import org.apache.pinot.spi.data.MetricFieldSpec;
 import org.apache.pinot.spi.data.readers.MultiValueResult;
-import org.testng.Assert;
+import org.apache.pinot.spi.utils.PinotDataType;
 import org.testng.annotations.Test;
 
+import static org.testng.Assert.assertEquals;
+import static org.testng.Assert.assertFalse;
+import static org.testng.Assert.assertSame;
+import static org.testng.Assert.assertTrue;
+
 
 /// Comprehensive tests for [DefaultValueColumnReader].
 ///
 /// This test validates:
 /// - Single-value fields for all data types (INT, LONG, FLOAT, DOUBLE, 
BIG_DECIMAL, STRING, BYTES)
 /// - Multi-value fields for all data types
-/// - Sequential access methods (next*, hasNext, skipNext, isNextNull)
 /// - Random access methods (get*, isNull)
 /// - Type indicator methods (isInt, isLong, isBigDecimal, etc.)

Review Comment:
   Fixed in 6471bb8 — updated to "Value type reporting (getValueType)".



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/readers/DefaultValueColumnReaderTest.java:
##########
@@ -308,20 +231,20 @@ public void testMultiValueFloatColumn() {
     FieldSpec fieldSpec = new DimensionFieldSpec(COLUMN_NAME, 
FieldSpec.DataType.FLOAT, false);
     DefaultValueColumnReader reader = new 
DefaultValueColumnReader(COLUMN_NAME, NUM_DOCS, fieldSpec);
 
-    // Test sequential access with nextFloatMV
+    // Test random access
     float expectedValue = ((Number) 
fieldSpec.getDefaultNullValue()).floatValue();
     float[] expectedArray = new float[]{expectedValue};
-    for (int i = 0; i < NUM_DOCS; i++) {
-      MultiValueResult<float[]> mvResult = reader.nextFloatMV();
-      Assert.assertFalse(mvResult.hasNulls());
-      Assert.assertTrue(Arrays.equals(mvResult.getValues(), expectedArray));
+    for (int docId = 0; docId < reader.getTotalDocs(); docId++) {
+      MultiValueResult<float[]> mvResult = reader.getFloatMV(docId);
+      assertFalse(mvResult.hasNulls());
+      assertEquals(mvResult.getValues(), expectedArray);
     }
 
-    // Test random access and array reuse
-    reader.rewind();
+    // Test array reuse
+    // Test array reuse

Review Comment:
   Fixed in 6471bb8 — collapsed the duplicate comment.



##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowColumnReader.java:
##########
@@ -35,60 +35,47 @@
 import org.apache.arrow.vector.complex.ListVector;
 import org.apache.pinot.spi.data.readers.ColumnReader;
 import org.apache.pinot.spi.data.readers.MultiValueResult;
-
-
-/**
- * Column reader for Apache Arrow {@link FieldVector}.
- *
- * <p>Wraps a single Arrow {@link FieldVector} and exposes sequential and 
random-access
- * read patterns conforming to the {@link ColumnReader} contract. The vector 
is owned by
- * the enclosing {@link ArrowColumnReaderFactory} which is responsible for its 
lifecycle;
- * closing this reader is a no-op for the underlying vector.
- *
- * <p>Supported Arrow types map to Pinot's stored types as follows:
- * <ul>
- *   <li>{@code IntVector} / {@code BitVector} → {@code INT} (BitVector 
promoted to 0/1)</li>
- *   <li>{@code BigIntVector} → {@code LONG}</li>
- *   <li>{@code Float4Vector} → {@code FLOAT}</li>
- *   <li>{@code Float8Vector} → {@code DOUBLE}</li>
- *   <li>{@code DecimalVector} → {@code BIG_DECIMAL}</li>
- *   <li>{@code VarCharVector} → {@code STRING}</li>
- *   <li>{@code VarBinaryVector} → {@code BYTES}</li>
- *   <li>{@code ListVector} of the above → multi-value variant</li>
- * </ul>
- *
- * <p>The list above applies to the typed primitive accessors only ({@link 
#getInt},
- * {@link #getString}, {@link #getIntMV}, ...). Complex types (Map, Struct, 
Union, ...)
- * and temporal types are still readable via the generic {@link 
#getValue(int)} /
- * {@link #next()} accessors, which delegate to {@link 
ArrowToPinotTypeConverter} and
- * return the same canonical JDK types as the row-major path ({@link
- * ArrowRecordExtractor}) — e.g. {@code Map<String, Object>} for Struct / Map,
- * {@code Object[]} for List variants, {@code LocalDate} / {@code LocalTime} /
- * {@code Timestamp} for temporal types.
- *
- * <p><b>BitVector caveat:</b> Arrow's {@link FieldVector#getObject} returns a 
{@code Boolean}
- * for a {@link BitVector}, but Pinot stores booleans as {@code INT} (0/1). 
The generic
- * {@link #getValue(int)} therefore special-cases {@code BitVector} to return 
an {@code Integer}
- * 0/1, keeping it consistent with {@link #getInt(int)} and the advertised INT 
mapping above
- * rather than surfacing the shared converter's {@code Boolean}.
- *
- * <p>{@code @SuppressWarnings("serial")}: {@link ColumnReader} is {@link 
java.io.Serializable} by SPI
- * contract, but this reader holds non-serializable Arrow state and is never 
serialized — it lives
- * only for the duration of a columnar segment build.
- *
- * <p>This class is not thread-safe.
- */
-@SuppressWarnings("serial")
+import org.apache.pinot.spi.utils.PinotDataType;
+
+
+/// Column reader for an Apache Arrow [FieldVector].
+///
+/// Wraps a single Arrow [FieldVector] and exposes random-access reads 
conforming to the [ColumnReader] contract. The
+/// vector is owned by the enclosing [ArrowColumnReaderFactory], which is 
responsible for its lifecycle; closing this
+/// reader is a no-op for the underlying vector.
+///
+/// Supported Arrow types map to Pinot's stored types as follows:
+/// - `IntVector` / `BitVector` → `INT` (BitVector promoted to 0/1)
+/// - `BigIntVector` → `LONG`
+/// - `Float4Vector` → `FLOAT`
+/// - `Float8Vector` → `DOUBLE`
+/// - `DecimalVector` → `BIG_DECIMAL`
+/// - `VarCharVector` → `STRING`
+/// - `VarBinaryVector` → `BYTES`
+/// - `ListVector` of the above → multi-value variant
+///
+/// The list above applies to the typed primitive accessors only 
([#getInt(int)], [#getString(int)], [#getIntMV(int)],
+/// ...). Complex types (Map, Struct, Union, ...) and temporal types are still 
readable via the generic
+/// [#getValue(int)] accessor, which delegates to [ArrowToPinotTypeConverter] 
and returns the same canonical JDK types
+/// as the row-major path ([ArrowRecordExtractor]) — e.g. `Map<String, 
Object>` for Struct / Map, `Object[]` for List
+/// variants, `LocalDate` / `LocalTime` / `Timestamp` for temporal types.
+///
+/// **BitVector caveat:** Arrow's [FieldVector#getObject(int)] returns a 
`Boolean` for a `BitVector`, but Pinot stores
+/// booleans as `INT` (0/1). The generic [#getValue(int)] therefore 
special-cases `BitVector` to return an `Integer`
+/// 0/1, keeping it consistent with [#getInt(int)] and the advertised INT 
mapping above rather than surfacing the
+/// shared converter's `Boolean`.
+///
+/// `@SuppressWarnings("serial")`: [ColumnReader] is [java.io.Serializable] by 
SPI contract, but this reader holds
+/// non-serializable Arrow state and is never serialized — it lives only for 
the duration of a columnar segment build.
+///
+/// This class is not thread-safe.
 public class ArrowColumnReader implements ColumnReader {

Review Comment:
   The `@SuppressWarnings("serial")` is redundant here: the sibling 
`ColumnReader` implementations (`PinotSegmentColumnReaderImpl`, 
`DefaultValueColumnReader`, `InMemoryColumnReader`) are `Serializable` via the 
interface without it and compile clean, so the `serial` lint is not emitted. 
Removed the stale Javadoc reference in 6471bb8 rather than restoring a 
redundant annotation.



##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/BatchedArrowColumnReader.java:
##########
@@ -27,38 +27,30 @@
 import org.apache.arrow.vector.types.pojo.Field;
 import org.apache.pinot.spi.data.readers.ColumnReader;
 import org.apache.pinot.spi.data.readers.MultiValueResult;
-
-
-/**
- * A {@link ColumnReader} over one column of a {@link BatchedArrowFileSource}. 
Reads are served by
- * delegating to a per-batch {@link ArrowColumnReader} over the current record 
batch's column vector;
- * the source loads at most one batch at a time, so peak resident memory is 
one batch (see
- * {@link BatchedArrowFileSource}).
- *
- * <p>Random access ({@code getXxx(docId)}) seeks to the owning batch and 
reads it; sequential access
- * ({@code next} / typed {@code nextXxx}) is built on it. Type predicates are 
derived from the column's
- * (decoded) Arrow {@link Field}, so they need no batch load. The column-major 
segment build consumes
- * each column to completion sequentially (rewind + next), one column at a 
time.
- *
- * <p>The per-batch delegate is invalidated when the shared source moves to 
another batch (whether
- * this reader advanced it or a different one did); {@link #delegate(int)} 
documents why and rebuilds
- * it. Interleaving readers therefore stays correct (it just re-loads 
batches); the build never
- * interleaves.
- *
- * <p>This class is not thread-safe. {@code @SuppressWarnings("serial")}: 
{@link ColumnReader} is
- * {@link java.io.Serializable} by SPI contract, but this reader holds 
non-serializable Arrow state
- * and is never serialized.
- */
-@SuppressWarnings("serial")
+import org.apache.pinot.spi.utils.PinotDataType;
+
+
+/// A [ColumnReader] over one column of a [BatchedArrowFileSource]. Reads are 
served by delegating to a per-batch
+/// [ArrowColumnReader] over the current record batch's column vector; the 
source loads at most one batch at a time, so
+/// peak resident memory is one batch (see [BatchedArrowFileSource]).
+///
+/// Random access (`getXxx(docId)`) seeks to the owning batch and reads it. 
The value type is derived from the column's
+/// (decoded) Arrow [Field], so it needs no batch load. The column-major 
segment build consumes each column to
+/// completion (by ascending docId), one column at a time.
+///
+/// The per-batch delegate is invalidated when the shared source moves to 
another batch (whether this reader advanced
+/// it or a different one did); [#delegate(int)] documents why and rebuilds 
it. Interleaving readers therefore stays
+/// correct (it just re-loads batches); the build never interleaves.
+///
+/// This class is not thread-safe. `@SuppressWarnings("serial")`: 
[ColumnReader] is [java.io.Serializable] by SPI
+/// contract, but this reader holds non-serializable Arrow state and is never 
serialized.
 public class BatchedArrowColumnReader implements ColumnReader {

Review Comment:
   The `@SuppressWarnings("serial")` is redundant here: the sibling 
`ColumnReader` implementations (`PinotSegmentColumnReaderImpl`, 
`DefaultValueColumnReader`, `InMemoryColumnReader`) are `Serializable` via the 
interface without it and compile clean, so the `serial` lint is not emitted. 
Removed the stale Javadoc reference in 6471bb8 rather than restoring a 
redundant annotation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

Reply via email to