Re: [PR] Refine RecordExtractor contract: TIMESTAMP narrows to Long, push time conversions per-format [pinot]

via GitHub Fri, 01 May 2026 19:01:47 -0700


Jackie-Jiang commented on code in PR #18400:
URL: https://github.com/apache/pinot/pull/18400#discussion_r3175845451



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -77,14 +68,6 @@ public void init(@Nullable Set<String> fields, @Nullable 
RecordExtractorConfig c
   protected void initConfig(@Nullable RecordExtractorConfig config) {
   }
 

Review Comment:
   Acknowledged as a deliberate API change. `BaseRecordExtractor` is part of 
`pinot-spi`'s extractor API which has historically been treated as Evolving — 
`shouldExtract` was a thin convenience over `_extractAll || 
_fields.contains(name)` and had zero in-tree callers. Any third-party subclass 
that called it can switch to the inline check trivially. Trading the small 
breakage now for a cleaner API that doesn't pretend `_extractAll` and per-field 
include-list are interchangeable.



##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java:
##########
@@ -157,11 +188,6 @@ public static JsonNode stringToJsonNode(String jsonString)
     return DEFAULT_READER.readTree(jsonString);
   }
 

Review Comment:
   Acknowledged. The pre-existing signature `Map<String, String>` was lying 
about the runtime type — internally it deserialized via `MAP_TYPE_REFERENCE` 
(`Map<String, Object>`), so any external caller using it as `Map<String, 
String>` would CCE on the first non-string value. Removing the broken method is 
a forcing function for callers who only got away with it because their data 
happened to be all-strings; they should migrate to the honest `jsonNodeToMap` 
(which is exactly the same code with the correct type). Net: shorter migration 
path than letting a buggy signature linger.



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -211,22 +194,18 @@ protected Map<Object, Object> convertRecord(Object value) 
{
         getClass().getSimpleName() + " does not support nested records; 
override convertRecord() to enable.");
   }
 
-  /// Converts a single value per the [RecordExtractor] contract:
+  /// Universal normalizations for types that any format extractor might 
produce:
   /// - `Byte` / `Short` widen to `Integer` so all small ints unify behind a 
single Pinot type
   /// - [BigInteger] widens to [BigDecimal] (Pinot has no `BigInteger` data 
type; downstream transforms handle
   ///   `BigDecimal` natively)
   /// - other `Number` (`Integer` / `Long` / `Float` / `Double` / 
`BigDecimal`) passes through
   /// - `Boolean` passes through
   /// - `byte[]` passes through
-  /// - [Timestamp] passes through
   /// - `ByteBuffer` materializes to `byte[]` (sliced so the source buffer's 
position is not advanced)
   ///
-  /// [Temporal] family (`java.time`):
-  /// - [LocalDate] / [LocalTime] pass through (TZ-independent date / 
time-of-day)
-  /// - [Instant] / [OffsetDateTime] / [ZonedDateTime] bridge to [Timestamp] 
via [Timestamp#from] — all three are
-  ///   unambiguously TZ-anchored, so the conversion is lossless (sub-second 
nanos preserved)
-  /// - [LocalDateTime] bridges to [Timestamp] interpreting the wall-clock as 
UTC
-  /// - other [Temporal] types fall through to `value.toString()`
+  /// Date / time types (DATE / TIME / TIMESTAMP) are NOT handled here — see 
[RecordExtractor] for the contract
+  /// and the format-specific extractor (`convertSingleValue` override or 
direct construction in `extractValue`)
+  /// for the native-to-contract conversion.

Review Comment:
   Acknowledged — this IS the contract refinement. Per the PR description: 
every time-typed extractor output is timezone-independent, and the right 
narrowing depends on what the format actually produces (Avro `timestamp-millis` 
from Instant, JDBC `getTimestamp()` from Timestamp, Parquet `INT96`, etc.). 
Pinot's bundled extractors that produce time types (Avro, Parquet, ORC, Arrow) 
are all updated in this PR. Third-party `BaseRecordExtractor` subclasses 
producing time types do need a one-time update — this is exactly the visible 
breakage we'd want for a contract change rather than silent behavior drift.



##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/MapUtils.java:
##########
@@ -50,11 +49,13 @@ private MapUtils() {
   }
 
   private static final Logger LOGGER = LoggerFactory.getLogger(MapUtils.class);
-  private static final ObjectMapper SORTED_MAPPER =
-      new 
ObjectMapper().configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS, true);
-  private static final ObjectMapper UNSORTED_MAPPER = new ObjectMapper();
-  private static final TypeReference<Map<String, Object>> MAP_TYPE_REFERENCE = 
new TypeReference<>() {
-  };
+
+  // Pinot's standard ObjectMapper config (JSR-310 / ISO-8601 for LocalDate / 
LocalTime — see JsonUtils),
+  // plus key-ordered output for the sorted variant so byte output is a pure 
function of the logical map.
+  // Writes only — read paths delegate to JsonUtils, whose DEFAULT_READER is 
already JSR-310-aware.
+  private static final ObjectWriter SORTED_WRITER = 
JsonUtils.newObjectMapperWithJavaTime()
+      .configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS, 
true).writer();
+  private static final ObjectWriter UNSORTED_WRITER = 
JsonUtils.newObjectMapperWithJavaTime().writer();

Review Comment:
   Fair. Added 4 cases in `MapUtilsTest`:
   - `testToStringSerializesLocalDateAsIsoString` / 
`testToStringSerializesLocalTimeAsIsoString` — JSR-310 → ISO output via the 
`MapUtils` writers (separate from `JsonUtils`).
   - `testSerializeMapWithLocalDateRoundTrips` — full 
`serializeMap`/`deserializeMap` cycle confirms the binary path doesn't crash 
and the deserialized form is the expected ISO string.
   - `testToStringSerializesNestedMapWithLocalDate` — the original 
integration-test-failure scenario: Map containing a nested Map with a LocalDate 
value.



##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToGenericRowConverter.java:
##########
@@ -246,12 +245,10 @@ private Object 
convertArrowTypeToPinotCompatible(@Nullable Object value) {
       return value.toString();
     }
 
-    // Handle Arrow LocalDateTime -> java.sql.Timestamp conversion
+    // Handle Arrow LocalDateTime -> Long epoch millis (per RecordExtractor 
contract: TIMESTAMP → Long).
     if (value instanceof LocalDateTime) {
-      // Arrow TimeStampMilliVector.getObject() returns LocalDateTime, but 
Pinot expects
-      // java.sql.Timestamp objects for proper timestamp handling and native 
support
       LocalDateTime dateTime = (LocalDateTime) value;
-      return Timestamp.from(dateTime.toInstant(ZoneOffset.UTC));
+      return dateTime.toInstant(ZoneOffset.UTC).toEpochMilli();

Review Comment:
   Good catch. Added `testArrowTimestampConvertedToExactUtcEpochMillis`: builds 
a single-row Arrow batch with a fixed `baseTime = 1649924302123L` (instead of 
`System.currentTimeMillis()`), runs through the decoder, and asserts the 
converter emits exactly that long. This catches both wrong-math and wrong-zone 
regressions — if the converter started using JVM-local instead of UTC the 
assertion would fail (unless the JVM happens to be in UTC). 
`ArrowTestDataUtil.createMultiTypeArrowIpcData` got a new overload that takes 
the base time so the existing test still passes through with 
`System.currentTimeMillis()`.



##########
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java:
##########
@@ -206,13 +206,12 @@ private static Object extractSingleValue(String field, 
ColumnVector columnVector
       }
       case TIMESTAMP:
       case TIMESTAMP_INSTANT: {
-        // `time[rowId]` is epoch millis truncated to second precision; 
sub-second nanos (0..999_999_999)
+        // `time[rowId]` is epoch millis (truncated to second precision); 
sub-second nanos (0..999_999_999)
         // live in `nanos[rowId]`. `TIMESTAMP_INSTANT` (UTC instant) shares 
the same vector layout as
-        // `TIMESTAMP` (local). Construct a `Timestamp` and set the full nanos 
field to preserve precision.
+        // `TIMESTAMP` (local). Per the contract, output is `Long` epoch 
millis; sub-millisecond precision is
+        // dropped here to match Pinot's TIMESTAMP physical storage.
         TimestampColumnVector tsv = (TimestampColumnVector) columnVector;
-        Timestamp ts = new Timestamp(tsv.time[rowId]);
-        ts.setNanos(tsv.nanos[rowId]);
-        return ts;
+        return tsv.time[rowId] + tsv.nanos[rowId] / 1_000_000L;

Review Comment:
   Fair. Added `testTimestampPreEpochExtractedAsLongEpochMillis`: pre-epoch 
value `-60_000` millis (1969-12-31T23:59:00Z) with `time = -60_000, nanos = 0`. 
ORC's `nanos` field is always non-negative (0..999_999_999), so the relevant 
edge case is whether the extractor's `time + nanos / 1_000_000` arithmetic 
handles negative `time` correctly — which it does because Java integer division 
on the non-negative `nanos` is well-defined and the addition preserves sign.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Refine RecordExtractor contract: TIMESTAMP narrows to Long, push time conversions per-format [pinot]

Reply via email to