Jackie-Jiang commented on code in PR #18400:
URL: https://github.com/apache/pinot/pull/18400#discussion_r3175845451
##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -77,14 +68,6 @@ public void init(@Nullable Set<String> fields, @Nullable
RecordExtractorConfig c
protected void initConfig(@Nullable RecordExtractorConfig config) {
}
Review Comment:
Acknowledged as a deliberate API change. `BaseRecordExtractor` is part of
`pinot-spi`'s extractor API which has historically been treated as Evolving —
`shouldExtract` was a thin convenience over `_extractAll ||
_fields.contains(name)` and had zero in-tree callers. Any third-party subclass
that called it can switch to the inline check trivially. Trading the small
breakage now for a cleaner API that doesn't pretend `_extractAll` and per-field
include-list are interchangeable.
##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java:
##########
@@ -157,11 +188,6 @@ public static JsonNode stringToJsonNode(String jsonString)
return DEFAULT_READER.readTree(jsonString);
}
Review Comment:
Acknowledged. The pre-existing signature `Map<String, String>` was lying
about the runtime type — internally it deserialized via `MAP_TYPE_REFERENCE`
(`Map<String, Object>`), so any external caller using it as `Map<String,
String>` would CCE on the first non-string value. Removing the broken method is
a forcing function for callers who only got away with it because their data
happened to be all-strings; they should migrate to the honest `jsonNodeToMap`
(which is exactly the same code with the correct type). Net: shorter migration
path than letting a buggy signature linger.
##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -211,22 +194,18 @@ protected Map<Object, Object> convertRecord(Object value)
{
getClass().getSimpleName() + " does not support nested records;
override convertRecord() to enable.");
}
- /// Converts a single value per the [RecordExtractor] contract:
+ /// Universal normalizations for types that any format extractor might
produce:
/// - `Byte` / `Short` widen to `Integer` so all small ints unify behind a
single Pinot type
/// - [BigInteger] widens to [BigDecimal] (Pinot has no `BigInteger` data
type; downstream transforms handle
/// `BigDecimal` natively)
/// - other `Number` (`Integer` / `Long` / `Float` / `Double` /
`BigDecimal`) passes through
/// - `Boolean` passes through
/// - `byte[]` passes through
- /// - [Timestamp] passes through
/// - `ByteBuffer` materializes to `byte[]` (sliced so the source buffer's
position is not advanced)
///
- /// [Temporal] family (`java.time`):
- /// - [LocalDate] / [LocalTime] pass through (TZ-independent date /
time-of-day)
- /// - [Instant] / [OffsetDateTime] / [ZonedDateTime] bridge to [Timestamp]
via [Timestamp#from] — all three are
- /// unambiguously TZ-anchored, so the conversion is lossless (sub-second
nanos preserved)
- /// - [LocalDateTime] bridges to [Timestamp] interpreting the wall-clock as
UTC
- /// - other [Temporal] types fall through to `value.toString()`
+ /// Date / time types (DATE / TIME / TIMESTAMP) are NOT handled here — see
[RecordExtractor] for the contract
+ /// and the format-specific extractor (`convertSingleValue` override or
direct construction in `extractValue`)
+ /// for the native-to-contract conversion.
Review Comment:
Acknowledged — this IS the contract refinement. Per the PR description:
every time-typed extractor output is timezone-independent, and the right
narrowing depends on what the format actually produces (Avro `timestamp-millis`
from Instant, JDBC `getTimestamp()` from Timestamp, Parquet `INT96`, etc.).
Pinot's bundled extractors that produce time types (Avro, Parquet, ORC, Arrow)
are all updated in this PR. Third-party `BaseRecordExtractor` subclasses
producing time types do need a one-time update — this is exactly the visible
breakage we'd want for a contract change rather than silent behavior drift.
##########
pinot-spi/src/main/java/org/apache/pinot/spi/utils/MapUtils.java:
##########
@@ -50,11 +49,13 @@ private MapUtils() {
}
private static final Logger LOGGER = LoggerFactory.getLogger(MapUtils.class);
- private static final ObjectMapper SORTED_MAPPER =
- new
ObjectMapper().configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS, true);
- private static final ObjectMapper UNSORTED_MAPPER = new ObjectMapper();
- private static final TypeReference<Map<String, Object>> MAP_TYPE_REFERENCE =
new TypeReference<>() {
- };
+
+ // Pinot's standard ObjectMapper config (JSR-310 / ISO-8601 for LocalDate /
LocalTime — see JsonUtils),
+ // plus key-ordered output for the sorted variant so byte output is a pure
function of the logical map.
+ // Writes only — read paths delegate to JsonUtils, whose DEFAULT_READER is
already JSR-310-aware.
+ private static final ObjectWriter SORTED_WRITER =
JsonUtils.newObjectMapperWithJavaTime()
+ .configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS,
true).writer();
+ private static final ObjectWriter UNSORTED_WRITER =
JsonUtils.newObjectMapperWithJavaTime().writer();
Review Comment:
Fair. Added 4 cases in `MapUtilsTest`:
- `testToStringSerializesLocalDateAsIsoString` /
`testToStringSerializesLocalTimeAsIsoString` — JSR-310 → ISO output via the
`MapUtils` writers (separate from `JsonUtils`).
- `testSerializeMapWithLocalDateRoundTrips` — full
`serializeMap`/`deserializeMap` cycle confirms the binary path doesn't crash
and the deserialized form is the expected ISO string.
- `testToStringSerializesNestedMapWithLocalDate` — the original
integration-test-failure scenario: Map containing a nested Map with a LocalDate
value.
##########
pinot-plugins/pinot-input-format/pinot-arrow/src/main/java/org/apache/pinot/plugin/inputformat/arrow/ArrowToGenericRowConverter.java:
##########
@@ -246,12 +245,10 @@ private Object
convertArrowTypeToPinotCompatible(@Nullable Object value) {
return value.toString();
}
- // Handle Arrow LocalDateTime -> java.sql.Timestamp conversion
+ // Handle Arrow LocalDateTime -> Long epoch millis (per RecordExtractor
contract: TIMESTAMP → Long).
if (value instanceof LocalDateTime) {
- // Arrow TimeStampMilliVector.getObject() returns LocalDateTime, but
Pinot expects
- // java.sql.Timestamp objects for proper timestamp handling and native
support
LocalDateTime dateTime = (LocalDateTime) value;
- return Timestamp.from(dateTime.toInstant(ZoneOffset.UTC));
+ return dateTime.toInstant(ZoneOffset.UTC).toEpochMilli();
Review Comment:
Good catch. Added `testArrowTimestampConvertedToExactUtcEpochMillis`: builds
a single-row Arrow batch with a fixed `baseTime = 1649924302123L` (instead of
`System.currentTimeMillis()`), runs through the decoder, and asserts the
converter emits exactly that long. This catches both wrong-math and wrong-zone
regressions — if the converter started using JVM-local instead of UTC the
assertion would fail (unless the JVM happens to be in UTC).
`ArrowTestDataUtil.createMultiTypeArrowIpcData` got a new overload that takes
the base time so the existing test still passes through with
`System.currentTimeMillis()`.
##########
pinot-plugins/pinot-input-format/pinot-orc/src/main/java/org/apache/pinot/plugin/inputformat/orc/ORCRecordExtractor.java:
##########
@@ -206,13 +206,12 @@ private static Object extractSingleValue(String field,
ColumnVector columnVector
}
case TIMESTAMP:
case TIMESTAMP_INSTANT: {
- // `time[rowId]` is epoch millis truncated to second precision;
sub-second nanos (0..999_999_999)
+ // `time[rowId]` is epoch millis (truncated to second precision);
sub-second nanos (0..999_999_999)
// live in `nanos[rowId]`. `TIMESTAMP_INSTANT` (UTC instant) shares
the same vector layout as
- // `TIMESTAMP` (local). Construct a `Timestamp` and set the full nanos
field to preserve precision.
+ // `TIMESTAMP` (local). Per the contract, output is `Long` epoch
millis; sub-millisecond precision is
+ // dropped here to match Pinot's TIMESTAMP physical storage.
TimestampColumnVector tsv = (TimestampColumnVector) columnVector;
- Timestamp ts = new Timestamp(tsv.time[rowId]);
- ts.setNanos(tsv.nanos[rowId]);
- return ts;
+ return tsv.time[rowId] + tsv.nanos[rowId] / 1_000_000L;
Review Comment:
Fair. Added `testTimestampPreEpochExtractedAsLongEpochMillis`: pre-epoch
value `-60_000` millis (1969-12-31T23:59:00Z) with `time = -60_000, nanos = 0`.
ORC's `nanos` field is always non-negative (0..999_999_999), so the relevant
edge case is whether the extractor's `time + nanos / 1_000_000` arithmetic
handles negative `time` correctly — which it does because Java integer division
on the non-negative `nanos` is well-defined and the addition preserves sign.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]