Copilot commented on code in PR #18400:
URL: https://github.com/apache/pinot/pull/18400#discussion_r3185724224
##########
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroRecordExtractor.java:
##########
@@ -41,94 +55,209 @@
/// - avro `double` → `Double` → `Double`
/// - avro `string` → `Utf8` → `String` (via `Utf8.toString()`)
/// - avro `bytes` → `ByteBuffer` → `byte[]` (materialized from the buffer's
remaining content)
-/// - avro `fixed` → [GenericFixed] / `GenericData.Fixed` → `byte[]`
+/// - avro `fixed` → `GenericFixed` / `GenericData.Fixed` → `byte[]`
/// - avro `enum` → `GenericData.EnumSymbol` → `String` (enum name)
/// - avro `array<T>` → `List<T>` → `Object[]` (each element recursively
converted)
/// - avro `map<string, T>` → `Map<Utf8, T>` → `Map<String, Object>` (each
value recursively converted)
-/// - avro nested `record` → [GenericRecord] → `Map<String, Object>`
+/// - avro nested `record` → `GenericRecord` → `Map<String, Object>`
/// - avro `union[null, X]` with the `null` branch selected → `null`
///
-/// **Logical types** (applied when `enableLogicalTypes = true`, the config
default; `GenericDatumReader` emits
-/// the post-conversion Java types listed below):
-/// - `decimal` → `BigDecimal` → `BigDecimal`
-/// - `uuid` → `UUID` → `String` (UUID string form via `UUID.toString()`)
-/// - `date` → `LocalDate` → `LocalDate` (passes through; TZ-independent)
-/// - `time-millis` / `time-micros` → `LocalTime` → `LocalTime` (passes
through; full ns precision preserved)
-/// - `timestamp-millis` / `timestamp-micros` → `Instant` →
`java.sql.Timestamp` (sub-millisecond nanos preserved
-/// via `Timestamp.getNanos()`)
+/// **Logical types** (the reader emits the raw Avro physical type; this
extractor applies the conversion via
+/// [#CONVERSION_MAP] to produce the Pinot contract type):
+/// - `decimal` / `big-decimal` → raw `ByteBuffer` → [BigDecimal] (always
converted; raw bytes aren't interpretable
+/// without external precision/scale)
+/// - `timestamp-millis` → raw `Long` → [Timestamp], or `Long` raw epoch
millis when `extractRawTimeValues` is `true`
+/// - `timestamp-micros` → raw `Long` → [Timestamp] (sub-millisecond micros
preserved), or `Long` raw epoch micros when
+/// `extractRawTimeValues` is `true`
+/// - `timestamp-nanos` → raw `Long` → [Timestamp] (nanosecond precision
preserved), or `Long` raw epoch nanos when
+/// `extractRawTimeValues` is `true`
+/// - `date` → raw `Integer` → [LocalDate], or `Integer` raw days-since-epoch
when `extractRawTimeValues` is `true`
+/// - `time-millis` → raw `Integer` → [LocalTime], or `Integer` raw
ms-since-midnight when `extractRawTimeValues` is
+/// `true`
+/// - `time-micros` → raw `Long` → [LocalTime], or `Long` raw
µs-since-midnight when `extractRawTimeValues` is `true`
+/// - `uuid` → raw `Utf8` / `GenericFixed` → [UUID] (always converted;
downstream type transformer adapts to the Pinot
+/// column type — `STRING` column gets canonical UUID string, `BYTES` column
gets 16-byte big-endian form)
public class AvroRecordExtractor extends BaseRecordExtractor<GenericRecord> {
- private boolean _applyLogicalTypes = true;
+ private static final Conversion<BigDecimal> DECIMAL_CONVERSION = new
Conversions.DecimalConversion();
+ private static final Conversion<BigDecimal> BIG_DECIMAL_CONVERSION = new
Conversions.BigDecimalConversion();
+ private static final Conversion<Instant> TIMESTAMP_MILLIS_CONVERSION =
+ new TimeConversions.TimestampMillisConversion();
+ private static final Conversion<Instant> TIMESTAMP_MICROS_CONVERSION =
+ new TimeConversions.TimestampMicrosConversion();
+ private static final Conversion<Instant> TIMESTAMP_NANOS_CONVERSION = new
TimeConversions.TimestampNanosConversion();
+ private static final Conversion<LocalDate> DATE_CONVERSION = new
TimeConversions.DateConversion();
+ private static final Conversion<LocalTime> TIME_MILLIS_CONVERSION = new
TimeConversions.TimeMillisConversion();
+ private static final Conversion<LocalTime> TIME_MICROS_CONVERSION = new
TimeConversions.TimeMicrosConversion();
+ private static final Conversion<UUID> UUID_CONVERSION = new
Conversions.UUIDConversion();
+
+ private static final Map<String, Conversion<?>> CONVERSION_MAP = Map.of(
+ DECIMAL_CONVERSION.getLogicalTypeName(), DECIMAL_CONVERSION,
+ BIG_DECIMAL_CONVERSION.getLogicalTypeName(), BIG_DECIMAL_CONVERSION,
+ TIMESTAMP_MILLIS_CONVERSION.getLogicalTypeName(),
TIMESTAMP_MILLIS_CONVERSION,
+ TIMESTAMP_MICROS_CONVERSION.getLogicalTypeName(),
TIMESTAMP_MICROS_CONVERSION,
+ TIMESTAMP_NANOS_CONVERSION.getLogicalTypeName(),
TIMESTAMP_NANOS_CONVERSION,
+ DATE_CONVERSION.getLogicalTypeName(), DATE_CONVERSION,
+ TIME_MILLIS_CONVERSION.getLogicalTypeName(), TIME_MILLIS_CONVERSION,
+ TIME_MICROS_CONVERSION.getLogicalTypeName(), TIME_MICROS_CONVERSION,
+ UUID_CONVERSION.getLogicalTypeName(), UUID_CONVERSION
+ );
+
+ private static final Set<String> TEMPORAL_LOGICAL_TYPES = Set.of(
+ TIMESTAMP_MILLIS_CONVERSION.getLogicalTypeName(),
+ TIMESTAMP_MICROS_CONVERSION.getLogicalTypeName(),
+ TIMESTAMP_NANOS_CONVERSION.getLogicalTypeName(),
+ DATE_CONVERSION.getLogicalTypeName(),
+ TIME_MILLIS_CONVERSION.getLogicalTypeName(),
Review Comment:
`timestamp-nanos` was added to the Avro logical-type conversion path, but
this file's tests still cover only millis/micros timestamps. Because nanosecond
handling has its own conversion class and raw-mode behavior, a regression there
would currently slip through unnoticed; please add end-to-end extractor tests
for both the default `Timestamp` path and `extractRawTimeValues=true`.
##########
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetRecordReaderConfig.java:
##########
@@ -18,34 +18,31 @@
*/
package org.apache.pinot.plugin.inputformat.parquet;
-import org.apache.commons.configuration2.Configuration;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;
-/**
- * Config for ParquetRecordReader
- */
+/// Config for [ParquetRecordReader] (and the underlying
[ParquetAvroRecordReader] / [ParquetNativeRecordReader]).
+/// Three settings, all default `false`:
+/// - `useParquetAvroRecordReader` — force the parquet-avro reader.
+/// - `useParquetNativeRecordReader` — force the native parquet reader. When
neither flag is set, the dispatcher
+/// auto-detects via the file's `avro.schema` metadata (Avro reader if
present, native otherwise).
+/// - `extractRawTimeValues` — opt out of TIMESTAMP / DATE / TIME conversion
at the extractor boundary,
+/// surfacing the raw underlying integer in the column's declared unit
instead of the contract Java type.
+/// DECIMAL and UUID always convert. See [ParquetAvroRecordExtractor] /
[ParquetNativeRecordExtractor] for
+/// the per-type matrix.
public class ParquetRecordReaderConfig implements RecordReaderConfig {
- private static final String USE_PARQUET_AVRO_RECORDER_READER =
"useParquetAvroRecordReader";
- private static final String USE_PARQUET_NATIVE_RECORDER_READER =
"useParquetNativeRecordReader";
-
private boolean _useParquetAvroRecordReader;
private boolean _useParquetNativeRecordReader;
- private Configuration _conf;
-
- public ParquetRecordReaderConfig() {
- }
-
- public ParquetRecordReaderConfig(Configuration conf) {
- _conf = conf;
- _useParquetAvroRecordReader =
conf.getBoolean(USE_PARQUET_AVRO_RECORDER_READER, false);
- _useParquetNativeRecordReader =
conf.getBoolean(USE_PARQUET_NATIVE_RECORDER_READER, false);
- }
+ private boolean _extractRawTimeValues;
Review Comment:
This removes the `Configuration`-accepting constructor and `getConfig()`
accessor from a public record-reader config class. In-repo callers may not use
them anymore, but external ingestion code that still constructs
`ParquetRecordReaderConfig` from a Commons Configuration or inspects the
backing config will now fail to compile on upgrade. If this break is
intentional, please keep a compatibility shim or deprecate first rather than
deleting the API outright in one step.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]