xiangfu0 commented on code in PR #18325:
URL: https://github.com/apache/pinot/pull/18325#discussion_r3142703035
##########
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java:
##########
@@ -18,25 +18,174 @@
*/
package org.apache.pinot.plugin.inputformat.parquet;
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.List;
import java.util.Set;
import javax.annotation.Nullable;
import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
import org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor;
import org.apache.pinot.spi.data.readers.RecordExtractorConfig;
+/**
+ * Extracts Pinot rows from Avro {@link GenericRecord}s materialized by
parquet-avro.
+ *
+ * <p>Parquet's standard 3-level LIST encoding is
+ * {@code <list-rep> group <name> (LIST) { repeated group list { <elem>
element; } }}. parquet-avro tries to
+ * strip that wrapper while building the Avro records, but it leaves the
wrapper in place when the file has no
+ * Avro schema in its metadata and parquet-avro has to reconstruct the Avro
schema from the Parquet schema. When
+ * that happens a Parquet field declared as {@code LIST<T>} surfaces as Avro
{@code array<record<element: T>>},
+ * which is the bug fixed here (issue #17420).
+ *
+ * <p>The Avro shape alone is ambiguous — a user-defined Avro {@code
array<record<UserTag, [element: string]>>}
+ * matches the same shape and must NOT be flattened. Two pieces of information
together resolve the ambiguity:
Review Comment:
Good pointer — `AvroUtils.extractSupportedSchema()` recursively unwraps a
single-entry RECORD to its inner field's schema (with a
`Preconditions.checkState(recordFields.size() == 1)`). That's the
schema-inference equivalent of stripping the parquet-avro LIST wrapper at
runtime.
In the latest revision
([`b5d171c`](https://github.com/apache/pinot/pull/18325/commits/b5d171cf3b))
the runtime stripping is moved into parquet-avro itself by setting
`parquet.avro.add-list-element-records=false` in the Hadoop config (matches
Apache Arrow's Parquet behavior). With that flag, parquet-avro flattens the
standard 3-level LIST encoding to Avro `array<elem-type>` directly, so the
extractor never sees a wrapper record at all — `transformValue` reduces to
plain `convert(value)` plus the existing INT96 promotion, no record-handling
logic needed. The end-to-end result is consistent with
`extractSupportedSchema`'s single-entry assumption (and with the Avro reader's
behavior for the same logical input).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]