Re: [PR] Fix Parquet LIST/MAP wrapper extraction in native and Avro readers [pinot]

via GitHub Sat, 25 Apr 2026 16:08:18 -0700


xiangfu0 commented on code in PR #18325:
URL: https://github.com/apache/pinot/pull/18325#discussion_r3142703035



##########
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetAvroRecordExtractor.java:
##########
@@ -18,25 +18,174 @@
  */
 package org.apache.pinot.plugin.inputformat.parquet;
 
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.List;
 import java.util.Set;
 import javax.annotation.Nullable;
 import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
 import org.apache.pinot.plugin.inputformat.avro.AvroRecordExtractor;
 import org.apache.pinot.spi.data.readers.RecordExtractorConfig;
 
 
+/**
+ * Extracts Pinot rows from Avro {@link GenericRecord}s materialized by 
parquet-avro.
+ *
+ * <p>Parquet's standard 3-level LIST encoding is
+ * {@code <list-rep> group <name> (LIST) { repeated group list { <elem> 
element; } }}. parquet-avro tries to
+ * strip that wrapper while building the Avro records, but it leaves the 
wrapper in place when the file has no
+ * Avro schema in its metadata and parquet-avro has to reconstruct the Avro 
schema from the Parquet schema. When
+ * that happens a Parquet field declared as {@code LIST<T>} surfaces as Avro 
{@code array<record<element: T>>},
+ * which is the bug fixed here (issue #17420).
+ *
+ * <p>The Avro shape alone is ambiguous — a user-defined Avro {@code 
array<record<UserTag, [element: string]>>}
+ * matches the same shape and must NOT be flattened. Two pieces of information 
together resolve the ambiguity:

Review Comment:
   Good pointer — `AvroUtils.extractSupportedSchema()` recursively unwraps a 
single-entry RECORD to its inner field's schema (with a 
`Preconditions.checkState(recordFields.size() == 1)`). That's the 
schema-inference equivalent of stripping the parquet-avro LIST wrapper at 
runtime.
   
   In the latest revision 
([`b5d171c`](https://github.com/apache/pinot/pull/18325/commits/b5d171cf3b)) 
the runtime stripping is moved into parquet-avro itself by setting 
`parquet.avro.add-list-element-records=false` in the Hadoop config (matches 
Apache Arrow's Parquet behavior). With that flag, parquet-avro flattens the 
standard 3-level LIST encoding to Avro `array<elem-type>` directly, so the 
extractor never sees a wrapper record at all — `transformValue` reduces to 
plain `convert(value)` plus the existing INT96 promotion, no record-handling 
logic needed. The end-to-end result is consistent with 
`extractSupportedSchema`'s single-entry assumption (and with the Avro reader's 
behavior for the same logical input).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix Parquet LIST/MAP wrapper extraction in native and Avro readers [pinot]

Reply via email to