hcrosse opened a new issue, #722:
URL: https://github.com/apache/arrow-go/issues/722

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   arrow-go writes `REPEATED` as the `repetition_type` for the root 
`SchemaElement` in the Parquet Thrift footer. I think this is non-standard and 
it's caused some interoperability failures for me.
   
   The default `rootRepetition` in 
[`WriterProperties`](https://github.com/apache/arrow-go/blob/main/parquet/writer_properties.go#L519)
 is `Repetitions.Repeated`. While `WithRootRepetition` exists as an opt-in 
override, the default itself is non-standard, and consumers of arrow-go (like 
[apache/iceberg-go](https://github.com/apache/iceberg-go)) inherit this default 
and may not expose ways to modify it.
   
   Per the [Parquet format 
spec](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L516-L518):
   
   > "The root of the schema does not have a repetition_type. All other nodes 
must have one."
   
   The `repetition_type` field on `SchemaElement` is `optional` in the Thrift 
definition specifically because the root should not carry one. Among the 
Parquet implementations I checked, arrow-go is the only one that writes 
`REPEATED` into the Thrift footer for the root element:
   
   | Implementation | In-memory | On disk (Thrift footer) | Source |
   |---|---|---|---|
   | **Parquet spec** | N/A | Not set | [parquet.thrift#L516-L518](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L516-L518) |
   | **parquet-java** | `REPEATED` | **Not set** (stripped during serialization) | [MessageType.java#L36](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/MessageType.java#L36), [ParquetMetadataConverter.java#L323-L329](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L323-L329) |
   | **Arrow C++ / pyarrow** | `REQUIRED` | `REQUIRED` | [schema.cc#L1228](https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc#L1228) |
   | **arrow-rs (Rust)** | `None` | **Not set** | [types.rs#L45-L46](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L45-L46), [types.rs#L590-L591](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L590-L591) |
   | **arrow-go** | `REPEATED` | **`REPEATED`** | [writer_properties.go#L519](https://github.com/apache/arrow-go/blob/main/parquet/writer_properties.go#L519) |
   
   For added context, arrow-rs explicitly tolerates and strips root repetition 
when reading files from other implementations 
([types.rs#L1383-L1396](https://github.com/apache/arrow-rs/blob/main/parquet/src/schema/types.rs#L1383-L1396)).
   
   In my specific example, Snowflake rejects Parquet files written with 
`REPEATED` root repetition when they contain list columns:
   
   > "List encoding is not supported. List encoding: '0'"
   
   I was able to reproduce this consistently: any iceberg-go table with list 
columns fails to load in Snowflake when the root schema element has `REPEATED` 
repetition.
   
   A couple of possible fixes:
   
   1. Don't serialize `repetition_type` for the root `SchemaElement` at all, 
matching parquet-java and arrow-rs behavior and the Parquet spec exactly. The 
`WithRootRepetition` option and in-memory representation would be unaffected.
   2. Change the default to `Repetitions.Required`, matching Arrow C++/pyarrow. 
Less spec-pure but a smaller change.
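
To make option 1 concrete, here is a minimal, self-contained sketch of a serializer that leaves `repetition_type` unset for the root element only. It does not use arrow-go's actual types: `Node`, `SchemaElement`, and `Flatten` are hypothetical names invented for illustration, and a nil pointer stands in for an unset optional Thrift field. The real change would presumably live wherever arrow-go converts its schema tree into Thrift `SchemaElement`s.

```go
package main

import "fmt"

// Repetition mirrors the values of the Parquet FieldRepetitionType enum.
type Repetition int32

const (
	Required Repetition = 0
	Optional Repetition = 1
	Repeated Repetition = 2
)

// Node is a hypothetical in-memory schema node (not arrow-go's schema.Node).
// The root may still carry a repetition in memory, as it does today.
type Node struct {
	Name       string
	Repetition Repetition
	Children   []*Node
}

// SchemaElement is a hypothetical stand-in for the Thrift struct.
// A nil RepetitionType models an optional field that is not set.
type SchemaElement struct {
	Name           string
	RepetitionType *Repetition
	NumChildren    int32
}

// Flatten walks the tree depth-first and emits one element per node,
// leaving repetition_type unset for the root and set for every other node.
func Flatten(root *Node) []SchemaElement {
	var out []SchemaElement
	var walk func(n *Node, isRoot bool)
	walk = func(n *Node, isRoot bool) {
		el := SchemaElement{Name: n.Name, NumChildren: int32(len(n.Children))}
		if !isRoot {
			rep := n.Repetition // copy so each element owns its value
			el.RepetitionType = &rep
		}
		out = append(out, el)
		for _, c := range n.Children {
			walk(c, false)
		}
	}
	walk(root, true)
	return out
}

func main() {
	root := &Node{Name: "schema", Repetition: Repeated, Children: []*Node{
		{Name: "id", Repetition: Required},
		{Name: "tags", Repetition: Optional},
	}}
	for _, el := range Flatten(root) {
		if el.RepetitionType == nil {
			fmt.Printf("%s: repetition_type not set\n", el.Name)
		} else {
			fmt.Printf("%s: repetition_type=%d\n", el.Name, *el.RepetitionType)
		}
	}
}
```

Because the in-memory `Node` keeps its repetition, a setting such as `WithRootRepetition` could stay in the API under this approach; only the bytes written to the footer would change.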
   
   Either way, the existing `WithRootRepetition` API would remain available for 
anyone who needs to override the behavior. I'm happy to submit a PR for 
whichever approach is preferred if either of these sounds good.
   
   ### Component(s)
   
   Parquet


