I think that makes sense - Arrow is creating this "SchemaManifest" that normalizes the list encoding under the hood. So would the desired state for parquet-java be something like this <https://github.com/clairemcginty/parquet-mr/blob/48591ebe04e42648c31d97c121440ab5e93e981b/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderListEncodings.java#L268-L306>, where you can pass a reader schema using any of the valid list encodings and it will work against any valid writer schema encoding? I could see that getting complex to implement, but it would be quite nice as a Parquet consumer.
Claire

On Tue, Apr 14, 2026 at 5:48 AM Gang Wu <[email protected]> wrote:

> IMO, this is a common feature not tied to avro so parquet-column looks
> better.
>
> On Tue, Apr 14, 2026 at 4:03 AM Claire McGinty <[email protected]>
> wrote:
>
> > Hey Gang, thanks for linking the Arrow code! That functionality would be
> > great to have in parquet-java. Would you see it living in the parquet-avro
> > reader code specifically (and therefore picked up by parquet-cli), or added
> > to the core reader functionality in parquet-column?
> >
> > - Claire
> >
> > On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi Claire,
> > >
> > > I agree that supporting all "legacy" list encodings is painful and it has
> > > caused troubles in the past.
> > >
> > > It seems that parquet-cli mainly depends on parquet-avro so it also
> > > requires settings from parquet-avro to resolve list structure. Perhaps
> > > we can do something similar to what parquet-cpp currently does for list
> > > encoding resolution [1], which does not require extra information other
> > > than the MessageType.
> > >
> > > [1]
> > > https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I wanted to bring up the topic of Parquet's supported encodings for List
> > > > logical types
> > > > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> > > >
> > > > Having multiple valid List encodings is becoming a pain point for my org,
> > > > especially since we read and write Parquet from different engines with
> > > > different default values (for example, Ray/pyarrow
> > > > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > > > writes Parquet lists using the latest 3-level list encoding; writes from
> > > > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > > > parquet-avro encoding, which uses an older encoding; we even have a few
> > > > datasets with primitive required list types that just encode using one
> > > > level, e.g. `repeated int32 my_element`).
> > > >
> > > > Parquet-cli
> > > > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > > > also doesn't work out of the box for all these encoding types, unless you
> > > > manually specify a Configuration file specifying the encoding. Overall,
> > > > it's frustrating for our users reading these files to have to look up the
> > > > write schema, then look up the right Configuration key, then figure out
> > > > how to pass in that Configuration to parquet-cli or parquet-avro.
> > > >
> > > > So I'm wondering if there'd be any interest in:
> > > >
> > > >    - Contributing a public utility method (to parquet-common? Or maybe
> > > >    there's a better place for it) that accepts either a Parquet
> > > >    `MessageType` or a `Path` and detects which type of List encoding is
> > > >    being used. (This is probably easier said than done, but at least the
> > > >    backwards-compatibility rules
> > > >    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> > > >    are finite and clear to interpret.)
> > > >    - Integrating that utility method into parquet-cli/parquet-avro, as
> > > >    well as any other parquet formats that support Lists (i.e.
> > > >    magnolify-parquet <https://spotify.github.io/magnolify/parquet.html>).
> > > >
> > > > One potential corner case I can think of is that I guess if you're
> > > > manually specifying your Parquet schema (rather than using an established
> > > > format like parquet-avro), there's nothing preventing you from mixing and
> > > > matching list encodings. But we could just have the utility method throw
> > > > an exception in that case and force the user to specify a schema
> > > > explicitly.
> > > >
> > > > Thanks, and let me know what you think,
> > > > Claire
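
For anyone skimming the thread, here is a rough sketch of what the proposed detection utility might look like. It is written in Python against a toy `Node` class purely for illustration; `Node`, `classify_list`, and `detect_encoding` are hypothetical names, not parquet-java or pyarrow APIs. It covers only a simplified subset of the backward-compatibility rules (the parquet-format spec remains the authority), only inspects top-level fields, and raises on mixed encodings as suggested in the original message.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for a Parquet schema field (hypothetical, not a real API)."""
    name: str
    repetition: str                    # "required", "optional" or "repeated"
    primitive: Optional[str] = None    # e.g. "int32"; None means a group
    annotation: Optional[str] = None   # e.g. "LIST"
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.primitive is None


def classify_list(node: Node) -> str:
    """Return "one-level", "two-level" or "three-level" for a list-like field."""
    # A bare repeated field with no LIST annotation: the field is the element.
    if node.repetition == "repeated" and node.annotation != "LIST":
        return "one-level"
    if node.annotation != "LIST" or len(node.children) != 1:
        raise ValueError(f"{node.name}: not a recognizable list")
    repeated = node.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"{node.name}: LIST child must be repeated")
    # Repeated primitive: the repeated field itself is the element.
    if not repeated.is_group:
        return "two-level"
    # Repeated group with several fields: the group itself is the element.
    if len(repeated.children) != 1:
        return "two-level"
    # Legacy naming conventions: `array` or `<list name>_tuple`.
    if repeated.name in ("array", node.name + "_tuple"):
        return "two-level"
    # Otherwise the repeated group's single child is the element.
    return "three-level"


def detect_encoding(fields: List[Node]) -> Optional[str]:
    """Return the single list encoding used by the given top-level fields."""
    found = {
        classify_list(f)
        for f in fields
        if f.annotation == "LIST" or f.repetition == "repeated"
    }
    if len(found) > 1:
        # Mixed encodings: make the caller specify a schema explicitly.
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return found.pop() if found else None
```

A real implementation would presumably walk nested groups and maps as well, and work on `MessageType` directly rather than on a stand-in class.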
