Re: Auto-detecting list encodings in Parquet files

Claire McGinty Wed, 15 Apr 2026 11:29:04 -0700

I think that makes sense - Arrow builds a "SchemaManifest" that
normalizes the list encoding under the hood. So would the desired
state for parquet-java be something like this
<https://github.com/clairemcginty/parquet-mr/blob/48591ebe04e42648c31d97c121440ab5e93e981b/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderListEncodings.java#L268-L306>,
where you can pass a reader schema using any of the valid list encodings
and it will work with any valid writer schema encoding? I could see that
getting complex to implement, but it would be quite nice to have as a
Parquet consumer.
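
To make the detection idea concrete, here is a minimal sketch of how the
backward-compatibility rules in LogicalTypes.md could be used to classify a
list encoding from the schema alone. It is written in Python against a
hypothetical `Field` stand-in rather than parquet-java's `MessageType`, and
`classify_list` is an illustrative name, not an existing API. The rule
ordering reflects one reading of the spec and would need checking against it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a real API)."""
    name: str
    repetition: str                   # "required", "optional", or "repeated"
    primitive: Optional[str] = None   # e.g. "int32"; None means a group
    children: List["Field"] = field(default_factory=list)
    is_list: bool = False             # True if the group is annotated LIST

    @property
    def is_group(self) -> bool:
        return self.primitive is None


def classify_list(f: Field) -> str:
    """Return "one-level", "two-level", or "three-level" for a list field."""
    if not f.is_list:
        # An unannotated repeated field is a required list of required elements.
        if f.repetition == "repeated":
            return "one-level"
        raise ValueError(f"{f.name} is not a list")

    # A LIST-annotated group must wrap exactly one repeated field.
    if len(f.children) != 1 or f.children[0].repetition != "repeated":
        raise ValueError(f"{f.name} is not a well-formed LIST group")
    repeated = f.children[0]

    # Legacy rules: in each case the repeated field itself is the element.
    if not repeated.is_group:
        return "two-level"            # repeated primitive
    if len(repeated.children) == 0:
        raise ValueError(f"{repeated.name} is an empty group")
    if len(repeated.children) > 1:
        return "two-level"            # repeated group with multiple fields
    if repeated.children[0].repetition == "repeated":
        return "two-level"            # single child that is itself repeated
    if repeated.name in ("array", f.name + "_tuple"):
        return "two-level"            # legacy writer naming conventions

    # Otherwise the single child of the repeated group is the element.
    return "three-level"


three_level = Field("my_list", "optional", is_list=True, children=[
    Field("list", "repeated", children=[Field("element", "optional", "int32")]),
])
legacy = Field("my_list", "required", is_list=True, children=[
    Field("array", "repeated", "int32"),
])
one_level = Field("my_element", "repeated", "int32")

print(classify_list(three_level), classify_list(legacy), classify_list(one_level))
```

A reader could run a check like this per list field and raise if a single
schema mixes encodings, which is the corner case raised earlier in the thread.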


Claire


On Tue, Apr 14, 2026 at 5:48 AM Gang Wu <[email protected]> wrote:

> IMO, this is a common feature not tied to avro so parquet-column looks
> better.
>
> On Tue, Apr 14, 2026 at 4:03 AM Claire McGinty <[email protected]> wrote:
>
> > Hey Gang, thanks for linking the Arrow code! That functionality would be
> > great to have in parquet-java. Would you see it living in the parquet-avro
> > reader code specifically (and therefore picked up by parquet-cli), or added
> > to the core reader functionality in parquet-column?
> >
> > - Claire
> >
> > On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:
> >
> > > Hi Claire,
> > >
> > > I agree that supporting all "legacy" list encodings is painful and it has
> > > caused trouble in the past.
> > >
> > > It seems that parquet-cli mainly depends on parquet-avro, so it also
> > > requires settings from parquet-avro to resolve list structure. Perhaps we
> > > can do something similar to what parquet-cpp currently does for list
> > > encoding resolution [1], which does not require extra information other
> > > than the MessageType.
> > >
> > > [1]
> > > https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
> > >
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I wanted to bring up the topic of Parquet's supported encodings for List
> > > > logical types
> > > > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> > > >
> > > > Having multiple valid List encodings is becoming a pain point for my org,
> > > > especially since we read and write Parquet from different engines with
> > > > different default values (for example, Ray/pyarrow
> > > > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > > > writes Parquet lists using the latest 3-level list encoding; writes from
> > > > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > > > parquet-avro encoding, which uses an older encoding; we even have a few
> > > > datasets with primitive required list types that just encode using one
> > > > level, e.g. `repeated int32 my_element`).
> > > >
> > > > Parquet-cli
> > > > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > > > also doesn't work out of the box for all these encoding types, unless you
> > > > manually specify a Configuration file specifying the encoding. Overall,
> > > > it's frustrating for our users reading these files to have to look up the
> > > > write schema, then look up the right Configuration key, then figure out how
> > > > to pass in that Configuration to parquet-cli or parquet-avro.
> > > >
> > > > So I'm wondering if there'd be any interest in:
> > > >
> > > >    - Contributing a public utility method (to parquet-common? Or maybe
> > > >    there's a better place for it) that accepts either a Parquet
> > > >    `MessageType` or a `Path` and detects which type of List encoding is
> > > >    being used. (This is probably easier said than done, but at least the
> > > >    backwards-compatibility rules
> > > >    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> > > >    are finite and clear to interpret.)
> > > >    - Integrating that utility method into parquet-cli/parquet-avro, as well
> > > >    as any other parquet formats that support Lists (i.e. magnolify-parquet
> > > >    <https://spotify.github.io/magnolify/parquet.html>).
> > > >
> > > > One potential corner case I can think of is that I guess if you're manually
> > > > specifying your Parquet schema (rather than using an established format
> > > > like parquet-avro), there's nothing preventing you from mixing and matching
> > > > list encodings. But we could just have the utility method throw an
> > > > exception in that case and force the user to specify a schema explicitly.
> > > >
> > > > Thanks, and let me know what you think,
> > > > Claire
> > > >
> > >
> >
>
