The projects this is relevant for include things like Iceberg, which
uses the Parquet field ids. Can we reach out to those communities
(dev@parquet and dev@iceberg) to solicit their feedback?

On Wed, May 12, 2021 at 2:21 PM Antoine Pitrou <[email protected]> wrote:
>
>
> On 12/05/2021 at 21:19, Weston Pace wrote:
> > The parquet format has a "field id" concept (unique integer identifier
> > for a column) that gets promoted in the C++ implementation to a
> > key/value pair in the field's metadata.
>
> I don't think anything says the "field id" should be unique. It's just
> an opaque application-specific identifier.
>
> Regards
>
> Antoine.
>
>
>
> > This has led me to a few
> > questions around how this field (or metadata in general) interacts
> > with higher level APIs.
> >
> > 1)
> >
> > At the moment it appears that metadata survives a simple scan which
> > seems correct.  It also seems pretty correct that the metadata should
> > be lost on a complex transformation (e.g. projecting columns 'a' and
> > 'b' into column 'c' = a/b, c should not have any of a or b's
> > metadata?)
> >
> > That leaves a large amount of "in between".  Should the metadata be
> > preserved on a cast?  What about a reordering operation?  What if a
> > projection leaves the data unchanged but changes the field name?
> >
> > Is there a good simple rule for this?
> >
> > 2) Do we need to account for the case where a dataset contains
> > multiple fragments where the fields are in a different order but the
> > field IDs are consistent?  For example, the first fragment has columns
> > [a/str, b/int] with field ids [1, 2] and the second fragment has
> > columns [b/int, a/str] with field ids [2, 1].  Today I'm pretty sure
> > we would fail to read this dataset.
> >
> > 3) A similar question is what happens if the column types are
> > consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int,
> > b/str] with field ids [1, 2] and [2, 1]).  That's probably more
> > generally tied to schema evolution and I don't think we need to do
> > anything special there.
> >
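
A minimal sketch of the by-id matching that question 2 implies, in plain
Python with no Arrow calls. The Field tuple and align_by_field_id are
hypothetical names used only for illustration, and field_id stands in for
the value the C++ implementation stores in the field's key/value metadata.
The sketch assumes field ids are unique within one schema, which, as
Antoine points out, nothing appears to guarantee.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field carrying a Parquet field id."""
    name: str
    type: str
    field_id: int


def align_by_field_id(reference: List[Field], fragment: List[Field]) -> List[int]:
    """For each reference field, return the index of the fragment column
    that has the same field id.

    Assumption: ids are unique within a schema; duplicates are rejected.
    """
    index_by_id = {}
    for i, field in enumerate(fragment):
        if field.field_id in index_by_id:
            raise ValueError(f"duplicate field id {field.field_id}")
        index_by_id[field.field_id] = i

    order = []
    for ref in reference:
        if ref.field_id not in index_by_id:
            raise KeyError(f"field id {ref.field_id} missing from fragment")
        i = index_by_id[ref.field_id]
        if fragment[i].type != ref.type:
            # Same id, different type: the ids disagree with the schema.
            raise TypeError(
                f"field id {ref.field_id}: expected {ref.type}, "
                f"found {fragment[i].type}"
            )
        order.append(i)
    return order


# Question 2: same ids, different column order.
first = [Field("a", "str", 1), Field("b", "int", 2)]
second = [Field("b", "int", 2), Field("a", "str", 1)]
order = align_by_field_id(first, second)  # [1, 0]
```

With the fragments from question 3 ([a/int, b/str] with ids [1, 2] and
[2, 1]) the same function raises TypeError, since matching by id pairs an
int column with a str column. Whether ids should ever take precedence over
names like this is exactly the open question.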
