The projects this is relevant for include things like Iceberg, which use the Parquet field IDs. Can we reach out to those communities (dev@parquet and dev@iceberg) to solicit their feedback?

On Wed, May 12, 2021 at 2:21 PM Antoine Pitrou <[email protected]> wrote:
>
>
> Le 12/05/2021 à 21:19, Weston Pace a écrit :
> > The parquet format has a "field id" concept (unique integer identifier
> > for a column) that gets promoted in the C++ implementation to a
> > key/value pair in the field's metadata.
>
> I don't think anything says the "field id" should be unique. It's just
> an opaque application-specific identifier.
>
> Regards
>
> Antoine.
>
>
> > This has led me to a few
> > questions around how this field (or metadata in general) interacts
> > with higher level APIs.
> >
> > 1)
> >
> > At the moment it appears that metadata survives a simple scan which
> > seems correct. It also seems pretty correct that the metadata should
> > be lost on a complex transformation (e.g. projecting columns 'a' and
> > 'b' into column 'c' = a/b, c should not have any of a or b's
> > metadata?)
> >
> > That leaves a large amount of "in between". Should the metadata be
> > preserved on a cast? What about a reordering operation? What if a
> > projection leaves the data unchanged but changes the field name?
> >
> > Is there a good simple rule for this?
> >
> > 2) Do we need to account for the case where a dataset contains
> > multiple fragments where the fields are in a different order but the
> > field IDs are consistent? For example, the first fragment has columns
> > [a/str, b/int] with field ids [1, 2] and the second fragment has
> > columns [b/int, a/str] with field ids [2, 1]. Today I'm pretty sure
> > we would fail to read this dataset.
> >
> > 3) A similar question is what happens if the column types are
> > consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int,
> > b/str] with field ids [1, 2] and [2, 1]). That's probably more
> > generally tied to schema evolution and I don't think we need to do
> > anything special there.
> >
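
To make question 2 concrete, here is a minimal sketch of what matching
columns by field ID rather than by position might look like. It is plain
Python, not the Arrow dataset API: the function name align_to_field_ids and
the (field_id, name, type) tuples are hypothetical, made up for
illustration. Since nothing guarantees field IDs are unique, the sketch
assumes the safest behavior is to raise on a duplicate instead of guessing.

```python
# Hypothetical model: a fragment is a list of (field_id, name, type)
# tuples plus rows stored in that fragment's own column order.
def align_to_field_ids(target_ids, fragment_fields, rows):
    """Reorder every row so its columns follow the order of target_ids."""
    position = {}
    for index, (field_id, _name, _type) in enumerate(fragment_fields):
        if field_id in position:
            # Field IDs are application-defined and may repeat; refuse to guess.
            raise ValueError(f"duplicate field id {field_id}")
        position[field_id] = index

    missing = [fid for fid in target_ids if fid not in position]
    if missing:
        raise KeyError(f"fragment lacks field ids {missing}")

    return [tuple(row[position[fid]] for fid in target_ids) for row in rows]


# The two fragments from question 2: same field IDs, different column order.
frag1_fields = [(1, "a", "str"), (2, "b", "int")]
frag1_rows = [("x", 10), ("y", 20)]
frag2_fields = [(2, "b", "int"), (1, "a", "str")]
frag2_rows = [(30, "z")]

target = [1, 2]  # desired output order, expressed as field IDs
combined = (align_to_field_ids(target, frag1_fields, frag1_rows)
            + align_to_field_ids(target, frag2_fields, frag2_rows))
```

With matching done by ID, both fragments come out in [a, b] order. Whether
the dataset layer should do something like this is exactly the open
question.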
