I reached out and got some feedback[1][2]. I've reached the conclusion that metadata belongs to schema/control, while compute operates on data. With that in mind, I would argue the compute layer can (perhaps should?) always discard metadata. If a user is performing some query like "SELECT a/b AS c FROM table" and they want the resulting column to have some kind of metadata (e.g. explaining that c is a dynamic column based on a and b), then generating that combined metadata would belong to either the user or the layer converting the query to an execution plan; it is not a responsibility of the compute layer.
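To make that split concrete, here is a minimal sketch in plain Python. It is not Arrow's API: `Column`, `compute_divide`, `plan_projection`, and the `field_id` / `derived_from` metadata keys are all hypothetical names made up for illustration. The only point is that the compute function drops input metadata unconditionally, and whatever metadata ends up on `c` is supplied by the planning layer (or the user).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a named column with key/value metadata."""
    name: str
    values: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def compute_divide(left: Column, right: Column, out_name: str) -> Column:
    """Compute layer: operates on data only and never propagates metadata."""
    values = [l / r for l, r in zip(left.values, right.values)]
    return Column(out_name, values)  # metadata deliberately left empty


def plan_projection(table: Dict[str, Column], out_name: str, lhs: str, rhs: str,
                    out_metadata: Optional[Dict[str, str]] = None) -> Column:
    """Planner layer: decides what metadata, if any, the output carries."""
    result = compute_divide(table[lhs], table[rhs], out_name)
    return Column(result.name, result.values, dict(out_metadata or {}))


# Roughly "SELECT a/b AS c FROM table", with metadata chosen by the planner.
table = {
    "a": Column("a", [6.0, 9.0], {"field_id": "1"}),
    "b": Column("b", [3.0, 3.0], {"field_id": "2"}),
}
c = plan_projection(table, "c", "a", "b", {"derived_from": "a/b"})
```

Under this arrangement the "in between" cases (cast, reorder, rename) need no special rule in the compute layer; the planner can choose to copy the input field's metadata onto the output when it knows the operation is metadata-preserving.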

[1] https://lists.apache.org/x/thread.html/r3396d802cb1b59c4f650f427f93f58290c5039995eac58f0a5459260@%3Cdev.iceberg.apache.org%3E
[2] https://lists.apache.org/x/thread.html/rb053bbc19e8a75802a9fe3efd2905df725df7cb7a76968ae81bd6903@%3Cdev.parquet.apache.org%3E

On Thu, May 13, 2021 at 5:52 AM Wes McKinney <[email protected]> wrote:
>
> Since the projects this is relevant for include things like Iceberg
> which utilize the Parquet field ids, so can we reach out to those
> communities (dev@parquet and dev@iceberg) to solicit their feedback?
>
> On Wed, May 12, 2021 at 2:21 PM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > Le 12/05/2021 à 21:19, Weston Pace a écrit :
> > > The parquet format has a "field id" concept (unique integer identifier
> > > for a column) that gets promoted in the C++ implementation to a
> > > key/value pair in the field's metadata.
> >
> > I don't think anything says the "field id" should be unique. It's just
> > an opaque application-specific identifier.
> >
> > Regards
> >
> > Antoine.
> >
> > >
> > > This has led me to a few
> > > questions around how this field (or metadata in general) interacts
> > > with higher level APIs.
> > >
> > > 1)
> > >
> > > At the moment it appears that metadata survives a simple scan which
> > > seems correct. It also seems pretty correct that the metadata should
> > > be lost on a complex transformation (e.g. projecting columns 'a' and
> > > 'b' into column 'c' = a/b, c should not have any of a or b's
> > > metadata?)
> > >
> > > That leaves a large amount of "in between". Should the metadata be
> > > preserved on a cast? What about a reordering operation? What if a
> > > projection leaves the data unchanged but changes the field name?
> > >
> > > Is there a good simple rule for this?
> > >
> > > 2) Do we need to account for the case where a dataset contains
> > > multiple fragments where the fields are in a different order but the
> > > field IDs are consistent? For example, the first fragment has columns
> > > [a/str, b/int] with field ids [1, 2] and the second fragment has
> > > columns [b/int, a/str] with field ids [2, 1]. Today I'm pretty sure
> > > we would fail to read this dataset.
> > >
> > > 3) A similar question is what happens if the column types are
> > > consistent but the field IDs are not (e.g. [a/int, b/str] and [a/int,
> > > b/str] with field ids [1, 2] and [2, 1]). That's probably more
> > > generally tied to schema evolution and I don't think we need to do
> > > anything special there.
> > >
