Hi Weston,

#1 is a problem and we should remove the auto-generation. The issue is that auto-generating an ID can result in a collision between Iceberg's field IDs and the generated IDs. Since Iceberg uses the ID to identify a field, a collision would result in unrelated data being mistaken for a column's data.
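As an illustration of the collision risk (the names and IDs below are made up, and the generation scheme is a simplification of what Arrow actually did):

```python
# Illustrative only: why auto-generated field IDs can collide with IDs an
# engine like Iceberg has already assigned. Names and numbers are made up.

iceberg_ids = {"a": 1, "b": 2}     # IDs assigned by the Iceberg table schema
unlabeled_fields = ["c", "d"]      # fields written without explicit IDs

# Old behavior (simplified): number fields without an ID 0, 1, 2, ...
generated_ids = {name: i for i, name in enumerate(unlabeled_fields)}

assert generated_ids["d"] == 1 == iceberg_ids["a"]
# "d" now shares ID 1 with column "a": an ID-based reader would resolve
# "d"'s data as if it belonged to column "a".
```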
Your description above for #2 is a bit confusing for me. Field IDs are used to track fields across renames and other schema changes. Those schema changes don't happen in a single file. A file is written with some schema (which includes IDs) and later field resolution happens based on ID. I might have a table with fields `1: a int, 2: b string` that is later evolved to `1: x long, 3: b string`. Any given data file is written with only one version of the schema. From the IDs, you can see that field 1 was renamed and promoted to long, field 2 was deleted, and field 3 was added with field 2's original name.

This ID-based approach is an alternative to name-based resolution (like Avro uses) or position-based resolution (like CSV uses). Both of those resolution methods are flawed and result in correctness issues:

1. Name-based resolution can't drop a column and add a new one with the same name
2. Position-based resolution can't drop a column in the middle of the schema

Only ID-based resolution gives you the expected SQL behavior for table evolution (ADD/DROP/RENAME COLUMN).
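To make that concrete, here is a minimal sketch of ID-based resolution for the example above. The dict-based schemas and the `resolve` helper are illustrative only, not Iceberg's actual classes:

```python
# Schema the data file was written with, keyed by field ID.
file_schema  = {1: ("a", "int"),  2: ("b", "string")}
# Current table schema after the evolution described above.
table_schema = {1: ("x", "long"), 3: ("b", "string")}

def resolve(file_schema, table_schema):
    plan = {}
    for field_id, (name, type_) in table_schema.items():
        if field_id not in file_schema:
            # ID absent from the file: the column was added after the file
            # was written, so it reads as nulls.
            plan[name] = "fill with nulls"
            continue
        src_name, src_type = file_schema[field_id]
        if src_type != type_:
            # Same ID, wider type: safe promotion, e.g. int -> long.
            plan[name] = f"read '{src_name}', promote {src_type} -> {type_}"
        else:
            plan[name] = f"read '{src_name}'"
    return plan

print(resolve(file_schema, table_schema))
# {'x': "read 'a', promote int -> long", 'b': 'fill with nulls'}
```

Note that the file's field 2 (also named `b`) is ignored entirely: it was dropped from the table, even though its name matches the new field 3. Name-based resolution would have matched them incorrectly.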
For your original questions:

* Filtering a table is a matter of selecting columns by ID and running filters by ID. In Iceberg, we bind the current names in a SQL table to the field IDs to do this.
* Filling in null values is done by identifying that a column ID is missing in a data file. Null values are used in its place.
* Casting or promoting data is done by strict rules in Iceberg. This is affected by ID because we know that a field is the same across files, like in my example above.
* For combining fields, it sounds like you're thinking about operations on the data and when to carry IDs through an operation. I wouldn't recommend ever carrying IDs through. In Spark, we use the current schema's names to produce rows. SQL always uses the current names. And when we write back out to a table, we use SQL semantics, which are to align by position.
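On the Arrow side, explicit IDs can be attached through field-level metadata instead of relying on auto-generation. A rough sketch, assuming the write-time support you describe lands as proposed (`PARQUET:field_id` is the metadata key Arrow already uses on the read path; the IDs are the ones from my example):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Attach explicit (here, Iceberg-assigned) field IDs via field metadata.
# Whether the IDs round-trip on write depends on the support discussed
# in this thread.
schema = pa.schema([
    pa.field("x", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
    pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"3"}),
])
table = pa.table({"x": [1, 2], "b": ["u", "v"]}, schema=schema)
pq.write_table(table, "data.parquet")
```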
I hope that helps. If it's not clear, I'm happy to jump on a call to talk through it with you.

Ryan

On Tue, May 18, 2021 at 1:48 PM Weston Pace <[email protected]> wrote:

> Ok, this is matching my understanding of how field_id is used as well.
> I believe #1 will not be an issue because I think Iceberg always sets
> the field_id property when writing data? If that is the case then
> Iceberg would never have noticed the old behavior. In other words,
> Iceberg never relied on Arrow to set the field_id.
>
> For #2 I think your example is helpful. The `field_id` is sort of a
> file-specific concept. Once you are at the dataset layer, the Iceberg
> schema takes precedence and the field_id is no longer necessary.
>
> Also, thinking about it more generally, metadata is really part of the
> schema / control channel. The compute operations in Arrow are more
> involved with the data channel. "Combining metadata" might be a
> concern of tools that "combine schema" (e.g. dataset evolution) but
> isn't a concern of tools that combine data (e.g. Arrow compute). So
> in that sense the compute operations probably don't need to worry much
> about preserving schema.
>
> This has been helpful to hear how this is used. I needed a concrete
> example to bounce the idea around in my head with.
>
> Thanks,
>
> -Weston
>
> On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <[email protected]> wrote:
> >
> > Hey Weston,
> >
> > From Iceberg's perspective, the field_id is necessary to track the
> > evolution of the schema over time. It's best to think of the problem
> > from a dataset perspective as opposed to a file perspective.
> >
> > Iceberg maintains the mapping of the schema with respect to the field
> > ids because, as the files in the dataset change, the field names may
> > change, but the field id is intended to be persistent and
> > referenceable regardless of name or position within the file.
> >
> > For #1 above, I'm not sure I understand the issue of having the field
> > ids auto-generated. If you're not using the field ids to reference
> > the columns, does it matter if they are present or not?
> >
> > For #2, I would speculate that the field id is less relevant after
> > the initial projection and filtering (it really depends on how the
> > engine wants to track fields at that point, so I would suspect that
> > maybe field id wouldn't be ideal, especially after various transforms
> > or aggregations are applied). However, it does matter when persisting
> > the data, as the field ids need to be resolved to the target dataset.
> > If it's a new dataset, new field ids can be generated using the
> > original approach. However, if the data is being appended to an
> > existing dataset, the field ids need to be resolved against that
> > target dataset and rewritten before persisting to parquet so they
> > align with the Iceberg schema (in SQL this is done positionally).
> >
> > Let me know if any of that doesn't make sense. I'm still a little
> > unclear on the issue in #1, so it would be helpful if you could
> > clarify that for me.
> >
> > Thanks,
> > Dan
> >
> > On Mon, May 17, 2021 at 8:50 PM Weston Pace <[email protected]>
> > wrote:
> >>
> >> Hello Iceberg devs,
> >>
> >> I'm Weston. I've been working on the Arrow project lately and I am
> >> reviewing how we handle the parquet field_id (and also adding
> >> support for specifying a field_id at write time)[1][2]. This has
> >> brought up two questions.
> >>
> >> 1. The original PR adding field_id support[3][4] not only allowed
> >> the field_id to pass through from parquet to arrow but also
> >> generated ids (in a depth-first fashion) for fields that did not
> >> have a field_id. In retrospect, it seems this auto-generation of
> >> field_id was probably not a good idea. Would it have any impact on
> >> Iceberg if we removed it? Just to be clear, we will still have
> >> support for reading (and now writing) the parquet field_id. I am
> >> only talking about removing the auto-generation of missing values.
> >>
> >> 2. For the second question I'm looking for the Iceberg community's
> >> opinion as users of Arrow. Arrow is enabling more support for
> >> computation on data (e.g. relational operators) and I've been
> >> wondering how those transformations should affect metadata (like
> >> the field_id). For some examples:
> >>
> >> * Filtering a table by column (it seems the field_id/metadata
> >> should remain unchanged)
> >> * Filtering a table by rows (it seems the field_id/metadata should
> >> remain unchanged)
> >> * Filling in null values with a placeholder value (the data is
> >> changed so ???)
> >> * Casting a field to a different data type (the meaning of the data
> >> has changed so ???)
> >> * Combining two fields into a third field (it seems the
> >> field_id/metadata should be erased in the third field but
> >> presumably it could also be the joined metadata from the two origin
> >> fields)
> >>
> >> Thanks for your time,
> >>
> >> -Weston Pace
> >>
> >> [1] https://issues.apache.org/jira/browse/PARQUET-1798
> >> [2] https://github.com/apache/arrow/pull/10289
> >> [3] https://issues.apache.org/jira/browse/ARROW-7080
> >> [4] https://github.com/apache/arrow/pull/6408

--
Ryan Blue
