Hi Weston,

#1 is a problem and we should remove the auto-generation. The issue is that auto-generating an ID can result in a collision between Iceberg's field IDs and the generated IDs. Since Iceberg uses the ID to identify a field, a collision would result in unrelated data being mistaken for a column's data.
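As an illustration of the collision risk (the names and IDs below are made up, and the generation scheme is a simplification of what Arrow actually did):

```python
# Illustrative only: why auto-generated field IDs can collide with IDs an
# engine like Iceberg has already assigned. Names and numbers are made up.

iceberg_ids = {"a": 1, "b": 2}     # IDs assigned by the Iceberg table schema
unlabeled_fields = ["c", "d"]      # fields written without explicit IDs

# Old behavior (simplified): number fields without an ID 0, 1, 2, ...
generated_ids = {name: i for i, name in enumerate(unlabeled_fields)}

assert generated_ids["d"] == 1 == iceberg_ids["a"]
# "d" now shares ID 1 with column "a": an ID-based reader would resolve
# "d"'s data as if it belonged to column "a".
```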
Your description above for #2 is a bit confusing for me. Field IDs are used to track fields across renames and other schema changes. Those schema changes don't happen in a single file. A file is written with some schema (which includes IDs) and later field resolution happens based on ID. I might have a table with fields `1: a int, 2: b string` that is later evolved to `1: x long, 3: b string`. Any given data file is written with only one version of the schema. From the IDs, you can see that field 1 was renamed and promoted to long, field 2 was deleted, and field 3 was added with field 2's original name.

This ID-based approach is an alternative to name-based resolution (like Avro uses) or position-based resolution (like CSV uses). Both of those resolution methods are flawed and result in correctness issues:

1. Name-based resolution can't drop a column and add a new one with the same name
2. Position-based resolution can't drop a column in the middle of the schema

Only ID-based resolution gives you the expected SQL behavior for table evolution (ADD/DROP/RENAME COLUMN).
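To make that concrete, here is a minimal sketch of ID-based resolution for the example above. The dict-based schemas and the `resolve` helper are illustrative only, not Iceberg's actual classes:

```python
# Schema the data file was written with, keyed by field ID.
file_schema  = {1: ("a", "int"),  2: ("b", "string")}
# Current table schema after the evolution described above.
table_schema = {1: ("x", "long"), 3: ("b", "string")}

def resolve(file_schema, table_schema):
    plan = {}
    for field_id, (name, type_) in table_schema.items():
        if field_id not in file_schema:
            # ID absent from the file: the column was added after the file
            # was written, so it reads as nulls.
            plan[name] = "fill with nulls"
            continue
        src_name, src_type = file_schema[field_id]
        if src_type != type_:
            # Same ID, wider type: safe promotion, e.g. int -> long.
            plan[name] = f"read '{src_name}', promote {src_type} -> {type_}"
        else:
            plan[name] = f"read '{src_name}'"
    return plan

print(resolve(file_schema, table_schema))
# {'x': "read 'a', promote int -> long", 'b': 'fill with nulls'}
```

Note that the file's field 2 (also named `b`) is ignored entirely: it was dropped from the table, even though its name matches the new field 3. Name-based resolution would have matched them incorrectly.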
For your original questions:

* Filtering a table is a matter of selecting columns by ID and running filters by ID. In Iceberg, we bind the current names in a SQL table to the field IDs to do this.
* Filling in null values is done by identifying that a column ID is missing in a data file. Null values are used in its place.
* Casting or promoting data is done by strict rules in Iceberg. This is affected by ID because we know that a field is the same across files, like in my example above.
* For combining fields, it sounds like you're thinking about operations on the data and when to carry IDs through an operation. I wouldn't recommend ever carrying IDs through. In Spark, we use the current schema's names to produce rows. SQL always uses the current names. And when we write back out to a table, we use SQL semantics, which are to align by position.
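On the Arrow side, explicit IDs can be attached through field-level metadata instead of relying on auto-generation. A rough sketch, assuming the write-time support you describe lands as proposed (`PARQUET:field_id` is the metadata key Arrow already uses on the read path; the IDs are the ones from my example):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Attach explicit (here, Iceberg-assigned) field IDs via field metadata.
# Whether the IDs round-trip on write depends on the support discussed
# in this thread.
schema = pa.schema([
    pa.field("x", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
    pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"3"}),
])
table = pa.table({"x": [1, 2], "b": ["u", "v"]}, schema=schema)
pq.write_table(table, "data.parquet")
```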
I hope that helps. If it's not clear, I'm happy to jump on a call to talk through it with you.

Ryan

On Tue, May 18, 2021 at 1:48 PM Weston Pace <[email protected]> wrote:

> Ok, this is matching my understanding of how field_id is used as well.
> I believe #1 will not be an issue because I think Iceberg always sets
> the field_id property when writing data? If that is the case then
> Iceberg would never have noticed the old behavior. In other words,
> Iceberg never relied on Arrow to set the field_id.
>
> For #2 I think your example is helpful. The `field_id` is sort of a
> file-specific concept. Once you are at the dataset layer, the Iceberg
> schema takes precedence and the field_id is no longer necessary.
>
> Also, thinking about it more generally, metadata is really part of the
> schema / control channel. The compute operations in Arrow are more
> involved with the data channel. "Combining metadata" might be a
> concern of tools that "combine schema" (e.g. dataset evolution) but
> isn't a concern of tools that combine data (e.g. Arrow compute). So
> in that sense the compute operations probably don't need to worry much
> about preserving schema.
>
> This has been helpful to hear how this is used. I needed a concrete
> example to bounce the idea around in my head with.
>
> Thanks,
>
> -Weston
>
> On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <[email protected]> wrote:
> >
> > Hey Weston,
> >
> > From Iceberg's perspective, the field_id is necessary to track the
> > evolution of the schema over time. It's best to think of the problem
> > from a dataset perspective as opposed to a file perspective.
> >
> > Iceberg maintains the mapping of the schema with respect to the field
> > ids because, as the files in the dataset change, the field names may
> > change, but the field id is intended to be persistent and
> > referenceable regardless of name or position within the file.
> >
> > For #1 above, I'm not sure I understand the issue of having the field
> > ids auto-generated. If you're not using the field ids to reference
> > the columns, does it matter if they are present or not?
> >
> > For #2, I would speculate that the field id is less relevant after
> > the initial projection and filtering (it really depends on how the
> > engine wants to track fields at that point, so I would suspect that
> > maybe field id wouldn't be ideal, especially after various transforms
> > or aggregations are applied). However, it does matter when persisting
> > the data, as the field ids need to be resolved to the target dataset.
> > If it's a new dataset, new field ids can be generated using the
> > original approach. However, if the data is being appended to an
> > existing dataset, the field ids need to be resolved against that
> > target dataset and rewritten before persisting to parquet so they
> > align with the Iceberg schema (in SQL this is done positionally).
> >
> > Let me know if any of that doesn't make sense. I'm still a little
> > unclear on the issue in #1, so it would be helpful if you could
> > clarify that for me.
> >
> > Thanks,
> > Dan
> >
> > On Mon, May 17, 2021 at 8:50 PM Weston Pace <[email protected]>
> > wrote:
> >>
> >> Hello Iceberg devs,
> >>
> >> I'm Weston. I've been working on the Arrow project lately and I am
> >> reviewing how we handle the parquet field_id (and also adding
> >> support for specifying a field_id at write time)[1][2]. This has
> >> brought up two questions.
> >>
> >> 1. The original PR adding field_id support[3][4] not only allowed
> >> the field_id to pass through from parquet to arrow but also
> >> generated ids (in a depth-first fashion) for fields that did not
> >> have a field_id. In retrospect, it seems this auto-generation of
> >> field_id was probably not a good idea. Would it have any impact on
> >> Iceberg if we removed it? Just to be clear, we will still have
> >> support for reading (and now writing) the parquet field_id. I am
> >> only talking about removing the auto-generation of missing values.
> >>
> >> 2. For the second question I'm looking for the Iceberg community's
> >> opinion as users of Arrow. Arrow is enabling more support for
> >> computation on data (e.g. relational operators) and I've been
> >> wondering how those transformations should affect metadata (like
> >> the field_id). For some examples:
> >>
> >> * Filtering a table by column (it seems the field_id/metadata
> >> should remain unchanged)
> >> * Filtering a table by rows (it seems the field_id/metadata should
> >> remain unchanged)
> >> * Filling in null values with a placeholder value (the data is
> >> changed so ???)
> >> * Casting a field to a different data type (the meaning of the data
> >> has changed so ???)
> >> * Combining two fields into a third field (it seems the
> >> field_id/metadata should be erased in the third field but
> >> presumably it could also be the joined metadata from the two origin
> >> fields)
> >>
> >> Thanks for your time,
> >>
> >> -Weston Pace
> >>
> >> [1] https://issues.apache.org/jira/browse/PARQUET-1798
> >> [2] https://github.com/apache/arrow/pull/10289
> >> [3] https://issues.apache.org/jira/browse/ARROW-7080
> >> [4] https://github.com/apache/arrow/pull/6408

--
Ryan Blue
