Hi Weston,

I can also give you more background on how we use this for read (and plan
on using it for write), specifically in relation to Arrow in the Iceberg
Python client lib.

On Thu, May 20, 2021 at 7:23 AM Weston Pace <weston.p...@gmail.com> wrote:

> > #1 is a problem and we should remove the auto-generation.
>
> Sounds like we are aligned.
>
> > I hope that helps.
>
> Thanks for the extra details.  I've learned a lot and it helps to know
> how this is used.
>
> On Tue, May 18, 2021 at 2:20 PM Ryan Blue <b...@apache.org> wrote:
> >
> > Hi Weston,
> >
> > #1 is a problem and we should remove the auto-generation. The issue is
> that auto-generating an ID can result in a collision between Iceberg's
> field IDs and the generated IDs. Since Iceberg uses the ID to identify a
> field, that would result in unrelated data being mistaken for a column's
> data.
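[A minimal sketch of that collision, in Python. The ID values and the
depth-first counter's starting point are illustrative assumptions, not the
exact numbers the old Arrow code produced.]

```python
def auto_assign_ids(names, start=0):
    # Mimic the old depth-first auto-generation: every field without an ID
    # gets the next counter value, regardless of what IDs already exist
    # elsewhere in the dataset.
    return {name: start + i for i, name in enumerate(names)}

# Suppose Iceberg assigned these field IDs to the table's columns:
iceberg_ids = {"a": 0, "b": 1}

# A file from a non-Iceberg writer has unrelated columns and no IDs;
# auto-generation stamps them with 0 and 1 anyway.
generated = auto_assign_ids(["x", "y"])

# Resolving by ID now pairs unrelated columns: file column "x" is mistaken
# for table column "a", and "y" for "b".
collisions = {f: t for f, fid in generated.items()
              for t, tid in iceberg_ids.items() if fid == tid}
```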
> >
> > Your description above for #2 is a bit confusing for me. Field IDs are
> used to track fields across renames and other schema changes. Those schema
> changes don't happen in a single file. A file is written with some schema
> (which includes IDs) and later field resolution happens based on ID. I
> might have a table with fields `1: a int, 2: b string` that is later
> evolved to `1: x long, 3: b string`. Any given data file is written with
> only one version of the schema. From the IDs, you can see that field 1 was
> renamed and promoted to long, field 2 was deleted, and field 3 was added
> with field 2's original name.
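[Ryan's example can be modeled with plain dicts keyed by field ID; this is
a toy illustration, not Iceberg's actual resolution code.]

```python
# Schemas as {field_id: (name, type)} per the example:
# `1: a int, 2: b string` evolved to `1: x long, 3: b string`.
file_schema  = {1: ("a", "int"),  2: ("b", "string")}
table_schema = {1: ("x", "long"), 3: ("b", "string")}

def diff_by_id(old, new):
    # Classify schema changes purely by field ID, ignoring names and
    # positions: same ID means same field across files.
    renamed  = [i for i in old if i in new and old[i][0] != new[i][0]]
    promoted = [i for i in old if i in new and old[i][1] != new[i][1]]
    dropped  = [i for i in old if i not in new]
    added    = [i for i in new if i not in old]
    return renamed, promoted, dropped, added
```

Running `diff_by_id(file_schema, table_schema)` recovers exactly the story
in the paragraph above: field 1 renamed and promoted, field 2 dropped,
field 3 added even though it reuses field 2's old name.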
> >
> > This ID-based approach is an alternative to name-based resolution (like
> Avro uses) or position-based resolution (like CSV uses). Both of those
> resolution methods are flawed and result in correctness issues:
> > 1. Name-based resolution can't drop a column and add a new one with the
> same name
> > 2. Position-based resolution can't drop a column in the middle of the
> schema
> >
> > Only ID-based resolution gives you the expected SQL behavior for table
> evolution (ADD/DROP/RENAME COLUMN).
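[Both flaws are easy to demonstrate with a toy model; the data and column
names below are made up for illustration.]

```python
# Columns as written in an old file, with one row of data per column.
file_cols = ["a", "b", "c"]
file_data = {"a": 1, "b": "stale", "c": 3}

# Flaw 1, name-based: after DROP COLUMN b; ADD COLUMN b (a brand-new
# field), resolving by name silently resurrects the dropped column's data.
table_cols_after_drop_add = ["a", "c", "b"]
name_resolved = {c: file_data.get(c) for c in table_cols_after_drop_add}
# name_resolved["b"] is "stale", but a new column should read as null.

# Flaw 2, position-based: after DROP COLUMN b (middle of the schema),
# resolving by position shifts every later column onto the wrong data.
table_cols_after_drop = ["a", "c"]
pos_resolved = {table_cols_after_drop[i]: file_data[file_cols[i]]
                for i in range(len(table_cols_after_drop))}
# pos_resolved["c"] is "stale": column c wrongly reads b's old data.
```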
> >
> > For your original questions:
> >
> > * Filtering a table is a matter of selecting columns by ID and running
> filters by ID. In Iceberg, we bind the current names in a SQL table to the
> field IDs to do this.
> > * Filling in null values is done by identifying that a column ID is
> missing in a data file. Null values are used in its place.
> > * Casting or promoting data is done by strict rules in Iceberg. This is
> affected by ID because we know that a field is the same across files, like
> in my example above.
> > * For combining fields, it sounds like you're thinking about operations
> on the data and when to carry IDs through an operation. I wouldn't
> recommend ever carrying IDs through. In Spark, we use the current schema's
> names to produce rows. SQL always uses the current names. And when we write
> back out to a table, we use SQL semantics, which are to align by position.
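[The first two read-side bullets can be sketched as follows; this is an
illustrative model with made-up names and IDs, not Iceberg's API.]

```python
# Bind current SQL names to field IDs, then project a data file by ID,
# filling nulls for any ID the file does not contain.
table_schema = {"x": 1, "b": 3}   # current name -> field ID
file_columns = {1: [10, 20]}      # this file predates field 3

def project(names, name_to_id, file_columns, n_rows):
    # Select columns by field ID; a missing ID yields an all-null column.
    return {name: file_columns.get(name_to_id[name], [None] * n_rows)
            for name in names}
```

Projecting `["x", "b"]` over this file returns the data for field 1 under
its current name `x` and an all-null column for `b`, with no reference to
the names the file was written with.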
> >
> > I hope that helps. If it's not clear, I'm happy to jump on a call to
> talk through it with you.
> >
> > Ryan
> >
> > On Tue, May 18, 2021 at 1:48 PM Weston Pace <weston.p...@gmail.com>
> wrote:
> >>
> >> Ok, this is matching my understanding of how field_id is used as well.
> >> I believe #1 will not be an issue because I think Iceberg always sets
> >> the field_id property when writing data?  If that is the case then
> >> Iceberg would never have noticed the old behavior.  In other words,
> >> Iceberg never relied on Arrow to set the field_id.
> >>
> >> For #2 I think your example is helpful.  The `field_id` is sort of a
> >> file-specific concept.  Once you are at the dataset layer the Iceberg
> >> schema takes precedence and the field_id is no longer necessary.
> >>
> >> Also, thinking about it more generally, metadata is really part of the
> >> schema / control channel.  The compute operations in Arrow are more
> >> involved with the data channel.  "Combining metadata" might be a
> >> concern of tools that "combine schema" (e.g. dataset evolution) but
> >> isn't a concern of tools that combine data (e.g. Arrow compute).  So
> >> in that sense the compute operations probably don't need to worry much
> >> about preserving schema.
> >>
> >> This has been helpful to hear how this is used.  I needed a concrete
> >> example to bounce the idea around in my head.
> >>
> >> Thanks,
> >>
> >> -Weston
> >>
> >> On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <dwe...@apache.org> wrote:
> >> >
> >> > Hey Weston,
> >> >
> >> > From Iceberg's perspective, the field_id is necessary to track
> the evolution of the schema over time.  It's best to think of the problem
> from a dataset perspective as opposed to a file perspective.
> >> >
> >> > Iceberg maintains the mapping of the schema with respect to the field
> ids because as the files in the datasets change, the field names may
> change, but field id is intended to be persistent and referenceable
> regardless of name or position within the file.
> >> >
> >> > For #1 above, I'm not sure I understand the issue of having the field
> ids auto-generated.  If you're not using the field ids to reference the
> columns, does it matter if they are present or not?
> >> >
> >> > For #2, I would speculate that the field id is less relevant after
> the initial projection and filtering (it really depends on how the engine
> wants to track fields at that point, so I would suspect that maybe field id
> wouldn't be ideal especially after various transforms or aggregations are
> applied).  However, it does matter when persisting the data as the field
> ids need to be resolved to the target dataset.  If it's a new dataset, new
> field ids can be generated using the original approach.  However, if the
> data is being appended to an existing dataset, the field ids need to be
> resolved against that target dataset and rewritten before persisting to
> parquet so they align with the Iceberg schema (in SQL this is done
> positionally).
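[The append case Dan describes might be sketched like this; the helper and
schema shapes are hypothetical, not a real Iceberg API.]

```python
# When appending to an existing dataset, align the incoming columns to the
# target schema positionally (SQL semantics) and stamp the target's field
# IDs onto them before the file is persisted to Parquet.
target_schema = [("x", 1), ("b", 3)]   # target dataset: (name, field ID)
incoming_cols = ["col_a", "col_b"]     # engine output names, no IDs yet

def align_positionally(incoming, target):
    if len(incoming) != len(target):
        raise ValueError("column count mismatch")
    # The i-th incoming column takes the i-th target field's ID.
    return {name: fid for name, (_, fid) in zip(incoming, target)}
```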
> >> >
> >> > Let me know if any of that doesn't make sense.  I'm still a little
> unclear on the issue in #1, so it would be helpful if you could clarify
> that for me.
> >> >
> >> > Thanks,
> >> > Dan
> >> >
> >> > On Mon, May 17, 2021 at 8:50 PM Weston Pace <weston.p...@gmail.com>
> wrote:
> >> >>
> >> >> Hello Iceberg devs,
> >> >>
> >> >> I'm Weston, I've been working on the Arrow project lately and I am
> >> >> reviewing how we handle the parquet field_id (and also adding support
> >> >> for specifying a field_id at write time) from parquet[1][2].   This
> >> >> has brought up two questions.
> >> >>
> >> >>  1. The original PR adding field_id support[3][4] not only allowed
> the
> >> >> field_id to pass through from parquet to arrow but also generated ids
> >> >> (in a depth first fashion) for fields that did not have a field_id.
> >> >> In retrospect, it seems this auto-generation of field_id was probably
> >> >> not a good idea.  Would it have any impact on Iceberg if we removed
> >> >> it?  Just to be clear, we will still have support for reading (and
> >> >> now writing) the parquet field_id.  I am only talking about removing
> >> >> the auto-generation of missing values.
> >> >>
> >> >>  2. For the second question I'm looking for the Iceberg community's
> >> >> opinion as users of Arrow.  Arrow is enabling more support for
> >> >> computation on data (e.g. relational operators) and I've been
> >> >> wondering how those transformations should affect metadata (like the
> >> >> field_id).  For some examples:
> >> >>
> >> >>  * Filtering a table by column (it seems the field_id/metadata should
> >> >> remain unchanged)
> >> >>  * Filtering a table by rows (it seems the field_id/metadata should
> >> >> remain unchanged)
> >> >>  * Filling in null values with a placeholder value (the data is
> changed so ???)
> >> >>  * Casting a field to a different data type (the meaning of the data
> >> >> has changed so ???)
> >> >>  * Combining two fields into a third field (it seems the
> >> >> field_id/metadata should be erased in the third field but presumably
> >> >> it could also be the joined metadata from the two origin fields)
> >> >>
> >> >> Thanks for your time,
> >> >>
> >> >> -Weston Pace
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/PARQUET-1798
> >> >> [2] https://github.com/apache/arrow/pull/10289
> >> >> [3] https://issues.apache.org/jira/browse/ARROW-7080
> >> >> [4] https://github.com/apache/arrow/pull/6408
> >
> >
> >
> > --
> > Ryan Blue
>