Re: Parquet Column Resolution by ID

Jorge Cardoso Leitão Thu, 10 Feb 2022 21:05:04 -0800

Hi,

Thanks for the write-up!

Two questions:

* AFAIK most implementations identify which columns belong to a (nested)
field via the path in the schema (i.e. given field "a", give me all the
columns that are part of that field, e.g. "a.b.c", "a.d", etc.). How would
that work with field ids?

* The change

> With the support of column id resolution, the column ids must be unique
in the entire Parquet schema in order to identify a column correctly. In
the write path, an Exception will be thrown if the ids are not unique

Is backward incompatible? Could it make sense to rephrase it as:

* Writers MAY write a unique column id per field in order to identify a
column irrespective of its name (e.g. across column renames).
* If a reader identifies that a parquet file has unique column ids, it MAY
use column ids to identify columns (ignoring the column name).

This may be backward compatible and would make it an opt-in feature.
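
To make the two points above concrete, here is a minimal sketch in Python. It is not taken from the design doc or from any Parquet implementation: `Leaf`, `columns_of_field`, `ids_are_unique` and `resolve` are hypothetical names, and attaching an id only to leaf columns is a simplifying assumption. It shows path-prefix lookup for a nested field, and a reader that uses ids only when the file has unique ids and otherwise falls back to names.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Leaf(NamedTuple):
    """Hypothetical stand-in for one leaf column of a Parquet schema."""
    path: tuple               # path in the schema, e.g. ("a", "b", "c")
    field_id: Optional[int]   # None when the writer did not set an id


def columns_of_field(leaves, field_path):
    """Path-based lookup: every leaf whose path starts with field_path."""
    prefix = tuple(field_path)
    return [leaf for leaf in leaves if leaf.path[:len(prefix)] == prefix]


def ids_are_unique(leaves):
    """True only if every leaf carries an id and no id repeats."""
    ids = [leaf.field_id for leaf in leaves]
    if any(i is None for i in ids):
        return False
    return all(count == 1 for count in Counter(ids).values())


def resolve(leaves, name_path, field_id=None):
    """Opt-in rule: use the id if the file has unique ids, else use the path."""
    if field_id is not None and ids_are_unique(leaves):
        return [leaf for leaf in leaves if leaf.field_id == field_id]
    return columns_of_field(leaves, name_path)
```

With this shape, a file whose ids are missing or duplicated is still readable by name, which is the backward-compatible behaviour the rephrasing is after. The sketch leaves open how an id on a nested (group) field would map to its child columns.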

Best,
Jorge

On Fri, Feb 11, 2022 at 5:01 AM huaxin gao <huaxin.ga...@gmail.com> wrote:

> Hi Parquet community,
>
> Xinli and I drafted a design doc to support ID based column resolution in
> Parquet. Here is the link:
> https://docs.google.com/document/d/1hDLFIKuVhhnTNpA5bTo4nfD-MUZz8Iq4V9FXrr1WPsw/edit?usp=sharing
> We'd like to start a discussion on the doc and any feedback is welcome!
>
> Thanks,
> Huaxin
>
