Re: [DISCUSS] Semantics of custom_metadata

Neal Richardson Wed, 11 Mar 2020 12:54:14 -0700

While the underlying storage may allow duplicate keys, it seems much more
likely that someone would end up with duplicate keys by accident than by
design. And although it may be up to the implementations to determine or
enforce uniqueness constraints, it might be a good idea to make a
project-level statement about this (a "should" rather than "must"
specification).


As for the integration test JSON format, "a list of key/value pairs like
{"key": $key, "value": $value}" does make sense, not so much because we
should support duplicated keys but because we need to preserve order, as
Wes pointed out.

Neal

On Wed, Mar 11, 2020 at 12:36 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Mar 11, 2020 at 2:22 PM Antoine Pitrou <solip...@pitrou.net>
> wrote:
> >
> > On Wed, 11 Mar 2020 12:44:26 -0500
> > Wes McKinney <wesmck...@gmail.com> wrote:
> > > On this note, in Python we should probably re-evaluate the data
> > > structure returned when accessing the "metadata" field.
> >
> > I think it's ok for the convenience API to return a dict, if we also
> > expose e.g. a "metadata_items" that returns an iterable of key/value
> > pairs.
>
> The alternative would be a dict-like object that implements some of
> the dict APIs. Might also be useful to expose KeyValueMetadata::Merge
> in Python
>
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > On Wed, Mar 11, 2020 at 12:42 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> > > >
> > > > In the C++ library at least, uniqueness is never asserted when
> reading
> > > > and writing the IPC metadata [1] [2]. If you use
> > > > KeyValueMetadata::FindKey and the keys are non-unique, it will return
> > > > the first one it finds. KeyValueMetadata::Merge assumes uniqueness,
> > > > and the KeyValueMetadata::ToUnorderedMap function will drop all but
> > > > one duplicate.
> > > >
> > > > In Parquet, the metadata is also a list of KeyValue pairs with no
> > > > qualifications [3]
> > > >
> > > > My weak preference is to leave it to applications to make assertions
> > > > about uniqueness. In either case since the metadata is ordered in the
> > > > integration tests it would make sense to serialize as a list of
> > > > key/value pairs like {"key": $key, "value": $value}
> > > >
> > > > [1]:
> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L463
> > > > [2]:
> https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L471
> > > > [3]:
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > > >
> > > > On Wed, Mar 11, 2020 at 12:11 PM Ben Kietzman <
> ben.kietz...@rstudio.com> wrote:
> > > > >
> > > > > While working on https://issues.apache.org/jira/browse/ARROW-2255
> > > > > (serialize custom_metadata in the integration tests), we had the
> following
> > > > > discussion on GitHub:
> > > > >
> https://github.com/apache/arrow/pull/6556#pullrequestreview-372405940
> > > > >
> > > > > In short, although in Schema.fbs custom_metadata is declared as an
> array of
> > > > > KeyValue pairs (so duplicate keys would be possible), all reference
> > > > > implementations assume it to represent an associative map with
> unique keys.
> > > > >
> > > > > Is there a use case for duplicate metadata keys? It seems that an
> > > > > acceptable resolution might be to note in Schema.fbs that
> implementations
> > > > > are allowed to assume that keys are unique
> > > > >
> > > > > Ben
> > >
> >
> >
> >
>

Re: [DISCUSS] Semantics of custom_metadata

Reply via email to