I realize it's a not-insignificant change, and I'm not (yet) proposing such
a change without more discussion and thought into the consequences.

But, I don't think this would actually break any protocol, so I don't want
to prematurely preclude this as a possible future direction.

My understanding is that Arrow implementations are required to preserve and
pass along all unknown metadata. So if extensions started adding/reading
extra metadata that would all just be handled transparently by any
non-extension-aware code as it is today with no changes.
In fact, as far as I can tell, there's not really anything precluding us
from doing this today in our project internally -- the only limitation is
that we can't make use of any existing library ExtensionType
implementations and just have to implement our serializer / deserializer
logic on top of the lower-level/raw APIs where we have access to the
metadata.

If we were to make such a change in the official implementation though,
from a backwards-compatibility perspective, any existing extensions would
continue using the "ARROW:extension:metadata" key, so legacy extension code
would continue to be protocol-compatible.
And from a code-migration perspective that could even continue to be passed
through as a pre-extracted string with a combined interface like:
```
  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
      std::shared_ptr<arrow::DataType> storage_type,
      const std::string& serialized_data,
      std::shared_ptr<KeyValueMetadata> metadata) const;
```
So that migrating any existing extensions would just be a matter of adding
an unused parameter to their interface when updating to whatever version of
arrow enabled this feature.

I think the only material consequence would be that new extension types
that use this feature would end up with a minimum-version of the arrow
library they are compatible with. But that's not really any different than
it is today -- no implementation is required to support any specific
extension and extensions that have an internal dependency on new features
obviously can't be used with old versions of the library.

Best,
Jeremy

On Wed, Aug 16, 2023 at 5:53 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hmm, you're right that letting the extension type peek at the entire
> metadata values would have been another solution.
>
> That said, for protocol compatibility reasons, we cannot easily change
> this anymore.
>
> Regards
>
> Antoine.
>
>
>
> Le 16/08/2023 à 17:48, Jeremy Leibs a écrit :
> > Thanks for the context, Antoine.
> >
> > However, even in those examples, I don't really see how coercing the
> > metadata to a single string makes much of a difference.
> > I believe the main difference of what I'm proposing would be that the
> > ExtensionType::Deserialize interface:
> > https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51
> >
> > Would instead look like:
> > ```
> >    arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
> >        std::shared_ptr<arrow::DataType> storage_type,
> >        std::shared_ptr<KeyValueMetadata> metadata) const;
> > ```
> >
> > In both of those cases though it seems like a
> > valid std::shared_ptr<KeyValueMetadata> is available to be passed to the
> > extension.
> >
> > I suspect the more challenging case might be related to DataType equality
> > checks? It would not be possible for generic code to know whether it can
> > validly do things like concatenate two extension arrays without knowledge
> > of which metadata keys are relevant to the extension.  That said, with
> the
> > current adhoc serialization of metadata to a string, different
> > encoder-implementations still might still produce non-comparable strings,
> > resulting in falsely reported datatype mismatches, but at least avoiding
> > the case of false positives.
> >
> > On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >>
> >> Hi Jeremy,
> >>
> >> A single key makes it easier for generic code to recreate extension
> >> types it does not know about.
> >>
> >> Here is an example in the C++ IPC layer:
> >>
> >>
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845
> >>
> >> Here is similar logic in the C++ bridge for the C Data Interface:
> >>
> >>
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029
> >>
> >> It is probably expected that many extension types will be parameter-less
> >> (such as UUID, JSON, BSON...).
> >>
> >> It does imply that extension types with sophisticated parameterization
> >> must implement a custom (de)serialization mechanism themselves. I'm not
> >> sure this tradeoff was discussed at the time, perhaps other people (Wes?
> >> Jacques?) may chime in.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>
> >> Le 16/08/2023 à 16:32, Jeremy Leibs a écrit :
> >>> Hello,
> >>>
> >>> I've recently started working with extension types as part of our
> project
> >>> and I was surprised to discover that extension types are required to
> pack
> >>> all of their own metadata into a single string value of the
> >>> "ARROW:extension:metadata" key.
> >>>
> >>> In turn this then means we have to endure arbitrary unstructured /
> >>> hard-to-validate strings with custom encodings (e.g. JSON inside
> >>> flatbuffer) when dealing with extensions.
> >>>
> >>> Can anyone provide some context on the rationale for this design
> >> decision?
> >>>
> >>> Given that we already have (1) a perfectly good metadata keyvalue store
> >>> already in place, and (2) established recommendations for
> >>> namespaced scoping of keys, why would we not just use that to store the
> >>> metadata for the extension. For example:
> >>>
> >>> "ARROW:extension:name": "myorg.myextension",
> >>> "myorg:myextension:meta1": "value1",
> >>> "myorg:myextension:meta2": "value2",
> >>>
> >>> Thanks for any insights,
> >>> Jeremy
> >>>
> >>
> >
>

Reply via email to