Thanks for the context, Antoine.

However, even in those examples, I don't really see how coercing the
metadata to a single string makes much of a difference.
I believe the main difference in what I'm proposing is the signature of the
ExtensionType::Deserialize interface:
https://github.com/apache/arrow/blob/main/r/src/extension.h#L49-L51

It would instead look like:
```
  arrow::Result<std::shared_ptr<arrow::DataType>> Deserialize(
      std::shared_ptr<arrow::DataType> storage_type,
      std::shared_ptr<KeyValueMetadata> metadata) const;
```

In both of those cases though it seems like a
valid std::shared_ptr<KeyValueMetadata> is available to be passed to the
extension.
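
To make that concrete, here is a rough sketch of how generic code could
gather the keys belonging to one extension before handing them over. It is
only an illustration: `Metadata` is a plain std::map standing in for
KeyValueMetadata, `SelectByPrefix` is a hypothetical helper rather than an
existing Arrow function, and the "myorg:myextension:" prefix is borrowed from
the example in my original message.

```cpp
#include <map>
#include <string>

// Stand-in for KeyValueMetadata so the sketch stays self-contained.
using Metadata = std::map<std::string, std::string>;

// Hypothetical helper: copy only the entries whose key starts with `prefix`.
Metadata SelectByPrefix(const Metadata& all, const std::string& prefix) {
  Metadata selected;
  for (const auto& kv : all) {
    // compare(0, n, prefix) == 0 is true when the first n characters match.
    if (kv.first.compare(0, prefix.size(), prefix) == 0) {
      selected.insert(kv);
    }
  }
  return selected;
}
```

The caller still has to know the prefix, presumably derived from the value
of "ARROW:extension:name"; how that mapping would be specified is left open
here.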

I suspect the more challenging case might be DataType equality checks.
Generic code could not know whether it can validly do things like
concatenate two extension arrays without knowing which metadata keys are
relevant to the extension. That said, with the current ad hoc serialization
of metadata to a string, different encoder implementations might still
produce non-comparable strings, resulting in falsely reported datatype
mismatches, though at least that avoids falsely reported matches.
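
A small sketch of the two comparison strategies, under the same assumptions
as before: `Metadata` is a std::map standing in for KeyValueMetadata, and
`SameSerialized` and `SameUnderPrefix` are hypothetical names, not Arrow
APIs. The first compares opaque serialized strings, as generic code must do
today; the second compares only the keys under a known prefix.

```cpp
#include <map>
#include <string>

// Stand-in for KeyValueMetadata so the sketch stays self-contained.
using Metadata = std::map<std::string, std::string>;

// Single-string scheme: generic code can only compare the opaque bytes.
bool SameSerialized(const std::string& a, const std::string& b) {
  return a == b;
}

// Key-value scheme: compare only the entries whose key starts with `prefix`.
// This is only meaningful if the caller knows which prefix is relevant.
bool SameUnderPrefix(const Metadata& a, const Metadata& b,
                     const std::string& prefix) {
  auto in_scope = [&prefix](const std::string& key) {
    return key.compare(0, prefix.size(), prefix) == 0;
  };
  // Every in-scope entry of `a` must be present in `b` with the same value.
  for (const auto& kv : a) {
    if (!in_scope(kv.first)) continue;
    auto it = b.find(kv.first);
    if (it == b.end() || it->second != kv.second) return false;
  }
  // `b` must not carry in-scope keys that `a` lacks.
  for (const auto& kv : b) {
    if (in_scope(kv.first) && a.count(kv.first) == 0) return false;
  }
  return true;
}
```

Two JSON encodings of the same parameters with the keys in a different order
would fail `SameSerialized`, while two key-value sets that differ only in an
unrelated key would pass `SameUnderPrefix`, but only when given the right
prefix.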

On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi Jeremy,
>
> A single key makes it easier for generic code to recreate extension
> types it does not know about.
>
> Here is an example in the C++ IPC layer:
>
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845
>
> Here is similar logic in the C++ bridge for the C Data Interface:
>
> https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/c/bridge.cc#L1021-L1029
>
> It is probably expected that many extension types will be parameter-less
> (such as UUID, JSON, BSON...).
>
> It does imply that extension types with sophisticated parameterization
> must implement a custom (de)serialization mechanism themselves. I'm not
> sure this tradeoff was discussed at the time, perhaps other people (Wes?
> Jacques?) may chime in.
>
> Regards
>
> Antoine.
>
>
>
> On 16/08/2023 at 16:32, Jeremy Leibs wrote:
> > Hello,
> >
> > I've recently started working with extension types as part of our project
> > and I was surprised to discover that extension types are required to pack
> > all of their own metadata into a single string value of the
> > "ARROW:extension:metadata" key.
> >
> > In turn this then means we have to endure arbitrary unstructured /
> > hard-to-validate strings with custom encodings (e.g. JSON inside
> > flatbuffer) when dealing with extensions.
> >
> > Can anyone provide some context on the rationale for this design
> > decision?
> >
> > Given that we already have (1) a perfectly good metadata key-value store
> > in place, and (2) established recommendations for
> > namespaced scoping of keys, why would we not just use that to store the
> > metadata for the extension? For example:
> >
> > "ARROW:extension:name": "myorg.myextension",
> > "myorg:myextension:meta1": "value1",
> > "myorg:myextension:meta2": "value2",
> >
> > Thanks for any insights,
> > Jeremy
> >
>
