Hi Antoine,

Option 1 seems reasonable to me...given that we got 10 years out of
the current specification without anybody noticing, I bet we can get
another 10 out of KeyValueMetadata on the dictionary encoding :)

If there are no objections I can put together a PR for this (may take
me a few weeks).

Cheers,

-dewey
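
For illustration only, Option 1 might look roughly like the toy model below. It is plain Python, not the actual Flatbuffers schema: the `value_metadata` field name, the `Toy*` classes, and the `describe` helper are all hypothetical, and only the `ARROW:extension:name` metadata key comes from the existing format. The sketch assumes a new reader looks at the extra metadata while a legacy reader ignores it and ends up with the storage type.

```python
from dataclasses import dataclass
from typing import Dict, Optional

EXT_NAME_KEY = "ARROW:extension:name"


@dataclass
class ToyDictionaryEncoding:
    """Toy stand-in for the DictionaryEncoding table."""
    id: int
    index_type: str
    is_ordered: bool = False
    # Hypothetical Option 1 addition: KeyValue metadata for the value type.
    value_metadata: Optional[Dict[str, str]] = None


@dataclass
class ToyField:
    """Toy stand-in for a Field in the IPC Schema."""
    name: str
    storage_type: str
    dictionary: Optional[ToyDictionaryEncoding] = None
    custom_metadata: Optional[Dict[str, str]] = None


def describe(field: ToyField, understands_value_metadata: bool = True) -> str:
    """Return the logical type a reader would reconstruct, as a string."""
    ext = (field.custom_metadata or {}).get(EXT_NAME_KEY)
    enc = field.dictionary
    if enc is None:
        # Not dictionary-encoded: plain type or plain extension type.
        return f"{ext}<{field.storage_type}>" if ext else field.storage_type

    # A legacy reader never looks at the new field.
    value_ext = None
    if understands_value_metadata and enc.value_metadata:
        value_ext = enc.value_metadata.get(EXT_NAME_KEY)

    values = f"{value_ext}<{field.storage_type}>" if value_ext else field.storage_type
    dict_type = f"dictionary<{enc.index_type}, {values}>"
    # Field-level metadata keeps its current meaning: extension over the dictionary.
    return f"{ext}<{dict_type}>" if ext else dict_type


uuid_dict = ToyField(
    name="id",
    storage_type="fixed_size_binary(16)",
    dictionary=ToyDictionaryEncoding(
        id=0, index_type="int32", value_metadata={EXT_NAME_KEY: "arrow.uuid"}
    ),
)
print(describe(uuid_dict))
print(describe(uuid_dict, understands_value_metadata=False))
```

Under these assumptions the first call reports a dictionary with extension values, and the second reports a plain dictionary of the storage type, which is the graceful degradation described for legacy readers.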

On Tue, Apr 14, 2026 at 2:53 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Hi Dewey,
>
> That's an interesting finding. Indeed, the IPC serialization of data
> types (the Schema table) is currently not able to distinguish between
> those two cases, simply because the dictionary type is not represented
> separately from its value type.
>
> I think there are two possible ways to improve this:
>
>
> 1. An additional field in the DictionaryEncoding table that allows
> specifying custom KeyValue metadata for the dictionary value type.
>
> Pros:
> - easy to implement
> - degrades gracefully for legacy readers, which will simply deserialize
> the storage type.
>
> Cons:
> - does not fully solve the general problem for more complex nestings of
> dictionary and extension types (e.g. an extension type with a dictionary
> storage type with extension values).
>
>
> 2. A new Dictionary table that participates in the Type union, where the
> dictionary index type would be serialized in Field::children[0] and the
> value type in Field::children[1].
>
> Pros:
> - fully general, as it can represent arbitrary nestings of
> dictionary and extension types.
>
> Cons:
> - implementation is more involved
> - legacy readers will not understand this and error out on the
> unrecognized type
> - writers will have to decide whether to use the new or the old way of
> representing dictionaries (the old way being preferable for compatibility).
>
>
> I would say we probably don't need 2) and can live with 1). But, of
> course, perhaps in 5 years we will regret this decision :-D
>
> Regards
>
> Antoine.
>
>
> On 10/04/2026 at 17:16, Dewey Dunnington wrote:
> > Hi all,
> >
> > In implementing dictionary decoding for nanoarrow's IPC reader [1] I
> > discovered that it is not possible to represent a dictionary-encoded
> > extension type in the IPC schema serialization. I've filed an issue
> > with the details at [2]...the summary is that a Dictionary with
> > Extension values is exported identically to an Extension with
> > Dictionary storage, which usually leads to an error on read (because
> > no extension types actually support dictionary storage types, except
> > maybe arrow.opaque because it can have arbitrary storage). I was also
> > reminded that arrow-rs can't represent dictionary-encoded extension
> > values at all [3].
> >
> > Given that there are a number of canonical extension types now, I
> > wonder if there should be a clearer route to roundtripping
> > dictionary-encoded extension types over IPC (either by making this
> > possible to represent in IPC or by making it clear that extension type
> > implementations must handle dictionary encoded storage). Somewhere in
> > the middle would be handling the error on deserialization (i.e., if
> > the extension type in the registry doesn't support dictionary encoded
> > storage, fall back to a dictionary with extension values).
> >
> > Cheers,
> >
> > -dewey
> >
> > [1] https://github.com/apache/arrow-nanoarrow/pull/861
> > [2] https://github.com/apache/arrow/issues/49704
> > [3] https://github.com/apache/arrow-rs/issues/7982
>
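
To make the ambiguity discussed in this thread concrete, here is a minimal toy model in plain Python. It does not use the real Flatbuffers tables: `ToyField`, the two `serialize_*` helpers, and `resolve` are hypothetical names. It assumes only what the thread states, namely that the extension name lives in the Field's key/value metadata (under `ARROW:extension:name`) and that the dictionary type is not represented separately from its value type. The `resolve` function sketches the "somewhere in the middle" fallback: treat the field as an extension over dictionary storage only if the registry says that extension supports it.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

EXT_NAME_KEY = "ARROW:extension:name"


@dataclass(frozen=True)
class ToyField:
    """Toy stand-in for a serialized IPC Field (not the real table)."""
    storage_type: str                  # e.g. "fixed_size_binary(16)"
    index_type: Optional[str] = None   # set when dictionary-encoded
    custom_metadata: Tuple[Tuple[str, str], ...] = ()


def serialize_dictionary_of_extension(ext: str, storage: str, index: str) -> ToyField:
    # Dictionary whose values are an extension type.
    return ToyField(storage, index, ((EXT_NAME_KEY, ext),))


def serialize_extension_of_dictionary(ext: str, storage: str, index: str) -> ToyField:
    # Extension type whose storage is dictionary-encoded.
    return ToyField(storage, index, ((EXT_NAME_KEY, ext),))


def resolve(field: ToyField, dictionary_storage_ok: Set[str]) -> str:
    """Hypothetical reader-side fallback for the ambiguous case."""
    ext = dict(field.custom_metadata).get(EXT_NAME_KEY)
    if field.index_type is None:
        return f"{ext}<{field.storage_type}>" if ext else field.storage_type
    if ext is None:
        return f"dictionary<{field.index_type}, {field.storage_type}>"
    if ext in dictionary_storage_ok:
        # The registered extension accepts dictionary storage.
        return f"{ext}<dictionary<{field.index_type}, {field.storage_type}>>"
    # Otherwise fall back to a dictionary with extension values.
    return f"dictionary<{field.index_type}, {ext}<{field.storage_type}>>"


a = serialize_dictionary_of_extension("arrow.uuid", "fixed_size_binary(16)", "int32")
b = serialize_extension_of_dictionary("arrow.uuid", "fixed_size_binary(16)", "int32")
print(a == b)  # both logical types collapse to the same serialized form
print(resolve(a, dictionary_storage_ok={"arrow.opaque"}))
```

In this model the two logical types are indistinguishable once serialized, so the reader can only guess; the fallback guesses from what the extension registry allows rather than raising an error.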
