Hi Antoine,

Option 1 seems reasonable to me...given that we got 10 years out of the current specification without anybody noticing, I bet we can get another 10 out of KeyValueMetadata on the dictionary encoding :)
If there are no objections I can put together a PR for this (may take me a few weeks).

Cheers,

-dewey

On Tue, Apr 14, 2026 at 2:53 AM Antoine Pitrou <[email protected]> wrote:
>
> Hi Dewey,
>
> That's an interesting finding. Indeed, the IPC serialization of data
> types (the Schema table) is currently not able to distinguish between
> those two cases, simply because the dictionary type is not represented
> separately from its value type.
>
> I think there are two possible ways to improve this:
>
>
> 1. An additional field in the DictionaryEncoding table that allows
> specifying custom KeyValue metadata for the dictionary value type.
>
> Pros:
> - easy to implement
> - gracefully degrades to legacy readers that will happily deserialize
> the storage type.
>
> Cons:
> - does not fully solve the general problem for more complex nestings of
> dictionary and extension types (e.g. an extension type with a dictionary
> storage type with extension values).
>
>
> 2. A new Dictionary table that participates in the Type union, where the
> dictionary index type would be serialized in Field::children[0] and the
> value type in Field::children[1].
>
> Pros:
> - fully general, as it allows representing arbitrary nestings of
> dictionary and extension types.
>
> Cons:
> - implementation is more involved
> - legacy readers will not understand this and error out on the
> unrecognized type
> - writers will have to decide whether to use the new or the old way of
> representing dictionaries (the old way being preferable for compatibility).
>
>
> I would say we probably don't need 2) and can live with 1). But, of
> course, perhaps in 5 years we will regret this decision :-D
>
> Regards
>
> Antoine.
>
>
> On 10/04/2026 at 17:16, Dewey Dunnington wrote:
> > Hi all,
> >
> > In implementing dictionary decoding for nanoarrow's IPC reader [1] I
> > discovered that it is not possible to represent a dictionary-encoded
> > extension type in the IPC schema serialization. I've filed an issue
> > with the details at [2]...the summary is that a Dictionary with
> > Extension values is exported identically to an Extension with
> > Dictionary storage, which usually leads to an error on read (because
> > no extension types actually support dictionary storage types, except
> > maybe arrow.opaque because it can have arbitrary storage). I was also
> > reminded that arrow-rs can't represent dictionary-encoded extension
> > values at all [3].
> >
> > Given that there are a number of canonical extension types now, I
> > wonder if there should be a clearer route to roundtripping
> > dictionary-encoded extension types over IPC (either by making this
> > possible to represent in IPC or by making it clear that extension type
> > implementations must handle dictionary-encoded storage). Somewhere in
> > the middle would be handling the error on deserialization (i.e., if
> > the extension type in the registry doesn't support dictionary-encoded
> > storage, fall back to a dictionary with extension values).
> >
> > Cheers,
> >
> > -dewey
> >
> > [1] https://github.com/apache/arrow-nanoarrow/pull/861
> > [2] https://github.com/apache/arrow/issues/49704
> > [3] https://github.com/apache/arrow-rs/issues/7982
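
For readers following the thread, the ambiguity and the effect of Option 1 can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Arrow's actual API or the eventual spec change: the `Primitive`, `Dictionary`, and `Extension` classes, the `to_field` helper, the `value_metadata` key, and the `example.label` extension name are all hypothetical. The only detail taken from the real format is that an extension type is recorded as field-level custom metadata under `ARROW:extension:name`, while a dictionary-encoded field stores its value type in the field's type slot.

```python
from dataclasses import dataclass

# Toy stand-ins for Arrow data types (hypothetical, not a real Arrow API).
@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "int32" or "utf8"

@dataclass(frozen=True)
class Dictionary:
    index: object  # a Primitive index type
    value: object  # a Primitive or an Extension

@dataclass(frozen=True)
class Extension:
    name: str
    storage: object  # a Primitive or a Dictionary

EXT_KEY = "ARROW:extension:name"


def to_field(dtype, value_metadata=False):
    """Flatten a toy type into a Field-like dict.

    Handles only the two nestings discussed in the thread. With
    value_metadata=True, an extension on the dictionary *values* is written
    to a hypothetical per-dictionary metadata slot (the Option 1 idea).
    """
    field = {"type": None, "dictionary": None, "custom_metadata": {}}
    if isinstance(dtype, Extension):
        # Extension wrapping the whole field: name goes in field metadata.
        field["custom_metadata"][EXT_KEY] = dtype.name
        dtype = dtype.storage
    if isinstance(dtype, Dictionary):
        field["dictionary"] = {"index_type": dtype.index.name}
        value = dtype.value
        if isinstance(value, Extension):
            if value_metadata:
                # Option 1 sketch: metadata lives with the dictionary encoding.
                field["dictionary"]["value_metadata"] = {EXT_KEY: value.name}
            else:
                # Current layout: field metadata is the only available slot.
                field["custom_metadata"][EXT_KEY] = value.name
            value = value.storage
        dtype = value
    field["type"] = dtype.name
    return field


dict_of_ext = Dictionary(Primitive("int32"),
                         Extension("example.label", Primitive("utf8")))
ext_of_dict = Extension("example.label",
                        Dictionary(Primitive("int32"), Primitive("utf8")))

# Current layout: both nestings flatten to the same Field-like dict.
same_today = to_field(dict_of_ext) == to_field(ext_of_dict)

# Option 1 sketch: the two nestings become distinguishable.
same_with_option_1 = (to_field(dict_of_ext, value_metadata=True)
                      == to_field(ext_of_dict, value_metadata=True))
```

In this model `same_today` is true and `same_with_option_1` is false. A reader that ignores the hypothetical `value_metadata` key would still see a plain dictionary over the storage type, which matches the graceful degradation listed under Option 1's pros. The model does not cover deeper nestings, which is the limitation listed under its cons.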
