In the past we have discussed adding canonical extension types for UUID and JSON. I still think this is a good idea and could improve ergonomics in downstream language bindings (e.g. by exposing JSON querying functions, or by automatically boxing UUIDs in built-in UUID types such as the one in the Python uuid library). Does anyone know whether any work has been done on this?
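To make the boxing idea concrete, here is a minimal sketch in plain Python with no Arrow library involved. It assumes a field is modelled as a dict of metadata and a column as a list of 16-byte values. `ARROW:extension:name` is the metadata key the columnar format uses for extension types; the name `example.uuid` and the helper `box_values` are hypothetical and only for illustration.

```python
import uuid

# Metadata key defined by the Arrow columnar format for extension types.
EXTENSION_NAME_KEY = "ARROW:extension:name"
# Hypothetical extension name, used here for illustration only.
UUID_EXTENSION_NAME = "example.uuid"


def box_values(field_metadata, raw_values):
    """Box 16-byte values as uuid.UUID when the field is tagged as a UUID.

    Fields without the tag are returned unchanged, which mirrors how a
    consumer that does not know the extension falls back to the storage type.
    """
    if field_metadata.get(EXTENSION_NAME_KEY) != UUID_EXTENSION_NAME:
        return list(raw_values)
    return [None if v is None else uuid.UUID(bytes=v) for v in raw_values]


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
# What a FixedSizeBinary(16) storage column would hold, including a null.
storage = [original.bytes, None]

tagged = box_values({EXTENSION_NAME_KEY: UUID_EXTENSION_NAME}, storage)
untagged = box_values({}, storage)
```

With the tag present, the binding can hand back `uuid.UUID` objects. Without it, the consumer only sees raw bytes and cannot tell the column apart from any other 16-byte binary field.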
On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Norman,
> Arrow has a concept of extension types [1] along with the possibility of
> proposing new canonical extension types [2]. This seems to cover the
> use-cases you mention but I might be misunderstanding?
>
> Thanks,
> Micah
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>
> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
> <norman.jor...@improving.com.invalid> wrote:
>
> > Problem Description
> >
> > Currently Arrow schemas can only contain columns of types supported by
> > Arrow. In some cases an Arrow schema maps to an external schema. This can
> > result in the Arrow schema not being able to support all the columns from
> > the external schema.
> >
> > Consider an external system that contains a column of type UUID. To model
> > the schema in Arrow, the user has two choices:
> >
> > 1. Do not include the UUID column in the Arrow schema
> >
> > 2. Map the column to an existing Arrow type. This will not include the
> > original type information. A UUID can be mapped to a FixedSizeBinary, but
> > consumers of the Arrow schema will be unable to distinguish a
> > FixedSizeBinary field from a UUID field.
> >
> > Possible Solution
> >
> > * Add a new type code that represents unsupported types
> >
> > * Values for the new type are represented as variable length binary
> >
> > Some drivers can expose data even when they don’t understand the data
> > type. For example, the PostgreSQL driver will return the raw bytes for
> > fields of an unknown type. Using an explicit type lets clients know that
> > they should convert values if they were able to determine the actual data
> > type.
> >
> > Questions
> >
> > * What is the impact on existing clients when they encounter fields of
> > the unsupported type?
> >
> > * Is it safe to assume that all unsupported values can safely be
> > converted to a variable length binary?
> >
> > * How can we preserve information about the original type?