I've put up a draft PR here: https://github.com/apache/arrow/pull/41823
On Wed, Apr 17, 2024, at 23:34, David Li wrote:
> Yes, this would be for an extension type.
>
> On Wed, Apr 17, 2024, at 23:25, Weston Pace wrote:
>>> people generally find use in Arrow schemas independently of concrete data.
>>
>> This makes sense. I think we do want to encourage use of Arrow as a "type
>> system" even if there is no data involved. And, given that we cannot
>> easily change a field's data type property to "optional", it makes sense to
>> use a dedicated type, so I would be in favor of such a proposal (we
>> may eventually add an "unknown type" concept in Substrait as well, it's
>> come up several times, and so we could use this in that context).
>>
>> I think that I would still prefer a canonical extension type (with storage
>> type null) over a new dedicated type.
>>
>> On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <[email protected]> wrote:
>>
>>> Ah! Well, I think this could be an interesting proposal, but someone
>>> should put forward a more formal proposal, perhaps as a draft PR.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 17/04/2024 at 11:57, David Li wrote:
>>> > For an unsupported/other extension type.
>>> >
>>> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>>> >> What is "this proposal"?
>>> >>
>>> >>
>>> >> On 17/04/2024 at 10:38, David Li wrote:
>>> >>> Should I take it that this proposal is dead in the water? While we
>>> >>> could define our own Unknown/Other type for say the ADBC PostgreSQL
>>> >>> driver it might be useful to have a singular type for consumers to
>>> >>> latch on to.
>>> >>>
>>> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>>> >>>> I think an "Other" extension type is slightly different than an
>>> >>>> arbitrary extension type, though: the latter may be understood
>>> >>>> downstream but the former represents a point at which a component
>>> >>>> explicitly declares it does not know how to handle a field. In this
>>> >>>> example, the PostgreSQL ADBC driver might be able to provide a
>>> >>>> representation regardless, but a different driver (or say, the JDBC
>>> >>>> adapter, which cannot necessarily get a bytestring for an arbitrary
>>> >>>> JDBC type) may want an Other type to signal that it would fail if
>>> >>>> asked to provide particular columns.
>>> >>>>
>>> >>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>>> >>>>> Depending where your Arrow-encoded data is used, either extension
>>> >>>>> types or generic field metadata are options. We have this problem in
>>> >>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
>>> >>>>> to an Arrow type but there are some others where we can't or don't
>>> >>>>> know or don't implement a conversion. Currently for these we return
>>> >>>>> opaque binary (the Postgres COPY representation of the value) but put
>>> >>>>> field metadata so that a consumer can implement a workaround for an
>>> >>>>> unsupported type. It would be arguably better to have implemented
>>> >>>>> this as an extension type; however, field metadata felt like less of
>>> >>>>> a commitment when I first worked on this.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>>
>>> >>>>> -dewey
>>> >>>>>
>>> >>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>>> >>>>> <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>> I was using UUID as an example. It looks like extension types
>>> >>>>>> cover my original request.
>>> >>>>>> ________________________________
>>> >>>>>> From: Felipe Oliveira Carvalho <[email protected]>
>>> >>>>>> Sent: Thursday, April 11, 2024 7:15 AM
>>> >>>>>> To: [email protected] <[email protected]>
>>> >>>>>> Subject: Re: Unsupported/Other Type
>>> >>>>>>
>>> >>>>>> The OP used UUID as an example. Would that be enough, or is the
>>> >>>>>> request for a flexible mechanism that allows the creation of
>>> >>>>>> one-off nominal types for very specific use-cases?
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Felipe
>>> >>>>>>
>>> >>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Yes, JSON and UUID are obvious candidates for new canonical
>>> >>>>>>> extension types. XML also comes to mind, but I'm not sure there's
>>> >>>>>>> much of a use case for it.
>>> >>>>>>>
>>> >>>>>>> Regards
>>> >>>>>>>
>>> >>>>>>> Antoine.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>>> >>>>>>>> In the past we have discussed adding a canonical type for UUID
>>> >>>>>>>> and JSON. I still think this is a good idea and could improve
>>> >>>>>>>> ergonomics in downstream language bindings (e.g. by exposing
>>> >>>>>>>> JSON querying functions or automatically boxing UUIDs in
>>> >>>>>>>> built-in UUID types, like the Python uuid library). Has anyone
>>> >>>>>>>> done any work on this to anyone's knowledge?
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <[email protected]>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Hi Norman,
>>> >>>>>>>>> Arrow has a concept of extension types [1] along with the
>>> >>>>>>>>> possibility of proposing new canonical extension types [2].
>>> >>>>>>>>> This seems to cover the use-cases you mention but I might be
>>> >>>>>>>>> misunderstanding?
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Micah
>>> >>>>>>>>>
>>> >>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>> >>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>>> >>>>>>>>> <[email protected]> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Problem Description
>>> >>>>>>>>>>
>>> >>>>>>>>>> Currently Arrow schemas can only contain columns of types
>>> >>>>>>>>>> supported by Arrow. In some cases an Arrow schema maps to an
>>> >>>>>>>>>> external schema. This can result in the Arrow schema not being
>>> >>>>>>>>>> able to support all the columns from the external schema.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Consider an external system that contains a column of type
>>> >>>>>>>>>> UUID. To model the schema in Arrow, the user has two choices:
>>> >>>>>>>>>>
>>> >>>>>>>>>>   1. Do not include the UUID column in the Arrow schema
>>> >>>>>>>>>>
>>> >>>>>>>>>>   2. Map the column to an existing Arrow type. This will not
>>> >>>>>>>>>>      include the original type information. A UUID can be
>>> >>>>>>>>>>      mapped to a FixedSizeBinary, but consumers of the Arrow
>>> >>>>>>>>>>      schema will be unable to distinguish a FixedSizeBinary
>>> >>>>>>>>>>      field from a UUID field.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Possible Solution
>>> >>>>>>>>>>
>>> >>>>>>>>>>   * Add a new type code that represents unsupported types
>>> >>>>>>>>>>
>>> >>>>>>>>>>   * Values for the new type are represented as variable
>>> >>>>>>>>>>     length binary
>>> >>>>>>>>>>
>>> >>>>>>>>>> Some drivers can expose data even when they don’t understand
>>> >>>>>>>>>> the data type. For example, the PostgreSQL driver will return
>>> >>>>>>>>>> the raw bytes for fields of an unknown type. Using an explicit
>>> >>>>>>>>>> type lets clients know that they should convert values if they
>>> >>>>>>>>>> are able to determine the actual data type.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Questions
>>> >>>>>>>>>>
>>> >>>>>>>>>>   * What is the impact on existing clients when they encounter
>>> >>>>>>>>>>     fields of the unsupported type?
>>> >>>>>>>>>>
>>> >>>>>>>>>>   * Is it safe to assume that all unsupported values can
>>> >>>>>>>>>>     safely be converted to a variable length binary?
>>> >>>>>>>>>>
>>> >>>>>>>>>>   * How can we preserve information about the original type?
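For readers following the thread, the UUID-versus-FixedSizeBinary problem in
the original message, and the extension-type answer to it, can be sketched
without any Arrow library. This is only a toy model, not the proposal in the
draft PR: the Field class and the helper functions are invented for
illustration, and the extension names "example.uuid" and "example.other" and
the metadata keys "vendor_name" and "type_name" are hypothetical. The only
part taken from the Arrow format itself is the pair of field metadata keys,
ARROW:extension:name and ARROW:extension:metadata, that the columnar format
reserves for extension types.

```python
import json
from dataclasses import dataclass, field

# Metadata keys the Arrow columnar format reserves for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass
class Field:
    """Toy stand-in for an Arrow field (not a real Arrow class)."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def uuid_field(name):
    # "example.uuid" is a hypothetical extension name used for illustration.
    meta = {EXT_NAME_KEY: "example.uuid", EXT_META_KEY: ""}
    return Field(name, "fixed_size_binary[16]", meta)


def other_field(name, vendor, type_name):
    # Hypothetical "Other" type: null storage, with the original type's
    # identity serialized into the extension metadata.
    payload = json.dumps({"vendor_name": vendor, "type_name": type_name})
    meta = {EXT_NAME_KEY: "example.other", EXT_META_KEY: payload}
    return Field(name, "null", meta)


def describe(f):
    """Report how a consumer could interpret a field from its metadata."""
    ext = f.metadata.get(EXT_NAME_KEY)
    if ext is None:
        return f"{f.name}: plain {f.storage_type}"
    if ext == "example.other":
        info = json.loads(f.metadata[EXT_META_KEY])
        return f"{f.name}: unsupported {info['vendor_name']} type {info['type_name']}"
    return f"{f.name}: {ext} stored as {f.storage_type}"


schema = [
    Field("payload", "fixed_size_binary[16]"),
    uuid_field("id"),
    other_field("shape", "postgresql", "polygon"),
]
for f in schema:
    print(describe(f))
```

The first two fields share a storage type and differ only in metadata, which
is the distinction the original message said a consumer could not make. The
third shows how a producer could declare that it does not know how to handle
a column while still naming the original type.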
