Re: Unsupported/Other Type

David Li Fri, 24 May 2024 19:47:49 -0700

I've put up a draft PR here: https://github.com/apache/arrow/pull/41823


On Wed, Apr 17, 2024, at 23:34, David Li wrote:
> Yes, this would be for an extension type. 
>
> On Wed, Apr 17, 2024, at 23:25, Weston Pace wrote:
>>> people generally find use in Arrow schemas independently of concrete data.
>>
>> This makes sense.  I think we do want to encourage use of Arrow as a "type
>> system" even if there is no data involved.  And, given that we cannot
>> easily change a field's data type property to "optional" it makes sense to
>> use a dedicated type and so I would be in favor of such a proposal (we
>> may eventually add an "unknown type" concept in Substrait as well, it's
>> come up several times, and so we could use this in that context).
>>
>> I think that I would still prefer a canonical extension type (with storage
>> type null) over a new dedicated type.
>>
>> On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>>>
>>> Ah! Well, I think this could be an interesting proposal, but someone
>>> should put forward a more formal proposal, perhaps as a draft PR.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>> On 17/04/2024 at 11:57, David Li wrote:
>>> > For an unsupported/other extension type.
>>> >
>>> > On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>>> >> What is "this proposal"?
>>> >>
>>> >>
>>> >> On 17/04/2024 at 10:38, David Li wrote:
>>> >>> Should I take it that this proposal is dead in the water? While we
>>> >>> could define our own Unknown/Other type for say the ADBC PostgreSQL
>>> >>> driver it might be useful to have a singular type for consumers to
>>> >>> latch on to.
>>> >>>
>>> >>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>>> >>>> I think an "Other" extension type is slightly different than an
>>> >>>> arbitrary extension type, though: the latter may be understood
>>> >>>> downstream but the former represents a point at which a component
>>> >>>> explicitly declares it does not know how to handle a field. In this
>>> >>>> example, the PostgreSQL ADBC driver might be able to provide a
>>> >>>> representation regardless, but a different driver (or say, the JDBC
>>> >>>> adapter, which cannot necessarily get a bytestring for an arbitrary
>>> >>>> JDBC type) may want an Other type to signal that it would fail if
>>> >>>> asked to provide particular columns.
>>> >>>>
>>> >>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>>> >>>>> Depending where your Arrow-encoded data is used, either extension
>>> >>>>> types or generic field metadata are options. We have this problem in
>>> >>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
>>> >>>>> to an Arrow type but there are some others where we can't or don't
>>> >>>>> know or don't implement a conversion. Currently for these we return
>>> >>>>> opaque binary (the Postgres COPY representation of the value) but put
>>> >>>>> field metadata so that a consumer can implement a workaround for an
>>> >>>>> unsupported type. It would be arguably better to have implemented
>>> >>>>> this as an extension type; however, field metadata felt like less
>>> >>>>> of a commitment when I first worked on this.
>>> >>>>>
>>> >>>>> Cheers,
>>> >>>>>
>>> >>>>> -dewey
>>> >>>>>
>>> >>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>>> >>>>> <norman.jor...@improving.com.invalid> wrote:
>>> >>>>>>
>>> >>>>>> I was using UUID as an example. It looks like extension types
>>> >>>>>> cover my original request.
>>> >>>>>> ________________________________
>>> >>>>>> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
>>> >>>>>> Sent: Thursday, April 11, 2024 7:15 AM
>>> >>>>>> To: dev@arrow.apache.org <dev@arrow.apache.org>
>>> >>>>>> Subject: Re: Unsupported/Other Type
>>> >>>>>>
>>> >>>>>> The OP used UUID as an example. Would that be enough, or is the
>>> >>>>>> request for a flexible mechanism that allows the creation of
>>> >>>>>> one-off nominal types for very specific use-cases?
>>> >>>>>>
>>> >>>>>> —
>>> >>>>>> Felipe
>>> >>>>>>
>>> >>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>> Yes, JSON and UUID are obvious candidates for new canonical
>>> >>>>>>> extension types. XML also comes to mind, but I'm not sure there's
>>> >>>>>>> much of a use case for it.
>>> >>>>>>>
>>> >>>>>>> Regards
>>> >>>>>>>
>>> >>>>>>> Antoine.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>>> >>>>>>>> In the past we have discussed adding a canonical type for UUID
>>> >>>>>>>> and JSON. I still think this is a good idea and could improve
>>> >>>>>>>> ergonomics in downstream language bindings (e.g. by exposing JSON
>>> >>>>>>>> querying functions or automatically boxing UUIDs in built-in UUID
>>> >>>>>>>> types, like the Python uuid library). Has anyone done any work on
>>> >>>>>>>> this, to anyone's knowledge?
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
>>> >>>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Hi Norman,
>>> >>>>>>>>> Arrow has a concept of extension types [1] along with the
>>> >>>>>>>>> possibility of proposing new canonical extension types [2].  This
>>> >>>>>>>>> seems to cover the use-cases you mention but I might be
>>> >>>>>>>>> misunderstanding?
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Micah
>>> >>>>>>>>>
>>> >>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>> >>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>> >>>>>>>>>
>>> >>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>>> >>>>>>>>> <norman.jor...@improving.com.invalid> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Problem Description
>>> >>>>>>>>>>
>>> >>>>>>>>>> Currently Arrow schemas can only contain columns of types
>>> >>>>>>>>>> supported by Arrow. In some cases an Arrow schema maps to an
>>> >>>>>>>>>> external schema. This can result in the Arrow schema not being
>>> >>>>>>>>>> able to support all the columns from the external schema.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Consider an external system that contains a column of type UUID.
>>> >>>>>>>>>> To model the schema in Arrow, the user has two choices:
>>> >>>>>>>>>>
>>> >>>>>>>>>>      1.  Do not include the UUID column in the Arrow schema
>>> >>>>>>>>>>
>>> >>>>>>>>>>      2.  Map the column to an existing Arrow type. This will not
>>> >>>>>>>>>>          include the original type information. A UUID can be
>>> >>>>>>>>>>          mapped to a FixedSizeBinary, but consumers of the Arrow
>>> >>>>>>>>>>          schema will be unable to distinguish a FixedSizeBinary
>>> >>>>>>>>>>          field from a UUID field.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Possible Solution
>>> >>>>>>>>>>
>>> >>>>>>>>>>      *   Add a new type code that represents unsupported types
>>> >>>>>>>>>>
>>> >>>>>>>>>>      *   Values for the new type are represented as variable
>>> >>>>>>>>>>          length binary
>>> >>>>>>>>>>
>>> >>>>>>>>>> Some drivers can expose data even when they don’t understand the
>>> >>>>>>>>>> data type. For example, the PostgreSQL driver will return the raw
>>> >>>>>>>>>> bytes for fields of an unknown type. Using an explicit type lets
>>> >>>>>>>>>> clients know that they should convert values if they are able to
>>> >>>>>>>>>> determine the actual data type.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Questions
>>> >>>>>>>>>>
>>> >>>>>>>>>>      *   What is the impact on existing clients when they
>>> >>>>>>>>>>          encounter fields of the unsupported type?
>>> >>>>>>>>>>
>>> >>>>>>>>>>      *   Is it safe to assume that all unsupported values can
>>> >>>>>>>>>>          safely be converted to a variable length binary?
>>> >>>>>>>>>>
>>> >>>>>>>>>>      *   How can we preserve information about the original type?
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>>
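
The mechanism discussed in this thread can be sketched without any Arrow
library. What follows is a minimal, illustrative Python sketch, not the design
in the draft PR. It models a field as a name, a storage type, and key/value
metadata, and uses the two metadata keys that the Arrow columnar format
reserves for extension types (ARROW:extension:name and
ARROW:extension:metadata). The Field class, the describe and box_value
helpers, the extension names "example.uuid" and "example.other", and the
source type string are all hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field

# Metadata keys reserved by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension names, used only in this sketch.
UUID_EXT = "example.uuid"
OTHER_EXT = "example.other"


@dataclass
class Field:
    """Toy stand-in for an Arrow field (not a real Arrow class)."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def describe(f: Field) -> str:
    """Report what a consumer can learn from the field alone, with no data."""
    ext_name = f.metadata.get(EXT_NAME_KEY)
    if ext_name is None:
        # No extension metadata: only the storage type is known.
        return f"{f.name}: plain {f.storage_type}"
    if ext_name == OTHER_EXT:
        # The producer explicitly declares it cannot handle this column.
        original = f.metadata.get(EXT_META_KEY, "unknown")
        return f"{f.name}: unsupported source type ({original})"
    return f"{f.name}: {ext_name} stored as {f.storage_type}"


def box_value(f: Field, raw: bytes):
    """Box raw storage bytes into a richer Python type when the field allows it."""
    if f.metadata.get(EXT_NAME_KEY) == UUID_EXT:
        return uuid.UUID(bytes=raw)
    return raw


plain = Field("id", "fixed_size_binary[16]")
tagged = Field("id", "fixed_size_binary[16]", {EXT_NAME_KEY: UUID_EXT})
other = Field(
    "payload",
    "null",
    {EXT_NAME_KEY: OTHER_EXT, EXT_META_KEY: "hypothetical_vendor_type"},
)

for f in (plain, tagged, other):
    print(describe(f))
```

The first two fields share the same storage type, so only the extension name
lets a consumer tell a UUID column from an arbitrary 16-byte binary column.
The third field uses null storage, as suggested earlier in the thread, so the
schema can say "this column exists but is not supported" without carrying any
values.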
