Re: Unsupported/Other Type

David Li Wed, 17 Apr 2024 05:22:38 -0700

I'll see if I can write this up in more detail.

@Weston, indeed this is some sort of "planning stage", but I think a concrete 
type is still useful. For example, wherever we use Arrow and adapt a foreign 
catalog, we may need _something_ to indicate the presence of a column that we 
do not know how to interpret. It would be bad to simply pretend the column does 
not exist, and it would be inconvenient for the user to get a hard error. This 
comes up with the Java JDBC adapter, which currently gives a hard error 
whenever it doesn't know how to convert a type, even if the user is only 
inquiring about the schema of the table. It also comes up with the ADBC 
Postgres driver, as discussed.
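
To make the idea concrete, here is a minimal sketch in plain Python. It is not 
pyarrow and not what the JDBC adapter or the ADBC driver does today. The 
extension name "example.other", the Field class, and the type table are all 
hypothetical, made up for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical extension name; nothing in this thread settles on one.
OTHER_EXTENSION_NAME = "example.other"

# Illustrative subset of a foreign-type-to-Arrow mapping (assumed, not
# copied from any driver).
KNOWN_TYPES = {"int4": "int32", "text": "utf8", "float8": "float64"}


@dataclass
class Field:
    name: str
    type_name: str
    metadata: dict = field(default_factory=dict)


def adapt_column(name: str, foreign_type: str) -> Field:
    """Map one foreign column, falling back to the placeholder type."""
    arrow_type = KNOWN_TYPES.get(foreign_type)
    if arrow_type is not None:
        return Field(name, arrow_type)
    # Unknown type: keep the column visible and remember where it came from.
    return Field(name, OTHER_EXTENSION_NAME, {"foreign_type": foreign_type})


def adapt_schema(columns):
    """Adapt (name, foreign_type) pairs without dropping or raising."""
    return [adapt_column(name, foreign_type) for name, foreign_type in columns]


schema = adapt_schema([("id", "int4"), ("shape", "polygon")])
for f in schema:
    print(f.name, f.type_name, f.metadata)
```

The point of the sketch is only that the unknown column stays in the schema 
with a recognizable marker, instead of vanishing or aborting the whole call.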


Otherwise, we'd have to come up with our own encoding of Arrow schemas that 
allows for Option<DataType>, and invent our own conventions in each language/in 
ADBC, and so on. Perhaps we could call this an abuse of Arrow schemas given 
that Arrow was meant to describe concrete in-memory data, but I think user 
requests for features like JSON encodings of Arrow schemas (even if we've made 
no progress on them) show that people generally find use in Arrow schemas 
independently of concrete data.
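
As a rough illustration of using a schema without any data attached, the 
sketch below lets a consumer inspect a schema that contains an uninterpretable 
column and fails only when that column is actually requested. It is plain 
Python with hypothetical names throughout (the "example.other" marker and both 
helper functions are assumptions, not an existing Arrow or ADBC API).

```python
# Hypothetical marker for an uninterpretable column; not an Arrow name.
OTHER = "example.other"

# A schema as (column name, type name) pairs, with no data attached.
schema = [("id", "int32"), ("shape", OTHER), ("label", "utf8")]


def readable_columns(schema):
    """Return the names of columns a consumer can safely request."""
    return [name for name, type_name in schema if type_name != OTHER]


def plan_projection(schema, requested):
    """Fail only when an uninterpretable column is actually requested."""
    types = dict(schema)
    blocked = [name for name in requested if types[name] == OTHER]
    if blocked:
        raise ValueError(f"cannot read columns of unknown type: {blocked}")
    return list(requested)


print(readable_columns(schema))
print(plan_projection(schema, ["id", "label"]))
try:
    plan_projection(schema, ["shape"])
except ValueError as exc:
    print("rejected:", exc)
```

With an Option<DataType>-style encoding, each consumer would have to invent 
its own version of the OTHER check above; a shared type gives them one thing 
to test for.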

On Wed, Apr 17, 2024, at 20:09, Antoine Pitrou wrote:
> Ah! Well, I think this could be an interesting proposal, but someone 
> should put forward a more formal proposal, perhaps as a draft PR.
>
> Regards
>
> Antoine.
>
>
> On 17/04/2024 at 11:57, David Li wrote:
>> For an unsupported/other extension type.
>> 
>> On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote:
>>> What is "this proposal"?
>>>
>>>
>>> On 17/04/2024 at 10:38, David Li wrote:
>>>> Should I take it that this proposal is dead in the water? While we could 
>>>> define our own Unknown/Other type for, say, the ADBC PostgreSQL driver, it 
>>>> might be useful to have a single type for consumers to latch on to.
>>>>
>>>> On Fri, Apr 12, 2024, at 07:32, David Li wrote:
>>>>> I think an "Other" extension type is slightly different than an
>>>>> arbitrary extension type, though: the latter may be understood
>>>>> downstream but the former represents a point at which a component
>>>>> explicitly declares it does not know how to handle a field. In this
>>>>> example, the PostgreSQL ADBC driver might be able to provide a
>>>>> representation regardless, but a different driver (or say, the JDBC
>>>>> adapter, which cannot necessarily get a bytestring for an arbitrary
>>>>> JDBC type) may want an Other type to signal that it would fail if asked
>>>>> to provide particular columns.
>>>>>
>>>>> On Fri, Apr 12, 2024, at 02:30, Dewey Dunnington wrote:
>>>>>> Depending where your Arrow-encoded data is used, either extension
>>>>>> types or generic field metadata are options. We have this problem in
>>>>>> the ADBC Postgres driver, where we can convert *most* Postgres types
>>>>>> to an Arrow type but there are some others where we can't or don't
>>>>>> know or don't implement a conversion. Currently for these we return
>>>>>> opaque binary (the Postgres COPY representation of the value) but put
>>>>>> field metadata so that a consumer can implement a workaround for an
>>>>>> unsupported type. It would arguably have been better to implement this
>>>>>> as an extension type; however, field metadata felt like less of a
>>>>>> commitment when I first worked on this.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> -dewey
>>>>>>
>>>>>> On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I was using UUID as an example. It looks like extension types cover my 
>>>>>>> original request.
>>>>>>> ________________________________
>>>>>>> From: Felipe Oliveira Carvalho <[email protected]>
>>>>>>> Sent: Thursday, April 11, 2024 7:15 AM
>>>>>>> To: [email protected] <[email protected]>
>>>>>>> Subject: Re: Unsupported/Other Type
>>>>>>>
>>>>>>> The OP used UUID as an example. Would that be enough, or is the
>>>>>>> request for a flexible mechanism that allows the creation of one-off
>>>>>>> nominal types for very specific use-cases?
>>>>>>>
>>>>>>> —
>>>>>>> Felipe
>>>>>>>
>>>>>>> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Yes, JSON and UUID are obvious candidates for new canonical extension
>>>>>>>> types. XML also comes to mind, but I'm not sure there's much of a use
>>>>>>>> case for it.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/04/2024 at 22:55, Wes McKinney wrote:
>>>>>>>>> In the past we have discussed adding a canonical type for UUID and
>>>>>>>>> JSON. I still think this is a good idea and could improve ergonomics
>>>>>>>>> in downstream language bindings (e.g. by exposing JSON querying
>>>>>>>>> functions or automatically boxing UUIDs in built-in UUID types, like
>>>>>>>>> the Python uuid library). Has anyone done any work on this, to
>>>>>>>>> anyone's knowledge?
>>>>>>>>>
>>>>>>>>> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Norman,
>>>>>>>>>> Arrow has a concept of extension types [1] along with the
>>>>>>>>>> possibility of proposing new canonical extension types [2]. This
>>>>>>>>>> seems to cover the use-cases you mention but I might be
>>>>>>>>>> misunderstanding?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>>>>>>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Problem Description
>>>>>>>>>>>
>>>>>>>>>>> Currently Arrow schemas can only contain columns of types supported
>>>>>>>>>>> by Arrow. In some cases an Arrow schema maps to an external schema.
>>>>>>>>>>> This can result in the Arrow schema not being able to support all
>>>>>>>>>>> the columns from the external schema.
>>>>>>>>>>>
>>>>>>>>>>> Consider an external system that contains a column of type UUID. To
>>>>>>>>>>> model the schema in Arrow, the user has two choices:
>>>>>>>>>>>
>>>>>>>>>>>      1.  Do not include the UUID column in the Arrow schema
>>>>>>>>>>>
>>>>>>>>>>>      2.  Map the column to an existing Arrow type. This will not
>>>>>>>>>>> include the original type information. A UUID can be mapped to a
>>>>>>>>>>> FixedSizeBinary, but consumers of the Arrow schema will be unable to
>>>>>>>>>>> distinguish a FixedSizeBinary field from a UUID field.
>>>>>>>>>>>
>>>>>>>>>>> Possible Solution
>>>>>>>>>>>
>>>>>>>>>>>      *   Add a new type code that represents unsupported types
>>>>>>>>>>>
>>>>>>>>>>>      *   Values for the new type are represented as variable length
>>>>>>>>>>> binary
>>>>>>>>>>>
>>>>>>>>>>> Some drivers can expose data even when they don’t understand the
>>>>>>>>>>> data type. For example, the PostgreSQL driver will return the raw
>>>>>>>>>>> bytes for fields of an unknown type. Using an explicit type lets
>>>>>>>>>>> clients know that they should convert values if they are able to
>>>>>>>>>>> determine the actual data type.
>>>>>>>>>>>
>>>>>>>>>>> Questions
>>>>>>>>>>>
>>>>>>>>>>>      *   What is the impact on existing clients when they encounter
>>>>>>>>>>> fields of the unsupported type?
>>>>>>>>>>>
>>>>>>>>>>>      *   Is it safe to assume that all unsupported values can safely
>>>>>>>>>>> be converted to a variable length binary?
>>>>>>>>>>>
>>>>>>>>>>>      *   How can we preserve information about the original type?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
