Re: guidance on extension types

Micah Kornfield Wed, 21 Sep 2022 21:17:29 -0700

Also, note I've raised a similar issue (
https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls.
One thing that I think would be nice is to be able to hook into the Python
conversion to translate to Python objects when necessary.




On Wed, Sep 21, 2022 at 8:49 PM Chang She <[email protected]> wrote:

> Thanks Wes.
>
> => Array.to_numpy : I opened ARROW-17813
> <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
> added some details / repro code. There's also a follow-up thing about the
> other direction, converting from a pandas DataFrame column to an Arrow
> list<extension>.
>
> => You're right, I was a little hasty in the description and it wasn't
> very accurate:
>
> Scenario 1:
>
> If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
> `pc.field("extension") == 'string'` would be a valid filter but
> currently triggers the "function 'equal' has no kernel matching input
> types" error.
> This is the path used by DuckDB if you add something like
> `extension=='string'` in the where clause.
> If Arrow/Acero were also able to automatically lower to the storage type
> for these functions, it would make running compute on extension types a lot
> easier. Even for a list<label> column, at least in DuckDB you could use
> "UNNEST" to make it work.
>
>
> Scenario 2:
>
> The trouble with using UNNEST is it makes the query a lot more complicated
> and has perf implications. If we're working a lot with nested data types,
> it would be easier to have a set of array functions.
> If there's a nested ExtensionArray, then something like a list-contains
> function would make things a lot easier. However, I think this is a lot
> more work (and depends on other systems like duckdb to integrate with these
> functions as well).
>
>
> Would it make sense for me to create a JIRA for scenario 1 to continue
> further discussion?
>
>
> Thanks again.
>
>
> On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <[email protected]> wrote:
>
>> hi Chang,
>>
>> There are a few rough edges here that you've run into:
>>
>> * It looks like Array.to_numpy does not "automatically lower" to the
>> storage type when trying to convert to NumPy format. In the absence of
>> some other conversion rule, converting to the storage type seems like
>> a reasonable alternative to failing. Can you open a Jira issue about
>> this? This could probably be fixed easily in time for the 10.0.0
>> release, much more easily than the next issue
>>
>> * On the query, it looks like the filter portion at least is being
>> handled by Arrow/Acero. The syntax / UX for nested types here is
>> relatively unexplored compared to non-nested types. Here, comparing
>> the label type (itself a list of dictionary-encoded strings) to a
>> string seems invalid; you would probably need to check for inclusion
>> of the string in the label list-of-strings. I do not know what the
>> syntax for this would be with DuckDB (to check for inclusion of a
>> string in a list of strings), but in principle this is something that
>> should be able to be made to work with some effort
>>
>> - Wes
>>
>> On Sun, Sep 18, 2022 at 8:23 PM Chang She <[email protected]> wrote:
>> >
>> > Hey y'all, thanks in advance for the discussion.
>> >
>> > I'm creating Arrow extensions for computer vision and I'm running into
>> issues in two scenarios. I couldn't find the answers in the archive so I
>> thought I'd post here.
>> >
>> > Example:
>> > I make an extension type called "Label" that has storage type
>> "dictionary<int8, string>". This is an object detection dataset so each row
>> represents an image and has multiple detected objects that need to be
>> labeled. So there's a "label" column that is "list<label>":
>> >
>> > Example table schema:
>> > image_id: int
>> > uri: string
>> > label: list<label>   # list<dictionary<int8, string>>  storage type
>> >
>> >
>> > Problems:
>> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I
>> try to call `to_numpy` on the `label` column, then I get "Not implemented
>> type for Arrow list to pandas: extension<label<LabelType>>"
>> > 2. If I'm querying this dataset using duckdb, running "select * from
>> dataset where label='person'" results in: "Function 'equal' has no kernel
>> matching input types (extension<label<LabelType>>, string)"
>> >
>> > Am I missing an alternate path to make this work with extension types?
>> > Does implementing this in Arrow consist of checking if something is an
>> extension type and, if so, using the storage type instead? Is this something
>> that's already on the roadmap at all?
>> >
>> > Thanks!
>> >
>> > Chang She
>>
>
