Also, note I've raised a similar issue (https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls. One thing I think would be nice is the ability to hook into the Python conversion to translate to Python objects when necessary.

On Wed, Sep 21, 2022 at 8:49 PM Chang She <[email protected]> wrote:

> Thanks Wes.
>
> => Array.to_numpy : I opened ARROW-17813
> <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
> added some details / repro code. There's also a follow-up thing about the
> other direction, converting from a pandas DataFrame column to an Arrow
> list<extension>.
>
> => You're right, I was a little hasty in the description and it wasn't
> very accurate:
>
> Scenario 1:
>
> If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
> `pc.field("extension") == 'string'` would be a valid filter but
> currently triggers the "function 'equal' has no kernel matching input
> types" error.
> This is the path used by DuckDB if you add something like
> `extension=='string'` in the where clause.
> If Arrow/Acero is also able to automatically lower to storage type for the
> functions then it would make running compute on extension types a lot
> easier. Even for a list<label> column, at least in duckdb you could use
> "UNNEST" to make it work.
>
>
> Scenario 2:
>
> The trouble with using UNNEST is it makes the query a lot more complicated
> and has perf implications. If we're working a lot with nested data types,
> it would be easier to have a set of array functions.
> If there's a nested ExtensionArray, then something like a list-contains
> function would make things a lot easier. However, I think this is a lot
> more work (and depends on other systems like duckdb to integrate with these
> functions as well).
>
>
> Would it make sense for me to create a JIRA for scenario 1 to continue
> further discussion?
>
>
> Thanks again.
>
>
> On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <[email protected]> wrote:
>
>> hi Chang,
>>
>> There are a few rough edges here that you've run into:
>>
>> * It looks like Array.to_numpy does not "automatically lower" to the
>> storage type when trying to convert to NumPy format. In the absence of
>> some other conversion rule, converting to the storage type seems like
>> a reasonable alternative to failing. Can you open a Jira issue about
>> this? This could probably be fixed easily in time for the 10.0.0
>> release, much more easily than the next issue
>>
>> * On the query, it looks like the filter portion at least is being
>> handled by Arrow/Acero -- the syntax / UX relating to nested types here
>> is relatively unexplored relative to non-nested types. Here comparing
>> the label type (itself a list of dictionary-encoded strings) to a
>> string seems invalid, probably you would need to check for inclusion
>> of the string in the label list-of-strings. I do not know what the
>> syntax for this would be with DuckDB (to check for inclusion of a
>> string in a list of strings) but in principle this is something that
>> should be able to be made to work with some effort
>>
>> - Wes
>>
>> On Sun, Sep 18, 2022 at 8:23 PM Chang She <[email protected]> wrote:
>> >
>> > Hey y'all, thanks in advance for the discussion.
>> >
>> > I'm creating Arrow extensions for computer vision and I'm running into
>> > issues in two scenarios. I couldn't find the answers in the archive so I
>> > thought I'd post here.
>> >
>> > Example:
>> > I make an extension type called "Label" that has storage type
>> > "dictionary<int8, string>". This is an object detection dataset so each row
>> > represents an image and has multiple detected objects that needs to be
>> > labeled. So there's a "name" column that is "list<label>":
>> >
>> > Example table schema:
>> > image_id: int
>> > uri: string
>> > label: list<label> # list<dictionary<int8, string>> storage type
>> >
>> >
>> > Problems:
>> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I
>> > try to call `to_numpy` on the `label` column, then I get "Not implemented
>> > type for Arrow list to pandas: extension<label<LabelType>>"
>> > 2. If I'm querying this dataset using duckdb, running "select * from
>> > dataset where label='person'" results in: "Function 'equal' has no kernel
>> > matching input types (extension<label<LabelType>>, string)"
>> >
>> > Am I missing an alternate path to make this work with extension types?
>> > Does implementing this in Arrow consist of checking if something is an
>> > extension type and if so, use the storage type instead? Is this something
>> > that's already on the roadmap at all?
>> >
>> > Thanks!
>> >
>> > Chang She
>> >
