Thanks Wes.

=> Array.to_numpy: I opened ARROW-17813 <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and added some details / repro code. There's also a follow-up question about the other direction, converting from a pandas DataFrame column to an Arrow list<extension>.
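In case it's useful for anyone following along here, the shape of the problem is roughly the snippet below. The `demo.label` type is just a stand-in with the same dictionary<int8, string> storage, not my actual Label implementation, and I've commented out the calls that currently error for me:

import pyarrow as pa


# Stand-in extension type whose storage is dictionary<int8, string>,
# analogous to the Label type described further down in the thread.
class LabelType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(
            self, pa.dictionary(pa.int8(), pa.string()), "demo.label"
        )

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


storage = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 0], pa.int8()), pa.array(["person", "car"])
)
label_arr = pa.ExtensionArray.from_storage(LabelType(), storage)

# ARROW-17813: converting the extension array itself errors for me
# instead of falling back to the storage type.
# label_arr.to_numpy(zero_copy_only=False)

# Going through the storage by hand works fine:
label_arr.storage.cast(pa.string()).to_numpy(zero_copy_only=False)

# The nested list<label> column from the original example hits the
# "Not implemented type for Arrow list to pandas" error the same way:
list_arr = pa.ListArray.from_arrays(pa.array([0, 2, 3], pa.int32()), label_arr)
# list_arr.to_pandas()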
=> You're right, I was a little hasty in the description and it wasn't very accurate:

Scenario 1: If I have a non-nested ExtensionArray whose storage is a DictionaryArray, `pc.field("extension") == 'string'` would be a valid filter, but it currently triggers the "function 'equal' has no kernel matching input types" error (rough sketch below). This is the path DuckDB takes if you add something like `extension=='string'` to the WHERE clause. If Arrow/Acero could also automatically lower to the storage type for these functions, it would make running compute on extension types a lot easier. Even for a list<label> column, at least in DuckDB you could use UNNEST to make it work.

Scenario 2: The trouble with UNNEST is that it makes the query a lot more complicated and has performance implications. If we're working heavily with nested data types, it would be easier to have a set of array functions: for a nested ExtensionArray, something like a list-contains function would make things a lot easier. However, I think this is a lot more work (and depends on other systems like DuckDB integrating with these functions as well).

Would it make sense for me to create a JIRA for scenario 1 to continue the discussion?

Thanks again.
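To make scenario 1 concrete, here's roughly what it looks like in pyarrow, reusing the stand-in LabelType / label_arr from the sketch above (the table and column names are made up):

import pyarrow as pa
import pyarrow.compute as pc

# Assumes LabelType / label_arr from the earlier sketch are in scope.
table = pa.table({"image_id": [1, 2, 3], "label": label_arr})

# This is essentially the comparison DuckDB pushes down for a
# `... WHERE label == 'person'` clause; for me it raises
# "Function 'equal' has no kernel matching input types (extension<...>, string)".
# pc.equal(table["label"], "person")

# Manual workaround today: compare against the decoded storage instead.
mask = pc.equal(label_arr.storage.cast(pa.string()), "person")
table.filter(mask)

If Acero lowered extension types to their storage type before kernel dispatch, the pushed-down comparison above would presumably just work, without DuckDB needing to know anything about the extension type.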
On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Chang,
>
> There are a few rough edges here that you've run into:
>
> * It looks like Array.to_numpy does not "automatically lower" to the
> storage type when trying to convert to NumPy format. In the absence of
> some other conversion rule, converting to the storage type seems like
> a reasonable alternative to failing. Can you open a Jira issue about
> this? This could probably be fixed easily in time for the 10.0.0
> release, much more easily than the next issue
>
> * On the query, it looks like the filter portion at least is being
> handled by Arrow/Acero -- the syntax / UX relating to nested types here
> is relatively unexplored relative to non-nested types. Here comparing
> the label type (itself a list of dictionary-encoded strings) to a
> string seems invalid, probably you would need to check for inclusion
> of the string in the label list-of-strings. I do not know what the
> syntax for this would be with DuckDB (to check for inclusion of a
> string in a list of strings) but in principle this is something that
> should be able to be made to work with some effort
>
> - Wes
>
> On Sun, Sep 18, 2022 at 8:23 PM Chang She <ch...@eto.ai> wrote:
> >
> > Hey y'all, thanks in advance for the discussion.
> >
> > I'm creating Arrow extensions for computer vision and I'm running into
> > issues in two scenarios. I couldn't find the answers in the archive so
> > I thought I'd post here.
> >
> > Example:
> > I make an extension type called "Label" that has storage type
> > "dictionary<int8, string>". This is an object detection dataset, so each
> > row represents an image and has multiple detected objects that need to
> > be labeled. So there's a "label" column that is "list<label>":
> >
> > Example table schema:
> > image_id: int
> > uri: string
> > label: list<label>  # list<dictionary<int8, string>> storage type
> >
> > Problems:
> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I
> > try to call `to_numpy` on the `label` column, then I get "Not
> > implemented type for Arrow list to pandas: extension<label<LabelType>>"
> > 2. If I'm querying this dataset using duckdb, running "select * from
> > dataset where label='person'" results in: "Function 'equal' has no
> > kernel matching input types (extension<label<LabelType>>, string)"
> >
> > Am I missing an alternate path to make this work with extension types?
> > Does implementing this in Arrow consist of checking if something is an
> > extension type and, if so, using the storage type instead? Is this
> > something that's already on the roadmap at all?
> >
> > Thanks!
> >
> > Chang She