[ 
https://issues.apache.org/jira/browse/ARROW-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger updated ARROW-18273:
----------------------------------
    Labels: Python extension-type  (was: )

> [Python] For extension types, compute kernels should default to storage types?
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-18273
>                 URL: https://issues.apache.org/jira/browse/ARROW-18273
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Chang She
>            Priority: Major
>              Labels: Python, extension-type
>
> Currently, compute kernels don't recognize extension types. If you define a 
> semantic type to indicate something like "this string column is an image 
> label", you cannot then call functions such as equal on that column.
> For example, take the LabelType from 
> [https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py]
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.compute as pc
> In [3]: class LabelType(pa.PyExtensionType):
> ...:
> ...:     def __init__(self):
> ...:         pa.PyExtensionType.__init__(self, pa.string())
> ...:
> ...:     def __reduce__(self):
> ...:         return LabelType, ()
> ...:
> In [4]: tbl = pa.Table.from_arrays(
> ...:     [pa.ExtensionArray.from_storage(LabelType(),
> ...:                                     pa.array(['cat', 'dog', 'person']))],
> ...:     names=['label'])
> In [5]: tbl.filter(pc.field('label') == 'cat')
> ---------------------------------------------------------------------------
> ArrowNotImplementedError Traceback (most recent call last)
> Cell In [5], line 1
> ----> 1 tbl.filter(pc.field('label') == 'cat')
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)
> {code}
> For query systems that push some of the compute down to Arrow (e.g., DuckDB), 
> this also makes it much harder for users to work with datasets that contain 
> extension types, because they don't know which functions will actually work.
> If the compute kernels instead defaulted to the storage type, the extension 
> system would be a lot easier to work with in Arrow.



