[ https://issues.apache.org/jira/browse/ARROW-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Miles Granger updated ARROW-18273:
----------------------------------
    Labels: Python extension-type  (was: )

> [Python] For extension types, compute kernels should default to storage types?
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-18273
>                 URL: https://issues.apache.org/jira/browse/ARROW-18273
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Chang She
>            Priority: Major
>              Labels: Python, extension-type
>
> Currently, compute kernels don't recognize extension types. If you define
> semantic types to indicate things like "this string column is an image
> label", you then cannot run functions such as equal on that column.
> For example, take the LabelType from
> [https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py]
> {code:python}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.compute as pc
> In [3]: class LabelType(pa.PyExtensionType):
>    ...:
>    ...:     def __init__(self):
>    ...:         pa.PyExtensionType.__init__(self, pa.string())
>    ...:
>    ...:     def __reduce__(self):
>    ...:         return LabelType, ()
>    ...:
> In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), pa.array(['cat', 'dog', 'person']))], names=['label'])
> In [5]: tbl.filter(pc.field('label') == 'cat')
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> Cell In [5], line 1
> ----> 1 tbl.filter(pc.field('label') == 'cat')
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
> File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)
> {code}
> For query systems that push some of the compute down to Arrow (e.g., DuckDB),
> this also makes datasets with extension types much harder to work with,
> because users cannot tell in advance which functions will actually work.
> If the compute kernels instead defaulted to the storage type, the extension
> system would be a lot easier to work with in Arrow.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)