Hello,
When upgrading from Arrow 3 to 4, I noticed a change in behavior with dataset
expressions.
Here's a barebones example.
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.Table.from_pydict(
... {"id": [1, 2, 3, 4, 5],
... "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1),
date(2021, 1, 1), date(2021, 1, 1)],
... "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2, 15),
date(2021, 2, 15), date(2021, 3, 1)]
... }
... )
>>> tbl = tbl.cast(pa.schema([("id", pa.int64()), ("valid_from",
>>> pa.timestamp("ns")), ("valid_to", pa.timestamp("ns"))]))
>>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>>> data = ds.dataset("/home/tzimmerman/test.parquet")
>>> expr = (ds.field("valid_from") <= "2021-01-01") & ("2021-01-01" <
>>> ds.field("valid_to"))
With Python 3.8 & Arrow 3.0.0, it appears the date literal ("2021-01-1") is
cast to a compatible dtype and the expression works as expected.
>>> data.to_table(filter=expr)
pyarrow.Table
id: int64
valid_from: timestamp[us]
valid_to: timestamp[us]
With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.
>>> data.to_table(filter=expr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/_dataset.pyx", line 458, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 363, in pyarrow._dataset.Dataset._scanner
File "pyarrow/_dataset.pyx", line 2786, in
pyarrow._dataset.Scanner.from_dataset
File "pyarrow/_dataset.pyx", line 2683, in pyarrow._dataset._populate_builder
File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
File "pyarrow/error.pxi", line 141, in
pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel
matching input types (array[timestamp[us]], scalar[string])
I just want to confirm that this Is expected, and the implicit casting/coercing
is now unsupported?
Best,
Troy