Hi Neal,

Thanks for the explanation. I use the ast module to parse expressions and map
them to the dataset API, and it never dawned on me until the upgrade how much
implicit casting I was getting for free. 😊 It's easy enough to handle a cast
function of our own in the ast visitor.
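
For the archives, here's a rough sketch of the kind of visitor I mean. It is not
our actual code; FilterBuilder, the schema lookup, and the operator table are
just illustrative, and it only handles single comparisons:

import ast
import operator

import pyarrow as pa
import pyarrow.dataset as ds

# Comparison operators the sketch translates; anything else falls through.
_OPS = {
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
}


class FilterBuilder(ast.NodeVisitor):
    """Turn simple comparisons like 'valid_from <= "2021-01-01"' into
    pyarrow.dataset expressions, explicitly casting bare literals to the
    field's Arrow type instead of relying on implicit coercion."""

    def __init__(self, schema: pa.Schema):
        self.schema = schema

    def build(self, source: str) -> ds.Expression:
        return self.visit(ast.parse(source, mode="eval").body)

    def visit_Compare(self, node: ast.Compare) -> ds.Expression:
        if len(node.ops) != 1:
            raise NotImplementedError("chained comparisons are not handled")
        left = self.visit(node.left)
        right = self.visit(node.comparators[0])
        # The explicit cast: when a field is compared against a plain Python
        # literal, cast the literal to the field's type before binding.
        if isinstance(node.left, ast.Name) and isinstance(
            node.comparators[0], ast.Constant
        ):
            field_type = self.schema.field(node.left.id).type
            right = pa.scalar(right).cast(field_type)
        return _OPS[type(node.ops[0])](left, right)

    def visit_Name(self, node: ast.Name) -> ds.Expression:
        return ds.field(node.id)

    def visit_Constant(self, node: ast.Constant):
        return node.value

With the schema from the example below, something like
FilterBuilder(data.schema).build('valid_from <= "2021-01-01"') should produce
an expression whose literal is already a timestamp, so the 4.0 kernel lookup
succeeds.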

Best,
Troy

From: Neal Richardson <[email protected]>
Sent: Friday, June 4, 2021 4:14 PM
To: [email protected]
Subject: Re: Dataset filter expression in Arrow 3 vs 4

Implicit casting is still supported; it's just different in 4.0. AFAIK the
clearest statement we have of the change is in the R changelog [1]:

"Arrow C++ compute functions now do more systematic type promotion when called 
on data with different types (e.g. int32 and float64). Previously, Scalars in 
an expression were always cast to match the type of the corresponding Array,
so this new type promotion enables, among other things, operations on two 
columns (Arrays) in a dataset. As a side effect, some comparisons that worked 
in prior versions are no longer supported: for example, 
dplyr::filter(arrow_dataset, string_column == 3) will error with a message 
about the type mismatch between the numeric 3 and the string type of 
string_column."

So in your case, you might use `date(2021, 1, 1)` instead of the string 
"2021-01-01".

Neal

[1]: https://arrow.apache.org/docs/r/news/index.html

On Fri, Jun 4, 2021 at 1:27 PM Troy Zimmerman <[email protected]> wrote:
Hello,

When upgrading from Arrow 3 to 4, I noticed a change in behavior with dataset 
expressions.

Here’s a barebones example.

>>> from datetime import date
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.Table.from_pydict(
...     {"id": [1, 2, 3, 4, 5],
...      "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1),
...                     date(2021, 1, 1), date(2021, 1, 1)],
...      "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2, 15),
...                   date(2021, 2, 15), date(2021, 3, 1)]
...     }
... )
>>> tbl = tbl.cast(pa.schema([("id", pa.int64()),
...                           ("valid_from", pa.timestamp("ns")),
...                           ("valid_to", pa.timestamp("ns"))]))
>>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>>> data = ds.dataset("/home/tzimmerman/test.parquet")
>>> expr = ((ds.field("valid_from") <= "2021-01-01")
...         & ("2021-01-01" < ds.field("valid_to")))

With Python 3.8 & Arrow 3.0.0, it appears the date literal (“2021-01-01”) is
cast to a compatible type and the expression works as expected.

>>> data.to_table(filter=expr)
pyarrow.Table
id: int64
valid_from: timestamp[us]
valid_to: timestamp[us]

With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.

>>> data.to_table(filter=expr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 458, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 363, in pyarrow._dataset.Dataset._scanner
  File "pyarrow/_dataset.pyx", line 2786, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2683, in pyarrow._dataset._populate_builder
  File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel matching input types (array[timestamp[us]], scalar[string])

I just want to confirm that this is expected, and that this kind of implicit
casting/coercion is no longer supported?

Best,
Troy
