Hi Neal,

Thanks for the explanation. I use the ast module to parse expressions and map them to the dataset API, and it never dawned on me until the upgrade what I was getting for free. 😊 It's easy enough to handle a cast function of our own in the ast visitor.
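Roughly along these lines (just a sketch; the cast() name, the supported target types, and the visitor shape are illustrative rather than our actual implementation):

import ast
from datetime import date

import pyarrow.dataset as ds

# Target types a cast() call in the filter grammar could map to (illustrative).
_CASTS = {
    "date": date.fromisoformat,
    "int": int,
    "float": float,
}

class FilterVisitor(ast.NodeVisitor):
    # Comparison operators the grammar supports (illustrative subset).
    _OPS = {
        ast.Lt: lambda l, r: l < r,
        ast.LtE: lambda l, r: l <= r,
        ast.Eq: lambda l, r: l == r,
    }

    def visit_Compare(self, node):
        # Assumes expressions of the form `column <op> literal_or_cast(...)`.
        left = ds.field(node.left.id)
        right = self.visit(node.comparators[0])
        return self._OPS[type(node.ops[0])](left, right)

    def visit_Call(self, node):
        # cast("2021-01-01", "date") -> datetime.date(2021, 1, 1), so the
        # literal reaches Arrow already typed instead of as a string.
        if node.func.id == "cast":
            value, target = (arg.value for arg in node.args)
            return _CASTS[target](value)
        raise ValueError(f"unsupported function: {node.func.id}")

    def visit_Constant(self, node):
        return node.value

# usage (hypothetical filter string):
# tree = ast.parse('valid_from <= cast("2021-01-01", "date")', mode="eval")
# expr = FilterVisitor().visit(tree.body)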
Best,
Troy

From: Neal Richardson <[email protected]>
Sent: Friday, June 4, 2021 4:14 PM
To: [email protected]
Subject: Re: Dataset filter expression in Arrow 3 vs 4

Implicit casting is still supported, it's just different in 4.0. AFAIK the clearest statement we have of the change is in the R changelog [1]:

"Arrow C++ compute functions now do more systematic type promotion when called on data with different types (e.g. int32 and float64). Previously, Scalars in an expression were always cast to match the type of the corresponding Array, so this new type promotion enables, among other things, operations on two columns (Arrays) in a dataset. As a side effect, some comparisons that worked in prior versions are no longer supported: for example, dplyr::filter(arrow_dataset, string_column == 3) will error with a message about the type mismatch between the numeric 3 and the string type of string_column."

So in your case, you might use `date(2021, 1, 1)` instead of the string "2021-01-01".

Neal

[1]: https://arrow.apache.org/docs/r/news/index.html

On Fri, Jun 4, 2021 at 1:27 PM Troy Zimmerman <[email protected]> wrote:

Hello,

When upgrading from Arrow 3 to 4, I noticed a change in behavior with dataset expressions. Here's a barebones example.

>>> from datetime import date
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> tbl = pa.Table.from_pydict(
...     {"id": [1, 2, 3, 4, 5],
...      "valid_from": [date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1), date(2021, 1, 1)],
...      "valid_to": [date(2021, 1, 31), date(2021, 1, 31), date(2021, 2, 15), date(2021, 2, 15), date(2021, 3, 1)]
...     }
... )
>>> tbl = tbl.cast(pa.schema([("id", pa.int64()), ("valid_from", pa.timestamp("ns")), ("valid_to", pa.timestamp("ns"))]))
>>> pq.write_table(tbl, "/home/tzimmerman/test.parquet")
>>> data = ds.dataset("/home/tzimmerman/test.parquet")
>>> expr = (ds.field("valid_from") <= "2021-01-01") & ("2021-01-01" < ds.field("valid_to"))

With Python 3.8 & Arrow 3.0.0, it appears the date literal ("2021-01-01") is cast to a compatible dtype and the expression works as expected.

>>> data.to_table(filter=expr)
pyarrow.Table
id: int64
valid_from: timestamp[us]
valid_to: timestamp[us]

With Python 3.8 & Arrow 4.0.0, there is a kernel mismatch error.
>>> data.to_table(filter=expr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 458, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 363, in pyarrow._dataset.Dataset._scanner
  File "pyarrow/_dataset.pyx", line 2786, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2683, in pyarrow._dataset._populate_builder
  File "pyarrow/_dataset.pyx", line 737, in pyarrow._dataset._bind
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 118, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Function less_equal has no kernel matching input types (array[timestamp[us]], scalar[string])

I just want to confirm that this is expected, and that the implicit casting/coercing is now unsupported?

Best,
Troy
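For reference, with the typed-literal approach Neal suggests, the filter in the example above would look roughly like this (a sketch against the same test.parquet dataset, not output copied from a session):

from datetime import date

# Build the literals as date objects so Arrow 4's type promotion can match
# them against the timestamp columns, instead of passing bare strings.
expr = (ds.field("valid_from") <= date(2021, 1, 1)) & (date(2021, 1, 1) < ds.field("valid_to"))
data.to_table(filter=expr)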
