Diana Clarke created ARROW-12114:
------------------------------------

             Summary: Dataset to table filter expression API change
                 Key: ARROW-12114
                 URL: https://issues.apache.org/jira/browse/ARROW-12114
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Diana Clarke

Ben: Can you please confirm that we're aware of, and okay with, the following API change? Thanks!

{code}
import pyarrow.dataset
import pyarrow.fs

path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
paths = [
    f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
    for year in range(2009, 2020)
    for month in range(1, 13)
    for part in range(101)
    if not (year == 2019 and month > 6)  # Data ends in 2019/06
    and not (year == 2010 and month == 3)  # Data is missing in 2010/03
]

# Partition fields are discovered from the directory names under path_prefix.
partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
    field_names=["year", "month", "part"],
    infer_dictionary=True,
)

s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
dataset = pyarrow.dataset.dataset(
    paths,
    format="parquet",
    filesystem=s3,
    partitioning=partitioning,
    partition_base_dir=path_prefix,
)

year = pyarrow.dataset.field("year")
month = pyarrow.dataset.field("month")
part = pyarrow.dataset.field("part")

# Note the string literal "2011"; month and part are compared against integers.
filter_expr = (year == "2011") & (month == 1) & (part == 2)
dataset.to_table(filter=filter_expr)
{code}

In Arrow 3.0, the above code executes without error. On head, {{year == "2011"}} (which should have been {{year == 2011}}, without quotes) raises the following exception:

{code}
pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{code}

This API change appears to have been introduced in ARROW-8919. Perhaps it was intentional, but I figured we should double check.

Thanks again!
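
For anyone hitting the same error, here is a rough, standard-library-only sketch of why the literal's type matters. It is not pyarrow's actual inference code: {{parse_partition_path}} and {{coerce_literal}} are hypothetical helpers that only assume all-digit directory segments end up as integer partition values, as the {{int32}} in the error message suggests for {{year}}.

```python
def parse_partition_path(relative_path, field_names):
    """Hypothetical helper: map directory segments to typed partition values."""
    segments = relative_path.strip("/").split("/")
    values = {}
    # zip() stops after the last field name, so the file name is ignored.
    for name, segment in zip(field_names, segments):
        # All-digit segments become integers (assumption mirroring the int32
        # type in the error message); anything else stays a string.
        values[name] = int(segment) if segment.isdecimal() else segment
    return values


def coerce_literal(value):
    """Hypothetical helper: turn an all-digit string literal into an int."""
    if isinstance(value, str) and value.isdecimal():
        return int(value)
    return value


values = parse_partition_path(
    "2011/01/0002/data.parquet", ["year", "month", "part"]
)

# The string literal does not equal the integer partition value.
print(values["year"] == "2011")  # False
# After coercion the comparison is integer against integer.
print(values["year"] == coerce_literal("2011"))  # True
```

In the snippet from the report, the equivalent fix is simply to write {{year == 2011}} so that both sides of the comparison are integers.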