[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Diana Clarke updated ARROW-12114:
---------------------------------
    Description:
Ben:

Can you please confirm that we're aware and okay with the following API change? Thanks!

{code}
import pyarrow.dataset

path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
paths = [
    f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
    for year in range(2009, 2020)
    for month in range(1, 13)
    for part in range(101)
    if not (year == 2019 and month > 6)  # Data ends in 2019/06
    and not (year == 2010 and month == 3)  # Data is missing in 2010/03
]
partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
    field_names=["year", "month", "part"],
    infer_dictionary=True,
)
s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
dataset = pyarrow.dataset.dataset(
    paths,
    format="parquet",
    filesystem=s3,
    partitioning=partitioning,
    partition_base_dir=path_prefix,
)
year = pyarrow.dataset.field("year")
month = pyarrow.dataset.field("month")
part = pyarrow.dataset.field("part")
filter_expr = (year == "2011") & (month == 1) & (part == 2)
dataset.to_table(filter=filter_expr)
{code}

In arrow 3.0, the above code executes without error.

On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes) raises the following exception.

{code}
pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{code}

This API change appears to have been introduced in ARROW-8919. Perhaps it was intentional, just figured we should double check. Thanks again!

[1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}

  was:
Ben:

Can you please confirm that we're aware and okay with the following API change? Thanks!
{code}
import pyarrow.dataset

path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
paths = [
    f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
    for year in range(2009, 2020)
    for month in range(1, 13)
    for part in range(101)
    if not (year == 2019 and month > 6)  # Data ends in 2019/06
    and not (year == 2010 and month == 3)  # Data is missing in 2010/03
]
partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
    field_names=["year", "month", "part"],
    infer_dictionary=True,
)
for source in self.get_sources(source):
s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
dataset = pyarrow.dataset.dataset(
    paths,
    format="parquet",
    filesystem=s3,
    partitioning=partitioning,
    partition_base_dir=path_prefix,
)
year = pyarrow.dataset.field("year")
month = pyarrow.dataset.field("month")
part = pyarrow.dataset.field("part")
filter_expr = (year == "2011") & (month == 1) & (part == 2)
dataset.to_table(filter=filter_expr)
{code}

In arrow 3.0, the above code executes without error.

On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes) raises the following exception.

{code}
pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{code}

This API change appears to have been introduced in ARROW-8919. Perhaps it was intentional, just figured we should double check. Thanks again!

[1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}


> [C++] Dataset to table filter expression API change
> ---------------------------------------------------
>
>                 Key: ARROW-12114
>                 URL: https://issues.apache.org/jira/browse/ARROW-12114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Diana Clarke
>            Assignee: Ben Kietzman
>            Priority: Major
>
> Ben:
> Can you please confirm that we're aware and okay with the following API
> change? Thanks!
> {code}
> import pyarrow.dataset
> path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
> paths = [
>     f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
>     for year in range(2009, 2020)
>     for month in range(1, 13)
>     for part in range(101)
>     if not (year == 2019 and month > 6)  # Data ends in 2019/06
>     and not (year == 2010 and month == 3)  # Data is missing in 2010/03
> ]
> partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
>     field_names=["year", "month", "part"],
>     infer_dictionary=True,
> )
> s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
> dataset = pyarrow.dataset.dataset(
>     paths,
>     format="parquet",
>     filesystem=s3,
>     partitioning=partitioning,
>     partition_base_dir=path_prefix,
> )
> year = pyarrow.dataset.field("year")
> month = pyarrow.dataset.field("month")
> part = pyarrow.dataset.field("part")
> filter_expr = (year == "2011") & (month == 1) & (part == 2)
> dataset.to_table(filter=filter_expr)
> {code}
> In arrow 3.0, the above code executes without error.
> On head[1], {{year == "2011"}}, which should be {{year == 2011}} (no quotes)
> raises the following exception.
> {code}
> pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching
> input types (array[int32], scalar[string])
> {code}
> This API change appears to have been introduced in ARROW-8919. Perhaps it was
> intentional, just figured we should double check. Thanks again!
> [1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
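
For callers affected by the stricter comparison described above, one possible workaround is to coerce filter literals to the partition field types before building the expression. The sketch below is only an illustration under stated assumptions: {{normalize_filter_values}} and {{PARTITION_TYPES}} are hypothetical names that are not part of pyarrow, and the integer field types are assumed from the {{array[int32]}} in the error message.

```python
# Hypothetical helper, not part of pyarrow: convert filter literals to the
# Python type of each partition field before building a dataset expression.

# Assumed partition field types, based on the int32 type in the error message.
PARTITION_TYPES = {"year": int, "month": int, "part": int}


def normalize_filter_values(values, field_types=PARTITION_TYPES):
    """Return a copy of `values` with each literal converted to its field type.

    Raises ValueError if a literal cannot be converted (for example "abc"),
    and KeyError if a field has no declared type.
    """
    normalized = {}
    for name, value in values.items():
        target_type = field_types[name]
        # Leave values of the right type alone; convert everything else.
        if type(value) is target_type:
            normalized[name] = value
        else:
            normalized[name] = target_type(value)
    return normalized


# The quoted year from the report becomes an integer; the others are unchanged.
values = normalize_filter_values({"year": "2011", "month": 1, "part": 2})
print(values)
```

The normalized values could then be passed to the existing expression API, for example {{pyarrow.dataset.field("year") == values["year"]}}, so that the comparison is made against an integer rather than a string.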