[ https://issues.apache.org/jira/browse/ARROW-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Clarke updated ARROW-12114:
---------------------------------
    Description: 
Ben:

Can you please confirm that we're aware and okay with the following API change? 
Thanks!

{code}
import pyarrow.dataset
import pyarrow.fs

path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
paths = [
    f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
    for year in range(2009, 2020)
    for month in range(1, 13)
    for part in range(101)
    if not (year == 2019 and month > 6)  # Data ends in 2019/06
    and not (year == 2010 and month == 3)  # Data is missing in 2010/03
]
partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
    field_names=["year", "month", "part"],
    infer_dictionary=True,
)
s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
dataset = pyarrow.dataset.dataset(
    paths,
    format="parquet",
    filesystem=s3,
    partitioning=partitioning,
    partition_base_dir=path_prefix,
)
year = pyarrow.dataset.field("year")
month = pyarrow.dataset.field("month")
part = pyarrow.dataset.field("part")
filter_expr = (year == "2011") & (month == 1) & (part == 2)
dataset.to_table(filter=filter_expr)
{code}

In Arrow 3.0, the above code executes without error.

On head[1], {{year == "2011"}} (which should really be {{year == 2011}}, without the quotes) raises the following exception.

{code}
pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{code}
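
For reference, here is a minimal sketch of the workaround on our side, assuming the intended behavior on head is that the filter literal must match the discovered partition field type (int32 here) rather than being implicitly cast from a string:

{code}
import pyarrow.dataset as ds

year = ds.field("year")
month = ds.field("month")
part = ds.field("part")

# Compare against integer literals so the expression matches the int32
# partition fields directly, with no string-to-int cast required.
filter_expr = (year == 2011) & (month == 1) & (part == 2)

# The old form, accepted on 3.0 but raising ArrowNotImplementedError on head:
# filter_expr = (year == "2011") & (month == 1) & (part == 2)
{code}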

This API change appears to have been introduced in ARROW-8919. Perhaps it was intentional; I just figured we should double-check. Thanks again!
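
In case it helps with triage, a quick sanity check (a sketch, continuing from the repro above) to confirm the types that partition discovery produced, since the kernel error reports the {{year}} values as int32:

{code}
# Print the dataset schema to see the discovered partition fields
# (year/month/part) and their inferred types.
print(dataset.schema)
{code}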

[1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}

  was:
Ben:

Can you please confirm that we're aware and okay with the following API change? 
Thanks!

{code}
import pyarrow.dataset

path_prefix = "ursa-labs-taxi-data-repartitioned-10k/"
paths = [
    f"ursa-labs-taxi-data-repartitioned-10k/{year}/{month:02}/{part:04}/data.parquet"
    for year in range(2009, 2020)
    for month in range(1, 13)
    for part in range(101)
    if not (year == 2019 and month > 6)  # Data ends in 2019/06
    and not (year == 2010 and month == 3)  # Data is missing in 2010/03
]
partitioning = pyarrow.dataset.DirectoryPartitioning.discover(
    field_names=["year", "month", "part"],
    infer_dictionary=True,
)
for source in self.get_sources(source):
    s3 = pyarrow.fs.S3FileSystem(region="us-east-2")
    dataset = pyarrow.dataset.dataset(
        paths,
        format="parquet",
        filesystem=s3,
        partitioning=partitioning,
        partition_base_dir=path_prefix,
    )
year = pyarrow.dataset.field("year")
month = pyarrow.dataset.field("month")
part = pyarrow.dataset.field("part")
filter_expr = (year == "2011") & (month == 1) & (part == 2)
dataset.to_table(filter=filter_expr)
{code}

In Arrow 3.0, the above code executes without error.

On head[1], {{year == "2011"}} (which should really be {{year == 2011}}, without the quotes) raises the following exception.

{code}
pyarrow.lib.ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{code}

This API change appears to have been introduced in ARROW-8919. Perhaps it was intentional; I just figured we should double-check. Thanks again!

[1] {{51c97799b8302466b9dfbb657dc23fd3f0cd8e61}}


> [C++] Dataset to table filter expression API change
> ---------------------------------------------------
>
>                 Key: ARROW-12114
>                 URL: https://issues.apache.org/jira/browse/ARROW-12114
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Diana Clarke
>            Assignee: Ben Kietzman
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
