[jira] [Updated] (ARROW-13578) Inconsistent handling of integer-valued partitions in dataset filters API

Matt Nizol (Jira) Fri, 06 Aug 2021 09:52:44 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matt Nizol updated ARROW-13578:
-------------------------------
    Description: 
When creating a partitioned data set via the pandas.to_parquet() method, 
partition columns are ostensibly cast to strings in the partition metadata.  
When reading specific partitions via the filters parameter in 
pandas.read_parquet(), string values must be used for filter operands _except 
when_ the partition column has an integer value.  

Consider the following example:
{code:python}
import datetime
import pandas as pd

df = pd.DataFrame({
    "key1": ['0', '1', '2'], 
    "key2": [0, 1, 2],
    "key3": ['a', 'b', 'c'],
    "key4": [1.1, 2.2, 3.3],
    "key5": [True, False, True],
    "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
datetime.date(2021, 6, 4)],
    "data": ["foo", "bar", "baz"]
})

df['key6'] = pd.to_datetime(df['key6'])

df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 'key4', 
'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that 
partition keys have been cast to integer, regardless of the original type:
{code:python}
import pyarrow.parquet as pq
ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
    print(f"{level.name}: {level.keys}")
{code}

Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 
00:00:00']{noformat}

Filtering the dataset using any of the non-integer partition keys along with 
string-valued operands works as expected:

{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', 
'=', 'True')])
df2.head()
{code}

Output:
{noformat}
        data    key1    key2    key3    key4    key5    key6
0       foo     0       0       a       1.1     True    2021-06-02 00:00:00
{noformat}

However, filtering the dataset using either of the integer-valued partition 
keys with a string-valued operand raises an exception, *even when the original 
column's data type is string*:

{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
df2.head()
{code}

{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types 
(array[int32], scalar[string])
{noformat}

It would seem to be less surprising / more consistent if filter operands either 
(a) are always cast to string, or (b) always retain their original type.

Note, this issue may be related to ARROW-12114.

  was:
When creating a partitioned data set via pandas.to_parquet() method, partition 
columns are ostensibly cast to strings in the partition metadata.  When reading 
specific partitions via the filters parameter in pandas.read_parquet(), string 
values must be used for filter operands _except when_ the partition column has 
an integer value.  

Consider the following example:
{code:python}
import datetime
import pandas as pd

df = pd.DataFrame({
    "key1": ['0', '1', '2'], 
    "key2": [0, 1, 2],
    "key3": ['a', 'b', 'c'],
    "key4": [1.1, 2.2, 3.3],
    "key5": [True, False, True],
    "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
datetime.date(2021, 6, 4)],
    "data": ["foo", "bar", "baz"]
})

df['key6'] = pd.to_datetime(df['key6'])

df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 'key4', 
'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that 
partition keys have been cast to integer, regardless of the original type:
{code:python}
ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
    print(f"{level.name}: {level.keys}")
{code}

Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 
00:00:00']{noformat}

Filtering the dataset using any of the non-integer partition keys along with 
string-valued operands works as expected:

{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', 
'=', 'True')])
df2.head()
{code}

Output:
{noformat}
        data    key1    key2    key3    key4    key5    key6
0       foo     0       0       a       1.1     True    2021-06-02 00:00:00
{noformat}

However, filtering the dataset using either of the integer-valued partition 
keys with a string-valued operand raises an exception, *even when the original 
column's data type is string*:

{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
df2.head()
{code}

{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types 
(array[int32], scalar[string])
{noformat}

It would seem to be less surprising / more consistent if filter operands either 
(a) are always cast to string, or (b) always retain their original type.

Note, this issue may be related to ARROW-12114.


> Inconsistent handling of integer-valued partitions in dataset filters API
> -------------------------------------------------------------------------
>
>                 Key: ARROW-13578
>                 URL: https://issues.apache.org/jira/browse/ARROW-13578
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Matt Nizol
>            Priority: Minor
>
> When creating a partitioned data set via the pandas.to_parquet() method, 
> partition columns are ostensibly cast to strings in the partition metadata.  
> When reading specific partitions via the filters parameter in 
> pandas.read_parquet(), string values must be used for filter operands _except 
> when_ the partition column has an integer value.  
> Consider the following example:
> {code:python}
> import datetime
> import pandas as pd
> df = pd.DataFrame({
>     "key1": ['0', '1', '2'], 
>     "key2": [0, 1, 2],
>     "key3": ['a', 'b', 'c'],
>     "key4": [1.1, 2.2, 3.3],
>     "key5": [True, False, True],
>     "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
> datetime.date(2021, 6, 4)],
>     "data": ["foo", "bar", "baz"]
> })
> df['key6'] = pd.to_datetime(df['key6'])
> df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 
> 'key4', 'key5', 'key6'])
> {code}
> Reading into a ParquetDataset and inspecting the partition levels suggests 
> that partition keys have been cast to integer, regardless of the original 
> type:
> {code:python}
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset('./test.parquet')
> for level in ds.partitions.levels:
>     print(f"{level.name}: {level.keys}")
> {code}
> Output:
> {noformat}
> key1: ['0', '1', '2']
> key2: ['0', '1', '2']
> key3: ['a', 'b', 'c']
> key4: ['1.1', '2.2', '3.3']
> key5: ['True', 'False']
> key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 
> 00:00:00']{noformat}
> Filtering the dataset using any of the non-integer partition keys along with 
> string-valued operands works as expected:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', 
> '=', 'True')])
> df2.head()
> {code}
> Output:
> {noformat}
>       data    key1    key2    key3    key4    key5    key6
> 0     foo     0       0       a       1.1     True    2021-06-02 00:00:00
> {noformat}
> However, filtering the dataset using either of the integer-valued partition 
> keys with a string-valued operand raises an exception, *even when the 
> original column's data type is string*:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
> df2.head()
> {code}
> {noformat}
> ArrowNotImplementedError: Function equal has no kernel matching input types 
> (array[int32], scalar[string])
> {noformat}
> It would seem to be less surprising / more consistent if filter operands 
> either (a) are always cast to string, or (b) always retain their original 
> type.
> Note, this issue may be related to ARROW-12114.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-13578) Inconsistent handling of integer-valued partitions in dataset filters API

Reply via email to