[
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt Nizol updated ARROW-13578:
-------------------------------
Description:
When creating a partitioned data set via the pandas.to_parquet() method,
partition columns are ostensibly cast to strings in the partition metadata.
When reading specific partitions via the filters parameter in
pandas.read_parquet(), string values must be used for filter operands _except
when_ the partition column has an integer value.
Consider the following example:
{code:python}
import datetime
import pandas as pd
df = pd.DataFrame({
"key1": ['0', '1', '2'],
"key2": [0, 1, 2],
"key3": ['a', 'b', 'c'],
"key4": [1.1, 2.2, 3.3],
"key5": [True, False, True],
"key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3),
datetime.date(2021, 6, 4)],
"data": ["foo", "bar", "baz"]
})
df['key6'] = pd.to_datetime(df['key6'])
df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 'key4',
'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that
partition keys have been cast to integer, regardless of the original type:
{code:python}
import pyarrow.parquet as pq
ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
print(f"{level.name}: {level.keys}")
{code}
Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04
00:00:00']{noformat}
Filtering the dataset using any of the non-integer partition keys along with
string-valued operands works as expected:
{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5',
'=', 'True')])
df2.head()
{code}
Output:
{noformat}
data key1 key2 key3 key4 key5 key6
0 foo 0 0 a 1.1 True 2021-06-02 00:00:00
{noformat}
However, filtering the dataset using either of the integer-valued partition
keys with a string-valued operand raises an exception, *even when the original
column's data type is string*:
{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
df2.head()
{code}
{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types
(array[int32], scalar[string])
{noformat}
It would seem to be less surprising / more consistent if filter operands either
(a) are always cast to string, or (b) always retain their original type.
Note, this issue may be related to ARROW-12114.
was:
When creating a partitioned data set via pandas.to_parquet() method, partition
columns are ostensibly cast to strings in the partition metadata. When reading
specific partitions via the filters parameter in pandas.read_parquet(), string
values must be used for filter operands _except when_ the partition column has
an integer value.
Consider the following example:
{code:python}
import datetime
import pandas as pd
df = pd.DataFrame({
"key1": ['0', '1', '2'],
"key2": [0, 1, 2],
"key3": ['a', 'b', 'c'],
"key4": [1.1, 2.2, 3.3],
"key5": [True, False, True],
"key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3),
datetime.date(2021, 6, 4)],
"data": ["foo", "bar", "baz"]
})
df['key6'] = pd.to_datetime(df['key6'])
df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 'key4',
'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that
partition keys have been cast to integer, regardless of the original type:
{code:python}
ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
print(f"{level.name}: {level.keys}")
{code}
Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04
00:00:00']{noformat}
Filtering the dataset using any of the non-integer partition keys along with
string-valued operands works as expected:
{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5',
'=', 'True')])
df2.head()
{code}
Output:
{noformat}
data key1 key2 key3 key4 key5 key6
0 foo 0 0 a 1.1 True 2021-06-02 00:00:00
{noformat}
However, filtering the dataset using either of the integer-valued partition
keys with a string-valued operand raises an exception, *even when the original
column's data type is string*:
{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
df2.head()
{code}
{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types
(array[int32], scalar[string])
{noformat}
It would seem to be less surprising / more consistent if filter operands either
(a) are always cast to string, or (b) always retain their original type.
Note, this issue may be related to ARROW-12114.
> Inconsistent handling of integer-valued partitions in dataset filters API
> -------------------------------------------------------------------------
>
> Key: ARROW-13578
> URL: https://issues.apache.org/jira/browse/ARROW-13578
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Matt Nizol
> Priority: Minor
>
> When creating a partitioned data set via the pandas.to_parquet() method,
> partition columns are ostensibly cast to strings in the partition metadata.
> When reading specific partitions via the filters parameter in
> pandas.read_parquet(), string values must be used for filter operands _except
> when_ the partition column has an integer value.
> Consider the following example:
> {code:python}
> import datetime
> import pandas as pd
> df = pd.DataFrame({
> "key1": ['0', '1', '2'],
> "key2": [0, 1, 2],
> "key3": ['a', 'b', 'c'],
> "key4": [1.1, 2.2, 3.3],
> "key5": [True, False, True],
> "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3),
> datetime.date(2021, 6, 4)],
> "data": ["foo", "bar", "baz"]
> })
> df['key6'] = pd.to_datetime(df['key6'])
> df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3',
> 'key4', 'key5', 'key6'])
> {code}
> Reading into a ParquetDataset and inspecting the partition levels suggests
> that partition keys have been cast to integer, regardless of the original
> type:
> {code:python}
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset('./test.parquet')
> for level in ds.partitions.levels:
> print(f"{level.name}: {level.keys}")
> {code}
> Output:
> {noformat}
> key1: ['0', '1', '2']
> key2: ['0', '1', '2']
> key3: ['a', 'b', 'c']
> key4: ['1.1', '2.2', '3.3']
> key5: ['True', 'False']
> key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04
> 00:00:00']{noformat}
> Filtering the dataset using any of the non-integer partition keys along with
> string-valued operands works as expected:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5',
> '=', 'True')])
> df2.head()
> {code}
> Output:
> {noformat}
> data key1 key2 key3 key4 key5 key6
> 0 foo 0 0 a 1.1 True 2021-06-02 00:00:00
> {noformat}
> However, filtering the dataset using either of the integer-valued partition
> keys with a string-valued operand raises an exception, *even when the
> original column's data type is string*:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
> df2.head()
> {code}
> {noformat}
> ArrowNotImplementedError: Function equal has no kernel matching input types
> (array[int32], scalar[string])
> {noformat}
> It would seem to be less surprising / more consistent if filter operands
> either (a) are always cast to string, or (b) always retain their original
> type.
> Note, this issue may be related to ARROW-12114.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)