[ https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Nizol updated ARROW-13578:
-------------------------------
Description:
When creating a partitioned dataset via the pandas DataFrame.to_parquet() method, partition columns are ostensibly cast to strings in the partition metadata. When reading specific partitions via the filters parameter in pandas.read_parquet(), string values must be used for filter operands _except when_ the partition column has an integer value.

Consider the following example:
{code:python}
import datetime

import pandas as pd

df = pd.DataFrame({
    "key1": ['0', '1', '2'],
    "key2": [0, 1, 2],
    "key3": ['a', 'b', 'c'],
    "key4": [1.1, 2.2, 3.3],
    "key5": [True, False, True],
    "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), datetime.date(2021, 6, 4)],
    "data": ["foo", "bar", "baz"]
})
df['key6'] = pd.to_datetime(df['key6'])
df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 'key4', 'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that partition keys have been cast to string, regardless of the original type:
{code:python}
import pyarrow.parquet as pq

ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
    print(f"{level.name}: {level.keys}")
{code}
Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 00:00:00']
{noformat}
Filtering the dataset on any of the non-integer partition keys with string-valued operands works as expected:
{code:python}
df2 = pd.read_parquet('./test.parquet', filters=[('key4', '=', '1.1'), ('key5', '=', 'True')])
df2.head()
{code}
Output:
{noformat}
  data key1  key2 key3 key4  key5                key6
0  foo    0     0    a  1.1  True 2021-06-02 00:00:00
{noformat}
However, filtering the dataset on either of the integer-valued partition keys with a string-valued operand raises an exception, *even when the original column's data type is string*:
{code:python}
df2 = pd.read_parquet('./test.parquet', filters=[('key1', '=', '1')])
df2.head()
{code}
{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{noformat}
It would be less surprising and more consistent if filter operands either (a) were always cast to string, or (b) always retained their original type.

Note, this issue may be related to ARROW-12114.

was: Same description, except that it previously read "partition keys have been cast to integer" where it now reads "cast to string".

> Inconsistent handling of integer-valued partitions in dataset filters API
> -------------------------------------------------------------------------
>
>                 Key: ARROW-13578
>                 URL: https://issues.apache.org/jira/browse/ARROW-13578
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Matt Nizol
>            Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)