Vadym Dytyniak created ARROW-18269:
--------------------------------------

             Summary: Slash character in partition value handling
                 Key: ARROW-18269
                 URL: https://issues.apache.org/jira/browse/ARROW-18269
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 10.0.0
            Reporter: Vadym Dytyniak


 

The example below shows that pyarrow does not correctly handle a partition 
value that contains '/':
{code:java}
import pandas as pd
import pyarrow as pa

from pyarrow import dataset as ds

df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]

df = pa.concat_tables(tables).to_pandas()

print(df.head()){code}
Actual result:
 
{code:java}
   value instrument_id
0      1             A
1      2             B {code}
Expected behaviour:
Option 1: Result should be:
{code:java}
   value instrument_id
0      1             A/Z
1      2             B {code}
Option 2: An error should be raised when a partition value contains '/'.
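
Until one of these options is implemented, a caller could approximate Option 2 by checking the partition column before calling ds.write_dataset. The following is only a minimal sketch: check_partition_values is a hypothetical helper, not part of pyarrow, and it assumes that '/' is the only problematic character.

```python
from typing import Iterable


def check_partition_values(values: Iterable[object], column: str) -> None:
    """Hypothetical pre-write check: raise if any value contains '/'."""
    # Collect the distinct offending values so the error message is readable.
    bad = sorted({str(v) for v in values if "/" in str(v)})
    if bad:
        raise ValueError(
            f"partition column {column!r} has values containing '/': {bad}"
        )


# Values taken from the 'instrument_id' column in the example above.
instrument_ids = ["A/Z", "B"]
try:
    check_partition_values(instrument_ids, "instrument_id")
except ValueError as exc:
    message = str(exc)
    print(message)
```

In the example from this report, the check would be run on df['instrument_id'] before the write, so the bad value is rejected instead of being silently truncated on read.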



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
