[ 
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632043#comment-17632043
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-18269:
--------------------------------------------------

[~dytyniak] Imagine the following situation

{code:python}
import tempfile
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyarrow import dataset as ds

# Stand-in for the pytest tmpdir fixture so the example runs standalone.
path = Path(tempfile.mkdtemp()) / "slash-writer-x"

df = pd.DataFrame({
    'exp_id': [1, 2, 1, 3, 6],
    'exp_meta': ["experiment/A/f.csv", "experiment/B/f.csv",
                 "experiment/A/d.csv", "experiment/C/k.csv",
                 "experiment/M/i.csv"],
})

dt_table = pa.Table.from_pandas(df)

# Partition on a column whose values contain '/'.
ds.write_dataset(
    data=dt_table,
    base_dir=path,
    format='parquet',
    partitioning=['exp_meta'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source=path,
    format='parquet',
    partitioning='hive',
    # int64 matches the type pandas writes for exp_id.
    schema=pa.schema([pa.field("exp_id", pa.int64()),
                      pa.field("exp_meta", pa.utf8())]),
).to_table()

print(table)

df = pa.concat_tables([table]).to_pandas()

print(df.head())
{code}
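
Listing what the writer actually puts on disk makes the problem visible: because the partition value itself contains '/', the hive directory segment is split into extra path levels. A minimal sketch, reusing the {{path}} variable from the snippet above:

{code:python}
# Walk the dataset root; the 'exp_meta=...' hive segment is broken into
# nested directories by the '/' characters inside the partition value.
for p in sorted(path.rglob("*")):
    print(p.relative_to(path))
{code}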

If we go for option 2, users won't be able to handle such a situation directly. We could 
suggest they handle it in a different way, but that would require them to encode and 
decode the partition values (URI-style). If this is billions of rows, that would be 
really expensive.
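
To make that concrete, this is roughly the workaround I mean: percent-encode the partition values before writing and decode them after reading. A rough sketch only, reusing {{dt_table}} from the snippet above and eliding the write/read round trip:

{code:python}
from urllib.parse import quote, unquote

# Encode: replace the partition column with percent-encoded values so no
# '/' reaches the file system. This is an extra pass over the column.
idx = dt_table.schema.get_field_index("exp_meta")
encoded_col = pa.array(
    [quote(v, safe="") for v in dt_table.column("exp_meta").to_pylist()]
)
encoded_table = dt_table.set_column(idx, "exp_meta", encoded_col)

# ... ds.write_dataset(encoded_table, ...) and read back into `table` ...

# Decode: undo the encoding after reading. For billions of rows these two
# Python-level passes over the column are the cost mentioned above.
decoded = [unquote(v) for v in table.column("exp_meta").to_pylist()]
{code}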

WDYT? 

cc [~westonpace]

> [C++] Slash character in partition value handling
> -------------------------------------------------
>
>                 Key: ARROW-18269
>                 URL: https://issues.apache.org/jira/browse/ARROW-18269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Vadym Dytyniak
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: good-first-issue
>
>  
> Provided example shows that pyarrow does not handle partition value that 
> contains '/' correctly:
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
>     'value': [1, 2],
>     'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
>     data=pa.Table.from_pandas(df),
>     base_dir='data',
>     format='parquet',
>     partitioning=['instrument_id'],
>     partitioning_flavor='hive',
> )
> table = ds.dataset(
>     source='data',
>     format='parquet',
>     partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1             A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>  
>  
>  



