[ 
https://issues.apache.org/jira/browse/ARROW-4538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Thiel updated ARROW-4538:
-----------------------------------
    Description: 
When using {{pa.Table.from_pandas()}} with preserve_index=True and 
dataframe.index.name!=None the prefix {{__index_level_}} is not added to the 
respective schema name. This breaks {{write_to_dataset}} with active partition 
columns.
{code}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)

df.index.name='ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
                    partition_cols=['partition_column'],
                    preserve_index=True
                   )
{code}
Removing {{df.index.name='ID'}} works. Also disabling {{partition_cols}} in 
{{write_to_dataset}} works.

  was:
When using {{pa.Table.from_pandas()}} with preserve_index=True and 
dataframe.index.name!=None the prefix {{__index_level_}} is not added to the 
respective schema name. This breaks {{write_to_dataset}} with active partition 
columns.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import os
import shutil
import pandas as pd
import numpy as np

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df['arrays'] = pd.Series(arrays)

df.index.name='ID'

table = pa.Table.from_pandas(df, preserve_index=True)
print(table.schema.names)

pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
                    partition_cols=['partition_column'],
                    preserve_index=True
                   )
{code}
Removing {{df.index.name='ID'}} works. Also disabling {{partition_cols}} in 
{{write_to_dataset}} works.


> [PYTHON] write_to_dataset() breaks with dataframe with valid index name
> -----------------------------------------------------------------------
>
>                 Key: ARROW-4538
>                 URL: https://issues.apache.org/jira/browse/ARROW-4538
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>            Reporter: Christian Thiel
>            Priority: Major
>
> When using {{pa.Table.from_pandas()}} with preserve_index=True and 
> dataframe.index.name!=None the prefix {{__index_level_}} is not added to the 
> respective schema name. This breaks {{write_to_dataset}} with active 
> partition columns.
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import os
> import shutil
> import pandas as pd
> import numpy as np
> PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
> if os.path.exists(PATH_PYARROW_MANUAL):
>     shutil.rmtree(PATH_PYARROW_MANUAL)
> os.mkdir(PATH_PYARROW_MANUAL)
> arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
> df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
> df['arrays'] = pd.Series(arrays)
> df.index.name='ID'
> table = pa.Table.from_pandas(df, preserve_index=True)
> print(table.schema.names)
> pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, 
>                     partition_cols=['partition_column'],
>                     preserve_index=True
>                    )
> {code}
> Removing {{df.index.name='ID'}} works. Also disabling {{partition_cols}} in 
> {{write_to_dataset}} works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to