Naga created ARROW-6114:
---------------------------

             Summary: Datatypes are not preserved when a pandas dataframe 
partitioned and saved as parquet file using pyarrow
                 Key: ARROW-6114
                 URL: https://issues.apache.org/jira/browse/ARROW-6114
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: Python 3.7.3
pyarrow 0.14.1
            Reporter: Naga


h3. Data types are not preserved when a pandas DataFrame is *partitioned* and 
saved as a Parquet dataset using pyarrow; they are preserved when the DataFrame 
is not partitioned.

*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
{code:java}
# Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols,
                    preserve_index=False)

# Load the partitioned Parquet dataset back from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object
{code}
From the above output, we can see that the data type of age is int64 in the 
original pandas data frame, but it changed to category after the dataset was 
saved locally and loaded back.
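A likely reason for the change is that a partitioned write stores the partition column's values in directory names (for example {{age=77/}}) rather than inside the data files, so the original dtype has to be re-inferred from text when the dataset is read back. The following is a minimal, standard-library-only sketch of that idea, not pyarrow's actual implementation; the helper name {{parse_partition_keys}} and the file name {{part-0.parquet}} are hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_keys(file_path):
    """Return {column: raw string value} for each key=value directory in the path."""
    keys = {}
    for part in PurePosixPath(file_path).parent.parts:
        if "=" in part:
            name, _, value = part.partition("=")
            keys[name] = value
    return keys


# Hypothetical file path, shaped like the layout a partitioned write produces.
example_path = "test/age=77/part-0.parquet"

# The partition value comes back as text; the int64 dtype is not recorded in the path.
raw = parse_partition_keys(example_path)
print(raw)

# Restoring the original type requires an explicit cast by the reader.
restored = {name: int(value) for name, value in raw.items()}
print(restored)
```

Under this assumption, a possible workaround on the pandas side would be to cast the loaded column back explicitly after reading, since the directory name alone does not carry the dtype.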
*Case 2: Non-partitioned dataset - Data types are preserved*
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without '
      'partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Load the non-partitioned Parquet dataset back from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object
{code}

*Versions*
 * Python 3.7.3
 * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
