Naga created ARROW-6114:
---------------------------
Summary: Datatypes are not preserved when a pandas dataframe
partitioned and saved as parquet file using pyarrow
Key: ARROW-6114
URL: https://issues.apache.org/jira/browse/ARROW-6114
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.14.1
Environment: Python 3.7.3
pyarrow 0.14.1
Reporter: Naga
h3. Datatypes are not preserved when a pandas data frame is *partitioned* and
saved as a parquet file using pyarrow, but they are preserved when the data
frame is not partitioned.
*Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
{code:java}
# Save a pandas DataFrame locally as a partitioned parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols=['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols,
                    preserve_index=False)
# Load the partitioned parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
name object
age category
dtype: object
{code}
From the above output, we can see that the data type of age is int64 in the
original pandas data frame, but it changed to category after the dataset was
saved locally and loaded back.
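For context, one plausible reason for the change (offered as background, not a confirmed diagnosis of this issue): when partition_cols is given, the partition column is not written into the parquet files themselves. Its values are encoded as text in Hive-style directory names such as age=77, so the original dtype has to be re-inferred on read. The sketch below is not pyarrow code; the helper partition_values and the path test/age=77/part-0.parquet are hypothetical and only illustrate that a value recovered from a path starts out as a string.

```python
from pathlib import PurePosixPath


def partition_values(file_path):
    """Collect key=value pairs from a Hive-style partitioned path.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    values = {}
    for part in PurePosixPath(file_path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value  # always a str: the path carries no dtype
    return values


# Hypothetical file path of the kind a partitioned dataset might contain.
vals = partition_values("test/age=77/part-0.parquet")
print(vals)  # {'age': '77'}
```

Because the path carries only text, any reader has to choose a type for the recovered column, which is where a mismatch with the original int64 can arise.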
*Case 2: Non-partitioned dataset - Data types are preserved*
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
print('Saving a Pandas Dataframe to Local as a parquet file without '
      'partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)
# Load the non-partitioned parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}
*Output:*
{code:java}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age int64
name object
dtype: object
Datatypes after loading the dataset
age int64
name object
dtype: object
{code}
*Versions*
* Python 3.7.3
* pyarrow 0.14.1