Naga created ARROW-6114:
---------------------------

             Summary: Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
                 Key: ARROW-6114
                 URL: https://issues.apache.org/jira/browse/ARROW-6114
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: Python 3.7.3
                      pyarrow 0.14.1
            Reporter: Naga
h3. Datatypes are not preserved when a pandas data frame is *partitioned* and saved as a Parquet file using pyarrow; they are preserved when the data frame is not partitioned.

*Case 1: Saving a partitioned dataset - data types are NOT preserved*

{code:python}
# Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, partition_cols=partition_cols, preserve_index=False)

# Load the partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}

*Output:*

{code}
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object
{code}

From the above output, we can see that the data type of age is int64 in the original pandas data frame, but it changed to category after the dataset was saved locally and loaded back.

*Case 2: Non-partitioned dataset - data types are preserved*

{code:python}
# Save a pandas DataFrame locally as a Parquet dataset without partitioning using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, path, preserve_index=False)

# Load the non-partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)
{code}

*Output:*

{code}
Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object
{code}

*Versions*
* Python 3.7.3
* pyarrow 0.14.1

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
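A possible explanation for the difference between the two cases, offered as a sketch rather than a confirmed diagnosis: a partitioned dataset stores each partition value in a directory name of the form {{key=value}} (for example {{test/age=77/}}), so the value is available to the reader only as text and its type has to be inferred again on load. The stdlib-only snippet below illustrates that idea. The helpers {{parse_partition_path}} and {{cast_partition_values}} and the file name {{part-0.parquet}} are hypothetical and are not part of pyarrow.

```python
from pathlib import PurePosixPath


def parse_partition_path(path):
    """Return the key=value pairs encoded in a hive-style path (hypothetical helper)."""
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep:  # only path components of the form key=value are partition levels
            values[key] = value
    return values


def cast_partition_values(values, schema):
    """Cast parsed strings to caller-supplied types; unknown keys stay strings."""
    return {key: schema.get(key, str)(value) for key, value in values.items()}


# Example path; the file name is made up for illustration.
raw = parse_partition_path("test/age=77/part-0.parquet")
typed = cast_partition_values(raw, {"age": int})
print(raw)    # {'age': '77'}
print(typed)  # {'age': 77}
```

Under the same assumption, a practical workaround until the behaviour changes would be to cast the partition column back to its original dtype after loading, for instance with {{df['age'].astype('int64')}} in pandas.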