[ https://issues.apache.org/jira/browse/ARROW-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gavin updated ARROW-14767:
--------------------------
    Description: 
When converting from a pandas dataframe to a table, categorical variables are by default given an index type of int8 in the schema (presumably because there are fewer than 128 categories). When this table is written to a parquet file, the schema changes so that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read from disk.

A minimal reproduction of the issue is as follows:
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)
tbl = pa.Table.from_pandas(df)

where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()
pq.write_table(
    tbl,
    filesystem.open_output_stream(where, compression=None),
    version="2.0",
)

schema = tbl.schema
read_schema = pq.ParquetFile(filesystem.open_input_file(where)).schema_arrow
{code}
By printing schema and read_schema, you can see the inconsistency. I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.


> Categorical int8 index types written as int32 in parquet files
> --------------------------------------------------------------
>
>                 Key: ARROW-14767
>                 URL: https://issues.apache.org/jira/browse/ARROW-14767
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>        Environment: NAME="CentOS Linux"
> VERSION="7 (Core)"
>            Reporter: Gavin
>            Priority: Minor
>
> When converting from a pandas dataframe to a table, categorical variables are
> by default given an index type of int8 in the schema (presumably because
> there are fewer than 128 categories). When this table is written to a parquet
> file, the schema changes so that the index type is int32 instead. This causes
> an inconsistency between the schemas of tables derived from pandas and those
> read from disk.
> A minimal reproduction of the issue is as follows:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.fs
> import pyarrow.parquet as pq
>
> df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
> dtypes = {
>     "A": np.dtype("int8"),
>     "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
> }
> df = df.astype(dtypes)
> tbl = pa.Table.from_pandas(df)
>
> where = "tmp.parquet"
> filesystem = pa.fs.LocalFileSystem()
> pq.write_table(
>     tbl,
>     filesystem.open_output_stream(where, compression=None),
>     version="2.0",
> )
>
> schema = tbl.schema
> read_schema = pq.ParquetFile(filesystem.open_input_file(where)).schema_arrow
> {code}
> By printing schema and read_schema, you can see the inconsistency.
> I have workarounds in place for this, but am raising the issue anyway so that
> you can resolve it properly.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)