[ 
https://issues.apache.org/jira/browse/ARROW-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gavin updated ARROW-14767:
--------------------------
    Description: 
When converting from a pandas dataframe to a table, categorical variables are 
by default given an index type int8 (presumably because there are fewer than 
128 categories) in the schema. When this is written to a parquet file, the 
schema changes such that the index type is int32 instead. This causes an 
inconsistency between the schemas of tables derived from pandas and those read 
from disk.

A minimal recreation of the issue is as follows:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs  # needed so pa.fs.LocalFileSystem() below resolves
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(
    df, 
)  
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()

pq.write_table(
    tbl,
    filesystem.open_output_stream(
        where,
        compression=None,
    ),
    version="2.0",
)

schema = tbl.schema

read_schema = pq.ParquetFile(
    filesystem.open_input_file(where),
).schema_arrow{code}
By printing schema and read_schema, you can see the inconsistency.
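For reference, the relevant field in each schema looks roughly like this (the exact formatting may differ between pyarrow versions):
{code}
schema:       B: dictionary<values=string, indices=int8, ordered=0>
read_schema:  B: dictionary<values=string, indices=int32, ordered=0>
{code}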

I have workarounds in place for this, but am raising the issue anyway so that 
you can resolve it properly.
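For context, a minimal sketch of the kind of workaround I have in place (assuming the pandas metadata written by Table.from_pandas is preserved in the parquet file, so the column comes back as categorical) is to round-trip the read table through pandas, which re-derives the int8 index type from the number of categories:
{code:python}
# Workaround sketch: rebuild the table from pandas so the dictionary index
# type is re-derived (int8 here, since there are fewer than 128 categories).
read_tbl = pq.read_table(filesystem.open_input_file(where))
fixed_tbl = pa.Table.from_pandas(read_tbl.to_pandas())
print(fixed_tbl.schema.field("B").type)  # dictionary<values=string, indices=int8, ordered=0>
{code}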

  was:
When converting from a pandas dataframe to a table, categorical variables are 
by default given an index type int8 (presumably because there are fewer than 
128 categories) in the schema. When this is written to a parquet file, the 
schema changes such that the index type is int32 instead. This causes an 
inconsistency between the schemas of tables derived from pandas and those read 
from disk.

A minimal recreation of the issue is as follows:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(
    df, 
)  
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()

pq.write_table(
    tbl,
    filesystem.open_output_stream(
        where,
        compression=None,
    ),
    version="2.0",
)

schema = tbl.schema

read_schema = pq.ParquetFile(
    filesystem.open_input_file(where), #buffer_size=_BUFFER_SIZE, 
pre_buffer=True
).schema_arrow{code}
By printing schema and read_schema, you can see the inconsistency.

I have workarounds in place for this, but am raising the issue anyway so that 
you can resolve it properly.


> Categorical int8 index types written as int32 in parquet files
> --------------------------------------------------------------
>
>                 Key: ARROW-14767
>                 URL: https://issues.apache.org/jira/browse/ARROW-14767
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: NAME="CentOS Linux"
> VERSION="7 (Core)"
>            Reporter: Gavin
>            Priority: Minor
>
> When converting from a pandas dataframe to a table, categorical variables are 
> by default given an index type int8 (presumably because there are fewer than 
> 128 categories) in the schema. When this is written to a parquet file, the 
> schema changes such that the index type is int32 instead. This causes an 
> inconsistency between the schemas of tables derived from pandas and those 
> read from disk.
> A minimal recreation of the issue is as follows:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.fs  # needed so pa.fs.LocalFileSystem() below resolves
> import pyarrow.parquet as pq
> df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
> dtypes = {
>     "A": np.dtype("int8"),
>     "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
> }
> df = df.astype(dtypes)
> tbl = pa.Table.from_pandas(
>     df, 
> )  
> where = "tmp.parquet"
> filesystem = pa.fs.LocalFileSystem()
> pq.write_table(
>     tbl,
>     filesystem.open_output_stream(
>         where,
>         compression=None,
>     ),
>     version="2.0",
> )
> schema = tbl.schema
> read_schema = pq.ParquetFile(
>     filesystem.open_input_file(where),
> ).schema_arrow{code}
> By printing schema and read_schema, you can see the inconsistency.
> I have workarounds in place for this, but am raising the issue anyway so that 
> you can resolve it properly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)