[ https://issues.apache.org/jira/browse/ARROW-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-17852:
--------------------------------
    Summary: [python] `dtype` of `Categorical` category columns are not preserved  (was: `dtype` of `Categorical` category columns are not preserved)

> [python] `dtype` of `Categorical` category columns are not preserved
> --------------------------------------------------------------------
>
>                 Key: ARROW-17852
>                 URL: https://issues.apache.org/jira/browse/ARROW-17852
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Ryan Ballard
>            Priority: Major
>              Labels: categorical, pandas, pyarrow
>
> Hi there,
> First time submitting an issue here, so apologies if there's anything I've missed.
> I see the bug below, whereby the {{dtype}} of the categories themselves (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.
> This causes a problem because the dtypes need to be the same in order for the categories to be considered the same (so that they can then be concatenated, for example).
> The current workaround is to store the column as a plain {{pd.StringDtype()}} and then convert it to {{pd.Categorical}} in memory with pandas (which infers the category dtype from the underlying type, but in doing so sacrifices the disk savings of storing the column as a dictionary).
> Using pyarrow 9.0.0 and pandas 1.4.4.
> Thanks
>
> import pandas as pd
> import pyarrow as pa
>
> # note, Categorical column B is constructed from `pd.StringDtype`
> df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())
> df["B"] = df["A"].astype("category")
> print(df["B"].cat.categories)
> # Index(['a', 'b', 'c'], dtype='string')
>
> # however, this is downcast to `object` during a roundtrip
> print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)
> # Index(['a', 'b', 'c'], dtype='object')
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
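
Since the report says the category dtypes must match before categoricals can be combined, one possible stopgap is to recast only the categories after the round trip. This is a different approach from the reporter's workaround, not a fix in pyarrow. The sketch below is pure pandas. The helper name `restore_category_dtype` is hypothetical and is not part of pandas or pyarrow, and the `roundtripped` column is a hand-built stand-in for the result of `pa.Table.from_pandas(df).to_pandas()["B"]`.

```python
import pandas as pd


def restore_category_dtype(series: pd.Series, dtype) -> pd.Series:
    """Hypothetical helper: rebuild a categorical Series so that its
    categories use `dtype`, keeping the codes and ordering unchanged."""
    if not isinstance(series.dtype, pd.CategoricalDtype):
        raise TypeError("expected a categorical Series")
    # Cast only the categories index, which is small compared to the column.
    categories = series.cat.categories.astype(dtype)
    # Reuse the existing integer codes; -1 still marks missing values.
    values = pd.Categorical.from_codes(
        series.cat.codes.to_numpy(),
        categories=categories,
        ordered=series.cat.ordered,
    )
    return pd.Series(values, index=series.index, name=series.name)


# Stand-in for a column that came back from a pyarrow round trip with its
# categories held as `object` instead of `string` (assumption for this demo).
roundtripped = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "a"],
        categories=pd.Index(["a", "b", "c"], dtype=object),
    ),
    name="B",
)

restored = restore_category_dtype(roundtripped, pd.StringDtype())
print(restored.cat.categories.dtype)
```

Because only the categories are cast and the codes are reused, the cost scales with the number of distinct categories rather than the number of rows, and the data can still be stored on disk as a dictionary.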