[ https://issues.apache.org/jira/browse/ARROW-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-17852:
--------------------------------
    Summary: [python] `dtype` of `Categorical` category columns are not 
preserved  (was: `dtype` of `Categorical` category columns are not preserved)

> [python] `dtype` of `Categorical` category columns are not preserved
> --------------------------------------------------------------------
>
>                 Key: ARROW-17852
>                 URL: https://issues.apache.org/jira/browse/ARROW-17852
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Ryan Ballard
>            Priority: Major
>              Labels: categorical, pandas, pyarrow
>
> Hi there,
> First time submitting an issue here so apologies if there's anything I've 
> missed.
> I see the bug below, whereby the {{dtype}} of the categories themselves 
> (within a {{pd.Categorical}}) is not preserved on a round trip via pyarrow. 
> Hopefully the snippet below demonstrates the issue.
> This causes a problem because the dtypes need to be the same in 
> order for the categories to be considered the same (so they can then be 
> concatenated, for example).
> The current workaround is to store the column as a plain {{pd.StringDtype()}} 
> and then convert it to {{pd.Categorical}} in memory with pandas (which infers 
> the category dtype from the underlying type, but in doing so sacrifices the 
> disk savings of storing as a dictionary).
> Using pyarrow 9.0.0 and pandas 1.4.4.
> Thanks
>  
> {{import pandas as pd}}
> {{import pyarrow as pa}}
>  
> {{# note, Categorical column B is constructed from `pd.StringDtype`}}
> {{df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())}}
> {{df["B"] = df["A"].astype("category")}}
> {{print(df["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='string')}}
>  
> {{# however, this is downcast to `object` during a roundtrip}}
> {{print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)}}
> {{# Index(['a', 'b', 'c'], dtype='object')}}
>  
>  
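
A minimal sketch of the workaround described in the report, using pandas only. The helper names as_string_category and recast_categories are hypothetical and are not part of pandas or pyarrow. The first follows the report: rebuild the categorical in memory from a plain string column. The second is an assumption not taken from the report: it casts the categories back to the wanted dtype after loading, which might let the column stay dictionary-encoded on disk. The pyarrow round trip itself is not run here; a categorical with explicit object categories stands in for its result.

```python
import pandas as pd


def as_string_category(series: pd.Series) -> pd.Series:
    """Rebuild a categorical from a plain string column (workaround from the report)."""
    # Casting to StringDtype first makes the inferred categories use StringDtype.
    return series.astype(pd.StringDtype()).astype("category")


def recast_categories(series: pd.Series, dtype="string") -> pd.Series:
    """Hypothetical helper: cast the categories of a categorical column to `dtype`."""
    new_categories = series.cat.categories.astype(dtype)
    # rename_categories keeps the codes and swaps in the recast categories.
    return series.cat.rename_categories(new_categories)


# Stand-in for a column whose categories came back as `object` after a round trip.
object_categories = pd.Index(["a", "b", "c"], dtype=object)
after_roundtrip = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=object_categories)
)

fixed = recast_categories(after_roundtrip)
rebuilt = as_string_category(pd.Series(["a", "b", "c", "a"]))

print(after_roundtrip.cat.categories.dtype)
print(fixed.cat.categories.dtype)
print(rebuilt.cat.categories.dtype)
```

Either helper would be applied to the affected columns right after to_pandas(), before any concatenation that requires matching category dtypes.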



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
