[ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-8812:
--------------------------------
    Summary: [Python] Columns of type CategoricalIndex fails to be read back  (was: Columns of type CategoricalIndex fails to be read back)

> [Python] Columns of type CategoricalIndex fails to be read back
> ---------------------------------------------------------------
>
>                 Key: ARROW-8812
>                 URL: https://issues.apache.org/jira/browse/ARROW-8812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Python 3.7.7
> MacOS (Darwin-19.4.0-x86_64-i386-64bit)
> Pandas 1.0.3
> Pyarrow 0.15.1
>            Reporter: Jonas Nelle
>            Priority: Minor
>              Labels: parquet
>
> When the columns of a data frame are of type {{CategoricalIndex}}, saving
> the table and reading it back raises
> {{TypeError: data type "categorical" not understood}}:
> {code:python}
> import pandas as pd
> from pyarrow import parquet, Table
>
> base_df = pd.DataFrame([['foo', 'j', "1"],
>                         ['bar', 'j', "1"],
>                         ['foo', 'j', "1"],
>                         ['foobar', 'j', "1"]],
>                        columns=['my_cat', 'var', 'for_count'])
> base_df['my_cat'] = base_df['my_cat'].astype('category')
>
> # Unstacking the categorical level turns it into the column labels
> df = (
>     base_df
>     .groupby(["my_cat", "var"], observed=True)
>     .agg({"for_count": "count"})
>     .rename(columns={"for_count": "my_cat_counts"})
>     .unstack(level="my_cat", fill_value=0)
> )
> print(df)
> {code}
> The resulting data frame looks something like this:
> || ||my_cat_counts|| || ||
> |my_cat|foo|bar|foobar|
> |var| | | |
> |j|2|1|1|
>
> Writing the table and reading it back then raises the {{TypeError}}:
> {code:python}
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> # TypeError: data type "categorical" not understood
> {code}
> In the example, the columns are also a MultiIndex, but that isn't the
> problem; the error persists after reducing them to the single categorical level:
> {code:python}
> df.columns = df.columns.get_level_values(1)
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()
> # TypeError: data type "categorical" not understood
> {code}
> This is the workaround [suggested on
> stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
> {code:python}
> df.columns = pd.Index(list(df.columns))  # suggested fix for the time being
> parquet.write_table(Table.from_pandas(df), "test.pqt")
> parquet.read_table("test.pqt").to_pandas()  # no error
> {code}
> Are there any plans to support the pattern described here in the future?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
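
For anyone applying the reported workaround in more than one place, it could be wrapped in a small helper along these lines. This is only a sketch: the name `decategorize_columns` is hypothetical (it is not part of pandas or pyarrow), and the MultiIndex branch is an assumption that rebuilding each level from plain values is enough, which the report does not confirm.

```python
import pandas as pd


def decategorize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels hold no categorical data.

    Hypothetical helper generalizing the workaround from the report;
    it only touches the column labels, never the cell values.
    """
    out = df.copy()
    cols = out.columns
    if isinstance(cols, pd.MultiIndex):
        # Rebuild every level from its plain Python values so that no
        # level remains a CategoricalIndex (assumption: this is sufficient).
        out.columns = pd.MultiIndex.from_arrays(
            [pd.Index(list(cols.get_level_values(i))) for i in range(cols.nlevels)],
            names=cols.names,
        )
    elif isinstance(cols, pd.CategoricalIndex):
        # Same idea as the reported workaround, keeping the index name.
        out.columns = pd.Index(list(cols), name=cols.name)
    return out
```

The result of `decategorize_columns(df)` would then be passed to `Table.from_pandas` in place of `df`. The categorical dtype of the column labels is lost on the way, so it has to be restored by hand after reading if it matters.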