[ https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522871#comment-17522871 ]

Joris Van den Bossche commented on ARROW-5480:
----------------------------------------------

[~kukughking] there are already several other issues about this. The underlying issue is that we need to support reading types other than BYTE_ARRAY into dictionary type for Parquet, which is covered in ARROW-6140. As a result, for now the categorical dtype is only preserved for string data and not for boolean or numeric data types; see e.g. ARROW-13342 and ARROW-11157 for other reports on this topic with some additional explanation/discussion.

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5480
>                 URL: https://issues.apache.org/jira/browse/ARROW-5480
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>            Reporter: Karl Dunkle Werner
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> A string categorical variable written from pandas to parquet is read back as
> string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric -- a numeric category is
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully
> translates categories back to pandas. However, when I write to a parquet
> file, the category dtype is lost.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes  # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)