[ https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856967#comment-16856967 ]
Joris Van den Bossche commented on ARROW-5480:
----------------------------------------------

[~wesmckinn] I think this can be closed as duplicate of the other issue?

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5480
>                 URL: https://issues.apache.org/jira/browse/ARROW-5480
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>
> A string categorical variable written from pandas to parquet is read back as
> string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric -- a numeric category is
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully
> translates categories back to pandas. However, when I write to a parquet
> file, the categories are not restored.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes # int64
> {code}
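
For the versions listed above, one possible workaround is to record the categories before writing and cast the columns back after reading. The sketch below is illustrative only and is not part of the report or of pyarrow: `restore_categories` and `category_map` are hypothetical names, and the parquet round trip is simulated with `astype(object)` so the snippet runs with pandas alone.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def restore_categories(df, category_map):
    """Return a copy of df with the listed columns cast back to categorical.

    category_map is a hypothetical argument: {column name: list of categories}.
    """
    out = df.copy()
    for col, cats in category_map.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    return out


# Frame with a string categorical column, as in the report.
df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})

# Record the categories of every categorical column before writing.
category_map = {
    col: list(df[col].cat.categories)
    for col in df.columns
    if isinstance(df[col].dtype, CategoricalDtype)
}

# Stand-in for the parquet round trip (assumption): the column comes back
# with plain object dtype instead of category.
df_read = df.astype({'x': object})

# Cast the recorded columns back to category.
df_restored = restore_categories(df_read, category_map)
```

In real use, `df_read` would come from `pd.read_parquet("categories.parquet")`, and `category_map` would need to be stored somewhere alongside the file. The same helper applies to the numeric case, since `pd.Categorical` accepts integer values and categories.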