[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

Joris Van den Bossche (Jira) Fri, 15 Apr 2022 08:33:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522871#comment-17522871
 ]


Joris Van den Bossche commented on ARROW-5480:
----------------------------------------------

[~kukughking] there are already several other issues about this. The underlying 
issue is that we need to support reading types other than BYTE_ARRAY into a 
dictionary type for Parquet, which is covered in ARROW-6140. As a result, for 
now the categorical dtype is only preserved for string data and not for boolean 
or numeric data types; see e.g. ARROW-13342 and ARROW-11157 for other reports 
on this topic with some additional explanation/discussion.

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5480
>                 URL: https://issues.apache.org/jira/browse/ARROW-5480
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>            Reporter: Karl Dunkle Werner
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> A string categorical variable written from pandas to parquet is read back as 
> string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric -- a numeric category is 
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully 
> translates categories back to pandas. However, when I write to a parquet 
> file, the categories are not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
