George Sakkis created ARROW-4492:
------------------------------------

             Summary: ValueError: Categorical categories must be unique
                 Key: ARROW-4492
                 URL: https://issues.apache.org/jira/browse/ARROW-4492
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.0
            Reporter: George Sakkis
         Attachments: slug.pq

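On pyarrow 0.12.0, some (but not all) columns cannot be read as category dtype; the read fails with "ValueError: Categorical categories must be unique".
Attached is a small failing sample (slug.pq) extracted from the original data.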

 {noformat}
import dask.dataframe as dd
df = dd.read_parquet('slug.pq', categories=['slug'], engine='pyarrow').compute()
print(len(df['slug'].dtype.categories))
 {noformat}

This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
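
For context, the error text in the summary matches the message pandas raises when a categorical is built from a category list that contains duplicates. The sketch below only reproduces that pandas check in isolation; it assumes pandas is installed, the category values are made up rather than taken from slug.pq, and it does not by itself show why pyarrow 0.12.0 would produce duplicate categories for this column.

```python
import pandas as pd

# Hypothetical category values containing a duplicate ("a" appears twice).
# These are illustrative only and are not taken from slug.pq.
duplicated_categories = ["a", "b", "a"]

try:
    # Building a categorical from integer codes plus a non-unique
    # category list is rejected by pandas.
    pd.Categorical.from_codes([0, 1, 2], categories=duplicated_categories)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the same check is what fails during the read, the categories handed to pandas for the 'slug' column would contain repeated values, which may be a useful thing to inspect in the attached file.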



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
