George Sakkis created ARROW-4492:
------------------------------------

             Summary: ValueError: Categorical categories must be unique
                 Key: ARROW-4492
                 URL: https://issues.apache.org/jira/browse/ARROW-4492
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.0
            Reporter: George Sakkis
         Attachments: slug.pq

On pyarrow 0.12.0, some (but not all) columns cannot be read as category dtype. Attached is an extracted failing sample.

{noformat}
import dask.dataframe as dd
df = dd.read_parquet('slug.pq', categories=['slug'], engine='pyarrow').compute()
print(len(df['slug'].dtype.categories))
{noformat}

This works on pyarrow 0.11.1 (and fastparquet 0.2.1).
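
Since the error says the categories are not unique, one way to narrow this down might be to check a column's category values for repeats. The following is only a minimal sketch using the standard library: `find_duplicates` is a hypothetical helper, and the sample values are made up rather than taken from slug.pq.

```python
from collections import Counter


def find_duplicates(categories):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(categories)
    return [value for value, n in counts.items() if n > 1]


# Made-up sample values; not taken from slug.pq.
sample = ["alpha", "beta", "alpha", "gamma", "beta"]
print(find_duplicates(sample))  # ['alpha', 'beta']
```

Running the same check on the category values of a failing column and of a working one could show whether repeated values are what separates the two.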