[ https://issues.apache.org/jira/browse/ARROW-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-10099:
-----------------------------------
    Labels: dataset dataset-dask-integration pull-request-available  (was: dataset dataset-dask-integration)

> [C++][Dataset] Also allow integer partition fields to be dictionary encoded
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-10099
>                 URL: https://issues.apache.org/jira/browse/ARROW-10099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ARROW-8647, we added the option to indicate that your partition field
> columns should be dictionary encoded, but it currently only does this for
> string type, and not for integer type (with the reasoning that for integers,
> dictionary encoding gives no memory efficiency gains).
> In dask, they have been using categorical dtypes for _all_ partition fields,
> even if they are integers. They would like to keep doing this (apart from
> memory efficiency, using categorical/dictionary type also gives information
> about all unique values of the column, without having to calculate this), so
> it would be nice to enable this use case.
> So I think we could either simply always dictionary encode integers as well
> when {{max_partition_dictionary_size}} indicates partition fields should be
> dictionary encoded, or have an additional option to indicate that integer
> partition fields should also be encoded (if the other option indicates
> dictionary encoding should be used).
> Based on feedback from the dask PR using the dataset API at
> https://github.com/dask/dask/pull/6534#issuecomment-698723009
> cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
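
The two alternatives proposed in the issue can be illustrated with a plain-Python sketch. This is not Arrow's implementation and it calls no pyarrow API. All function names here are hypothetical, as are the `use_dictionary` flag (standing in for a `max_partition_dictionary_size` setting that enables encoding) and the `include_integers` flag (standing in for the proposed additional option). The sketch assumes hive-style `key=value` path segments and that every path carries the same partition keys.

```python
def parse_hive_path(path):
    """Extract {'key': 'value'} pairs from the 'key=value' segments of a path."""
    fields = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            fields[key] = value
    return fields


def all_integers(raw_values):
    """Return True if every raw string parses as an int."""
    try:
        for value in raw_values:
            int(value)
    except ValueError:
        return False
    return True


def encode(values):
    """Dictionary-encode: unique values in first-seen order, plus one index per input."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def infer_partition_columns(paths, use_dictionary=True, include_integers=False):
    """Build one column per partition key.

    Hypothetical flags: with include_integers=False only string fields are
    encoded (the behaviour the issue describes); with include_integers=True
    integer fields are encoded as well (the behaviour the issue requests).
    """
    parsed = [parse_hive_path(p) for p in paths]
    columns = {}
    for name in parsed[0]:
        raw = [row[name] for row in parsed]
        is_int = all_integers(raw)
        values = [int(v) for v in raw] if is_int else raw
        should_encode = use_dictionary and (include_integers or not is_int)
        columns[name] = encode(values) if should_encode else values
    return columns


paths = [
    "year=2019/country=NL/a.parquet",
    "year=2020/country=BE/b.parquet",
    "year=2019/country=BE/c.parquet",
]
strings_only = infer_partition_columns(paths, include_integers=False)
with_integers = infer_partition_columns(paths, include_integers=True)
```

With `include_integers=True`, the `year` column carries its unique values (`[2019, 2020]`) alongside the indices, which is the "unique values without having to calculate them" benefit the dask developers asked for.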