[jira] [Updated] (ARROW-10099) [C++][Dataset] Also allow integer partition fields to be dictionary encoded

ASF GitHub Bot (Jira) Tue, 06 Oct 2020 11:56:25 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-10099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated ARROW-10099:
-----------------------------------
    Labels: dataset dataset-dask-integration pull-request-available  (was: 
dataset dataset-dask-integration)

> [C++][Dataset] Also allow integer partition fields to be dictionary encoded
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-10099
>                 URL: https://issues.apache.org/jira/browse/ARROW-10099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In ARROW-8647, we added the option to indicate that your partition field 
> columns should be dictionary encoded, but it currently only does this for 
> string type, and not for integer type (with the reasoning that for integers, 
> dictionary encoding does not give any memory efficiency gains). 
> In dask, they have been using categorical dtypes for _all_ partition fields, 
> even if they are integers. They would like to keep doing this (apart from 
> memory efficiency, using a categorical/dictionary type also gives information 
> about all unique values of the column, without having to calculate this), so 
> it would be nice to enable this use case. 
> So I think we could either simply always dictionary encode integers as well 
> when {{max_partition_dictionary_size}} indicates partition fields should be 
> dictionary encoded, or add an additional option to indicate that integer 
> partition fields should also be encoded (if the other option indicates 
> dictionary encoding should be used).
> Based on feedback from the dask PR using the dataset API at 
> https://github.com/dask/dask/pull/6534#issuecomment-698723009
> cc [~rjzamora] [~bkietz]
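
To make the two proposals above concrete, here is a minimal sketch in plain Python. It is not Arrow's implementation and uses no Arrow API: the function names and the encode_strings / encode_integers flags are hypothetical, chosen only to illustrate the second proposal (a separate opt-in for integer fields). Setting encode_integers to the same value as encode_strings would correspond roughly to the first proposal.

```python
from typing import List, Sequence, Tuple, Union

Value = Union[int, str]


def infer_partition_values(raw: Sequence[str]) -> List[Value]:
    """Parse path segments as integers if all of them are integers, else keep strings."""
    try:
        return [int(v) for v in raw]
    except ValueError:
        return list(raw)


def dictionary_encode(values: Sequence[Value]) -> Tuple[List[Value], List[int]]:
    """Return (dictionary, indices): unique values in first-seen order, one index per row."""
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(positions)
        indices.append(positions[v])
    return list(positions), indices


def encode_partition_field(raw, encode_strings=True, encode_integers=False):
    """Hypothetical helper: dictionary encode a partition field depending on its inferred type."""
    values = infer_partition_values(raw)
    is_int = bool(values) and all(isinstance(v, int) for v in values)
    if (is_int and encode_integers) or (not is_int and encode_strings):
        return dictionary_encode(values)
    # Otherwise return the plain (non-dictionary) values unchanged.
    return values
```

With encode_integers left at its default, a field such as year=2019/year=2020 comes back as plain integers, which mirrors the current behaviour described above. With encode_integers=True, the same field comes back as a dictionary plus indices, so the unique values are available without scanning the column.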



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
