[ 
https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11260:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Don't require dictionaries for reading dataset with 
> schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11260
>                 URL: https://issues.apache.org/jira/browse/ARROW-11260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up on ARROW-10247 (see also 
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We 
> currently require the user to pass manually specified dictionary values when 
> reading a dataset with a Partitioning based on a schema with 
> dictionary-typed fields.
> In practice, this means the user has to, for example, parse the file 
> paths to collect all the values the partition field can take, and 
> Arrow then does the same work again to construct the dataset object.
> _Naively_, it seems that it should be possible to let Arrow infer the 
> dictionary _values_ even when the user provides an explicit schema with a 
> dictionary field for the Partitioning (i.e. when the partitioning schema 
> itself is not inferred from the file paths).
> An example use case is when you have a Partitioning schema with both 
> dictionary and non-dictionary fields. When discovering the schema, you can 
> only have all or nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
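
To illustrate the manual step the description refers to (parsing the file paths to collect every value a partition field can take), here is a minimal sketch in plain Python. It does not use Arrow's API. The function name `collect_partition_values`, the directory layout `<base_dir>/<year>/<key>/<file>`, and the sample paths are assumptions made for illustration only.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def collect_partition_values(paths, base_dir, field_names):
    """Collect the distinct values of each directory-partitioned field.

    Hypothetical helper: assumes a layout such as
    <base_dir>/<value of field 0>/<value of field 1>/<file>.
    """
    values = defaultdict(set)
    for path in paths:
        # Path segments between the base directory and the file name.
        segments = PurePosixPath(path).relative_to(base_dir).parts[:-1]
        for name, segment in zip(field_names, segments):
            values[name].add(segment)
    # Sort so that the resulting value lists have a deterministic order.
    return {name: sorted(values[name]) for name in field_names}


# Made-up sample paths, for illustration only.
paths = [
    "data/2019/a/part-0.parquet",
    "data/2019/b/part-0.parquet",
    "data/2020/a/part-0.parquet",
]
dictionaries = collect_partition_values(paths, "data", ["year", "key"])
print(dictionaries)
```

The resulting per-field value lists are what a user would then have to pass along as the dictionary values; the issue proposes letting Arrow derive them during dataset discovery instead.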



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
