[ https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-11260:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11260
>                 URL: https://issues.apache.org/jira/browse/ARROW-11260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up on ARROW-10247 (see also
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We
> currently require the user to pass manually specified dictionary values when
> reading a dataset with a Partitioning based on a schema with dictionary-typed
> fields.
> In practice, that means the user needs, for example, to parse the file
> paths to get all the possible values the partition field can take, while
> Arrow will afterwards do the same again to construct the dataset object.
> _Naively_, it seems that it should be possible to let Arrow infer the
> dictionary _values_, even when providing an explicit schema with a dictionary
> field for the Partitioning (i.e. when not letting the partitioning schema
> itself be inferred from the file paths).
> An example use case is a Partitioning schema with both dictionary and
> non-dictionary fields. When discovering the schema, you can only have all or
> nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
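
The manual step the issue describes (parsing the file paths to collect every value a partition field can take) can be sketched in plain Python. This is only an illustration under stated assumptions, not Arrow's implementation or API: the helper name `collect_partition_values`, the example paths, and the field names `year` and `key` are all hypothetical, and the layout is assumed to be one directory level per partition field, in schema order.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def collect_partition_values(paths, field_names):
    """Return {field_name: sorted unique values} for directory-partitioned paths.

    Hypothetical helper: assumes each path has the form
    "<value of field 0>/<value of field 1>/.../<file name>".
    """
    values = defaultdict(set)
    for path in paths:
        # Drop the file name; what remains are the partition segments.
        segments = PurePosixPath(path).parts[:-1]
        if len(segments) < len(field_names):
            raise ValueError(
                f"path {path!r} has fewer segments than partition fields"
            )
        # Pair each partition field with its directory segment, in order.
        for name, segment in zip(field_names, segments):
            values[name].add(segment)
    return {name: sorted(values[name]) for name in field_names}


# Made-up example paths for a two-level directory partitioning.
paths = [
    "2020/a/part-0.parquet",
    "2020/b/part-0.parquet",
    "2021/a/part-0.parquet",
]
dictionaries = collect_partition_values(paths, ["year", "key"])
print(dictionaries)  # {'year': ['2020', '2021'], 'key': ['a', 'b']}
```

The values gathered this way are what the user currently has to hand over as dictionary values, even though dataset discovery walks the same paths again afterwards. The improvement requested here would make this pass unnecessary.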