Ben Kietzman created ARROW-8658:
-----------------------------------
Summary: [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
Key: ARROW-8658
URL: https://issues.apache.org/jira/browse/ARROW-8658
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.17.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 1.0.0
Subtree pruning is a useful optimization for large datasets with multiple partition
fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a
filter {{"a"_ == 2}}, none of its files or subdirectories need to be examined.
After ARROW-8318, FileSystemDataset stores only files, so subtree pruning (whose
implementation depended on the presence of directories to represent subtrees)
was disabled. It should be possible to reintroduce it without reference to
directories by examining partition expressions directly and extracting a tree
structure from their subexpressions.
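A rough sketch of the idea, not the Arrow implementation: each fragment's partition expression is modeled as an ordered conjunction of equality terms, fragments sharing leading terms are grouped under one node, and a filter discards a whole subtree as soon as one term contradicts it. The names Term, PartitionExpr, Node, BuildTree and Collect are hypothetical and are not part of the Arrow API; real partition expressions and filters are more general than the single-equality form assumed here.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one equality subexpression, e.g. a == 3.
struct Term {
  std::string field;
  int value;
  bool operator<(const Term& other) const {
    return std::tie(field, value) < std::tie(other.field, other.value);
  }
};

// A fragment's partition expression, modeled as a conjunction of Terms
// (assumption: terms appear in the same field order for every fragment).
using PartitionExpr = std::vector<Term>;

struct Node {
  std::map<Term, std::unique_ptr<Node>> children;
  std::vector<std::size_t> fragments;  // indices of fragments ending at this node
};

// Build a tree in which fragments sharing leading subexpressions share a path.
std::unique_ptr<Node> BuildTree(const std::vector<PartitionExpr>& exprs) {
  std::unique_ptr<Node> root(new Node());
  for (std::size_t i = 0; i < exprs.size(); ++i) {
    Node* node = root.get();
    for (const Term& term : exprs[i]) {
      std::unique_ptr<Node>& child = node->children[term];
      if (!child) child.reset(new Node());
      node = child.get();
    }
    node->fragments.push_back(i);
  }
  return root;
}

// Collect the fragments that may satisfy `filter` (a single equality),
// skipping every subtree whose term contradicts it. `visited` counts the
// nodes actually examined.
void Collect(const Node& node, const Term& filter,
             std::vector<std::size_t>* out, std::size_t* visited) {
  ++*visited;
  out->insert(out->end(), node.fragments.begin(), node.fragments.end());
  for (const auto& entry : node.children) {
    const Term& term = entry.first;
    if (term.field == filter.field && term.value != filter.value) {
      continue;  // contradiction: prune the whole subtree
    }
    Collect(*entry.second, filter, out, visited);
  }
}
```

With four fragments partitioned as a=2/b=1, a=2/b=2, a=3/b=1 and a=3/b=2, calling Collect with the filter Term{"a", 2} examines only the root and the a=2 subtree; the a=3 node and both of its children are never visited.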