Ben Kietzman created ARROW-13340:
------------------------------------

             Summary: [C++][Dataset] Simplify ScanOptions after complexity has 
moved to ScanNode
                 Key: ARROW-13340
                 URL: https://issues.apache.org/jira/browse/ARROW-13340
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Ben Kietzman
             Fix For: 6.0.0


ScanOptions currently has a number of constraints between members, which 
violates the contract of a public struct:

- {{filter}} must be bound to {{dataset_schema}}
- {{projection}} must be bound to {{dataset_schema}}
- {{projected_schema}} must be {{schema<...fields>}}, where the type of 
{{projection}} is {{struct<...fields>}}

These are currently required to support {{FilterAndProjectScanTask}}, but after 
ARROW-13328 this complexity can be removed and ScanOptions can be a pure struct 
argument to {{MakeScanNode}}. Specifically, it should be possible to:

- remove the {{projected_schema}} field (ScanNode doesn't need to know the 
schemas of any subsequent nodes)
- remove the {{projection}} field (ScanNode doesn't need to know how or if 
scanned batches will be projected)
- provide a simple vector of {{FieldRef}} to indicate which fields should be 
materialized (MakeScanNode can validate that this includes every field 
referenced by {{filter}})
- allow {{filter}} to be unbound (MakeScanNode can bind it to the dataset 
schema)
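
As a rough illustration of the validation described in the list above, here 
is a minimal, self-contained sketch. It does not use the Arrow API: 
{{FieldName}}, {{FilterSketch}}, {{SimplifiedScanOptions}} and 
{{MissingFilterFields}} are hypothetical names, with plain strings standing in 
for {{FieldRef}} and a list of referenced names standing in for an unbound 
filter expression.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for FieldRef: a plain field name keeps the sketch
// independent of the Arrow headers.
using FieldName = std::string;

// Hypothetical stand-in for an unbound filter expression. Only the names of
// the fields it refers to matter for this check.
struct FilterSketch {
  std::vector<FieldName> referenced_fields;
};

// Hypothetical simplified options: no projection and no projected_schema,
// just the fields to materialize and the filter.
struct SimplifiedScanOptions {
  std::vector<FieldName> materialized_fields;
  FilterSketch filter;
};

// Returns the fields referenced by the filter that are not materialized.
// An empty result means the options are consistent.
std::vector<FieldName> MissingFilterFields(const SimplifiedScanOptions& options) {
  std::vector<FieldName> missing;
  const auto& fields = options.materialized_fields;
  for (const FieldName& name : options.filter.referenced_fields) {
    if (std::find(fields.begin(), fields.end(), name) == fields.end()) {
      missing.push_back(name);
    }
  }
  return missing;
}
```

With a struct of this shape the members no longer constrain each other at 
construction time. The one remaining consistency rule is checked in a single 
place, which in the proposal would be MakeScanNode.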

{{dataset_schema}} also seems slightly redundant, since MakeScanNode already 
takes a Dataset as an argument, but it is currently used by CsvFileFormat to 
derive column types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)