Ben Kietzman created ARROW-13340:
------------------------------------

             Summary: [C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode
                 Key: ARROW-13340
                 URL: https://issues.apache.org/jira/browse/ARROW-13340
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Ben Kietzman
             Fix For: 6.0.0

ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:
- {{filter}} must be bound to {{dataset_schema}}
- {{projection}} must be bound to {{dataset_schema}}
- {{projected_schema}} must be {{schema<...fields>}}, where the type of {{projection}} is {{struct<...fields>}}

These are currently required to support {{FilterAndProjectScanTask}}, but after ARROW-13328 this complexity can be removed and ScanOptions can be a pure struct argument to {{MakeScanNode}}. Specifically, it should be possible to:
- remove the {{projected_schema}} field (ScanNode doesn't need to know the schemas of any subsequent nodes)
- remove the {{projection}} field (ScanNode doesn't need to know how or if scanned batches will be projected)
- provide a simple vector of {{FieldRef}} to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by {{filter}})
- allow {{filter}} to be unbound (MakeScanNode can bind it to the dataset schema)

{{dataset_schema}} seems slightly redundant too, since MakeScanNode also takes a Dataset as an argument, but it is currently used by CsvFileFormat to derive column types.
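
As a rough illustration of the proposed shape, the sketch below models a "pure struct" ScanOptions together with the validation that MakeScanNode could perform. It does not use the Arrow API: {{Schema}}, {{Filter}}, {{ScanOptions}} and {{ValidateScanOptions}} are hypothetical stand-ins, plain strings replace {{FieldRef}}, and a filter is reduced to the list of field names it references. Binding the filter to the dataset schema is not modelled.

```cpp
// Hypothetical stand-ins for illustration only; none of this is Arrow's API.
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Assumption: an unbound filter is modelled only by the field names it reads.
struct Filter {
  std::vector<std::string> referenced_fields;
};

// A "pure struct": no member has to be kept consistent with another member
// when the struct is built; consistency is checked by the consumer instead.
struct ScanOptions {
  Filter filter;                          // may be unbound
  std::vector<std::string> materialized;  // stand-in for a vector of FieldRef
};

// Assumption: a schema is modelled as a flat list of field names.
struct Schema {
  std::vector<std::string> field_names;
};

static bool Contains(const std::vector<std::string>& names,
                     const std::string& name) {
  return std::find(names.begin(), names.end(), name) != names.end();
}

// Sketch of the checks a MakeScanNode-like factory could run.
// Returns an empty string when the options are valid, else an error message.
std::string ValidateScanOptions(const Schema& dataset_schema,
                                const ScanOptions& options) {
  // Every materialized field must exist in the dataset schema.
  for (const auto& name : options.materialized) {
    if (!Contains(dataset_schema.field_names, name)) {
      return "materialized field not in dataset schema: " + name;
    }
  }
  // Every field referenced by the filter must be materialized.
  for (const auto& name : options.filter.referenced_fields) {
    if (!Contains(options.materialized, name)) {
      return "filter references non-materialized field: " + name;
    }
  }
  return std::string();
}

// Usage example with made-up field names.
inline void Example() {
  Schema schema{{"a", "b", "c"}};
  ScanOptions options{Filter{{"a"}}, {"a", "b"}};
  assert(ValidateScanOptions(schema, options).empty());
}
```

With this shape the caller fills in independent members and all cross-member checks live in one place, which is the property the description asks for.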