Antoine Pitrou created ARROW-18037:
--------------------------------------

             Summary: [C++] Acero/dataset relies on ExecBatch::ToRecordBatch 
truncating excess columns
                 Key: ARROW-18037
                 URL: https://issues.apache.org/jira/browse/ARROW-18037
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Antoine Pitrou


As found while working on ARROW-18004: the dataset scanner and the Acero engine 
rely on {{ExecBatch::ToRecordBatch}} returning successfully when the given 
schema has fewer fields than the ExecBatch has columns.

This apparently makes it possible to implicitly drop the dataset-added columns 
({{kAugmentedFields}} in {{arrow/dataset/scanner.cc}}) from a scan's final 
result.

However, it seems wrong and brittle to do this implicitly at the 
{{ExecBatch::ToRecordBatch}} level (hiding potential errors). Instead, it 
should probably be done explicitly inside Acero/dataset.
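
To illustrate the suggestion, here is a minimal, self-contained sketch of the "explicit drop, strict conversion" idea. It deliberately does not use the Arrow API: {{SimpleBatch}}, {{SimpleTable}}, {{ToTableStrict}} and {{DropTrailingColumns}} are hypothetical stand-ins, and it assumes the dataset-added columns sit at the end of the batch (which is what truncation implies).

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Arrow types.
using Column = std::vector<int>;

struct SimpleBatch {  // plays the role of ExecBatch
  std::vector<Column> columns;
};

struct SimpleTable {  // plays the role of RecordBatch
  std::vector<std::string> field_names;
  std::vector<Column> columns;
};

// Strict conversion: a mismatch between the number of fields and the number
// of columns is reported (std::nullopt) instead of being hidden by truncation.
std::optional<SimpleTable> ToTableStrict(
    const SimpleBatch& batch, const std::vector<std::string>& field_names) {
  if (batch.columns.size() != field_names.size()) return std::nullopt;
  return SimpleTable{field_names, batch.columns};
}

// Explicit step owned by the caller (the scanner, in this analogy): keep only
// the first num_to_keep columns. Assumes the extra columns are trailing.
SimpleBatch DropTrailingColumns(SimpleBatch batch, std::size_t num_to_keep) {
  assert(num_to_keep <= batch.columns.size());
  batch.columns.resize(num_to_keep);
  return batch;
}
```

With this split, a caller that forgets to drop the extra columns gets an error from the conversion rather than a silently truncated result, while a caller that really wants them gone says so with an explicit {{DropTrailingColumns}} call before converting.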




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
