Neal Richardson created ARROW-17463:
---------------------------------------

             Summary: [R] Avoid unnecessary projections
                 Key: ARROW-17463
                 URL: https://issues.apache.org/jira/browse/ARROW-17463
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Neal Richardson
            Assignee: Neal Richardson
             Fix For: 10.0.0


In ExecPlan$Build(), we call Project in a few places, and there is code to make 
sure that there is at least one ProjectNode in the query in order to remove 
augmented fields from a Dataset scan (unless the user has added them). As a 
result, it is possible to get multiple consecutive ProjectNodes that are 
essentially no-ops. One example is grouped aggregation: there is a 
projection to restore the column order that R expects, and then a 
no-op projection after that:

{code}
> mtcars |> arrow_table() |> count(cyl) |> explain()
ExecPlan with 6 nodes:
5:SinkNode{}
  4:ProjectNode{projection=[cyl, n]}
    3:ProjectNode{projection=[cyl, n]}
      2:GroupByNode{keys=["cyl"], aggregates=[
        hash_sum(n, {skip_nulls=true, min_count=1}),
      ]}
        1:ProjectNode{projection=["n": 1, cyl]}
          0:TableSourceNode{}
{code}

I don't know how significant the performance impact is, but it certainly 
looks wasteful and should be avoidable.
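
As a rough illustration of the check that could avoid this, here is a minimal sketch in base R. It does not use any arrow internals: is_noop_projection() and drop_noop_projections() are hypothetical helpers, a projection is modeled as a named character vector of plain field references, and the (n, cyl) output order of the GroupByNode is an assumption made for the example.

```r
# Hypothetical helpers, not part of the arrow package. A projection is modeled
# as a named character vector: names are the output columns, values are the
# input fields they reference (plain field references only, no expressions).
is_noop_projection <- function(input_names, projection) {
  identical(unname(projection), input_names) &&
    identical(names(projection), input_names)
}

# Walk a sequence of projections and drop any that leave the incoming
# schema (names and order) unchanged.
drop_noop_projections <- function(input_names, projections) {
  kept <- list()
  current <- input_names
  for (p in projections) {
    if (!is_noop_projection(current, p)) {
      kept[[length(kept) + 1]] <- p
      current <- names(p)
    }
  }
  kept
}

# Mirrors nodes 3 and 4 in the plan above, assuming the GroupByNode emits
# its columns as (n, cyl).
group_by_output <- c("n", "cyl")
projections <- list(
  c(cyl = "cyl", n = "n"),  # reorders the columns to what R expects
  c(cyl = "cyl", n = "n")   # same schema in and out, so redundant
)
kept <- drop_noop_projections(group_by_output, projections)
length(kept)
```

In this model only the reordering projection survives. A real fix would also have to handle computed expressions and the augmented fields from a Dataset scan, which this sketch ignores.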



--
This message was sent by Atlassian Jira
(v8.20.10#820010)