alamb opened a new issue, #4968:
URL: https://github.com/apache/arrow-datafusion/issues/4968

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   `ProjectionExec` can either have computations like (`col1` + `col2`) or it 
can be used to reorder / rename the columns
   
   The first use case benefits from repartitioning (as then the calculation can 
be done in multiple cores)
   
   The second use case (reordering / renaming) does not benefit from 
repartitioning, as it is simply a bookkeeping arrangement
   
   Basically we have a plan like
   
   ```text 
   ProjectionExec: expr=[f@0 as f]
     DeduplicateExec: [tag@1 ASC,time@2 ASC]
       SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
         UnionExec
   ```
   
   The repartition optimizer pass 
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_optimizer/repartition.rs
 then inserts a `RepartitionExec` as the input of the projection
   
   ```text
   ProjectionExec: expr=[f@0 as f]
     RepartitionExec: partitioning=RoundRobinBatch(4) <-- This repartition node is likely worthless
       DeduplicateExec: [tag@1 ASC,time@2 ASC]
         SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
           UnionExec
   ```
   
   **Describe the solution you'd like**
   I think `ProjectionExec` should only "benefit from partitioning" when its 
projection expressions actually contain calculations (i.e. are not just 
columns / aliases)
   
   This would likely mean defining `benefits_from_input_partitioning` 
   
https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/mod.rs#L176-L183
   
   For `impl ExecutionPlan for ProjectionExec`: 
https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/projection.rs#L151
   
   so that it returns true only if at least one expression is something other 
than a plain column reference / alias
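
A minimal sketch of the proposed check, using toy stand-in types rather than DataFusion's real ones. `Expr`, `Projection`, and `col` below are hypothetical names made up for illustration; in DataFusion the same decision would presumably be made by inspecting each physical expression held by `ProjectionExec`.

```rust
/// Toy stand-in for a physical expression (hypothetical, not DataFusion's type).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Expr {
    /// A plain column reference such as `f@0`.
    Column { name: String, index: usize },
    /// A computation such as `col1 + col2`.
    BinaryOp { left: Box<Expr>, op: char, right: Box<Expr> },
}

/// Toy stand-in for `ProjectionExec`: (expression, output name) pairs,
/// mirroring `expr=[f@0 as f]` in the plans above.
struct Projection {
    exprs: Vec<(Expr, String)>,
}

impl Projection {
    /// Returns true only if at least one expression does real work,
    /// i.e. is not just a column reference (possibly renamed).
    fn benefits_from_input_partitioning(&self) -> bool {
        self.exprs
            .iter()
            .any(|(expr, _alias)| !matches!(expr, Expr::Column { .. }))
    }
}

/// Convenience constructor for a column reference.
fn col(name: &str, index: usize) -> Expr {
    Expr::Column { name: name.to_string(), index }
}
```

With this rule, a projection like `f@0 as f` reports no benefit, so the optimizer would have no reason to add the `RepartitionExec`, while a projection containing `col1 + col2` still reports a benefit.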
   
   **Additional context**
   
   I think this is a good first issue, as the code and the desired behavior 
are fairly straightforward, and I suspect this would largely be an exercise 
in updating tests

