alamb opened a new issue #92:
URL: https://github.com/apache/arrow-datafusion/issues/92


   *Note*: migrated from original JIRA: 
https://issues.apache.org/jira/browse/ARROW-9464
   
   I would like to propose a refactor of the physical/execution planning based 
on the experience I have had in implementing distributed execution in Ballista.
   
   This will likely need subtasks but here is an overview of the changes I am 
proposing.
   h3. *Introduce physical plan optimization rule to insert "shuffle" operators*
   
   We should extend the ExecutionPlan trait so that each operator can specify 
its input and output partitioning needs, and then have an optimization rule 
that can insert any repartitioning or reordering steps required.
   
   For example, these are the methods to be added to ExecutionPlan. This design 
is based on Apache Spark.
   
    
   {code:java}
   /// Specifies how data is partitioned across different nodes in the cluster
   fn output_partitioning(&self) -> Partitioning {
       Partitioning::UnknownPartitioning(0)
   }
   
   /// Specifies the data distribution requirements of all the children for 
this operator
   fn required_child_distribution(&self) -> Distribution {
       Distribution::UnspecifiedDistribution
   }
   
   /// Specifies how data is ordered in each partition
   fn output_ordering(&self) -> Option<Vec<SortOrder>> {
       None
   }
   
   /// Specifies the data distribution requirements of all the children for 
this operator
   fn required_child_ordering(&self) -> Option<Vec<Vec<SortOrder>>> {
       None
   }
    {code}
   A good example of applying this rule would be in the case of hash aggregates 
where we perform a partial aggregate in parallel across partitions and then 
coalesce the results and apply a final hash aggregate.
   
   Another example would be a SortMergeExec specifying the sort order required 
for its children.
   
    
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to