[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #3855: [Phase1] Partition and Sort Enforcement rule

GitBox Tue, 18 Oct 2022 22:46:35 -0700


mingmwang commented on code in PR #3855:
URL: https://github.com/apache/arrow-datafusion/pull/3855#discussion_r998948538



##########
datafusion/core/src/physical_plan/mod.rs:
##########
@@ -201,15 +285,18 @@ pub trait ExecutionPlan: Debug + Send + Sync {
     ///
     /// The default implementation returns `true` unless this operator
     /// has signalled it requires a single child input partition.
-    fn benefits_from_input_partitioning(&self) -> bool {
+    fn prefer_parallel(&self) -> bool {
         // By default try to maximize parallelism with more CPUs if
-        // possible
-        !matches!(
-            self.required_child_distribution(),
-            Distribution::SinglePartition
-        )
+        // possibles
+        !self
+            .required_input_distribution()
+            .into_iter()
+            .any(|dist| matches!(dist, Distribution::SinglePartition))
     }
 
+    /// Get a list of equivalence properties within the plan

Review Comment:
   Sure, I will add more doc.  I will also refine the interface to make it 
clear, something like this.
   
   `fn equivalence_properties(&self) -> Vec<EquivalenceProperties>;`
   
   ```
   struct EquivalenceProperties {
      /// First element in the EquivalenceProperties
      head :  Column,
      /// Other equal columns
      others : Set<Column>,
   }
   ```
   
   And it doesn't allow duplicates or intersections, the two 
EquivalenceProperties will be combined/merged if there is intersection.
   
   See this method:
   
   ```
   pub fn combine_equivalence_properties(
       eq_properties: &mut Vec<Vec<Column>>,
       new_condition: (&Column, &Column),
   )
   ```
   
   And If an operator does not return any equivalence properties, it will not 
impact the correctness, but the optimizer might choose a non-optimal plan(with 
additional unnecessary SortExec/RepartitionExec)
   
   Equivalence property is a very useful concept, it is also mentioned in the 
"Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer" paper:
   
   _A set of columns that are known to have the same value in all tuples of a 
relation belong to a column equivalence class. An equivalence class may also 
contain a constant c, which implies that all column in the class have the value 
c. Equivalence classes are generated by equality predicates, typically equijoin 
conditions and equality comparisons with a constant._
   
   http://www.cs.albany.edu/~jhh/courses/readings/zhou10.pdf
   
   And PostgreSQL optimizer also implemented the EquivalenceClasses and the 
join reodering can leverage this.
   
   **EquivalenceClasses** selection
   
   
https://github.com/postgres/postgres/tree/f4c7c410ee4a7baa06f51ebb8d5333c169691dd3/src/backend/optimizer
   
   In future, I would suggest to introduce the equivalence properties to 
logical plan also, couple of logical optimization rules can benefit from it, 
like FilterPushDown, EliminateFilter, etc.
   
   A negative example is SparkSQL does not support the equivalence properties 
explicitly, it cause some performance issues when it try to infer additional 
filter constraints but causing exponential growth of filter constraints,  an 
interesting sharing. 
   
   See the section of rewriting Spark’s constraint propagation mechanism 
   
https://www.databricks.com/session_na21/optimizing-the-catalyst-optimizer-for-complex-plans
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #3855: [Phase1] Partition and Sort Enforcement rule

Reply via email to