mingmwang commented on code in PR #3855:
URL: https://github.com/apache/arrow-datafusion/pull/3855#discussion_r998948538
##########
datafusion/core/src/physical_plan/mod.rs:
##########
@@ -201,15 +285,18 @@ pub trait ExecutionPlan: Debug + Send + Sync {
///
/// The default implementation returns `true` unless this operator
/// has signalled it requires a single child input partition.
- fn benefits_from_input_partitioning(&self) -> bool {
+ fn prefer_parallel(&self) -> bool {
// By default try to maximize parallelism with more CPUs if
- // possible
- !matches!(
- self.required_child_distribution(),
- Distribution::SinglePartition
- )
+ // possibles
+ !self
+ .required_input_distribution()
+ .into_iter()
+ .any(|dist| matches!(dist, Distribution::SinglePartition))
}
+ /// Get a list of equivalence properties within the plan
Review Comment:
Sure, I will add more documentation. I will also refine the interface to make
it clearer, something like this:
`fn equivalence_properties(&self) -> Vec<EquivalenceProperties>;`
```
use std::collections::HashSet;

// `Column` is the physical-plan column expression type.
struct EquivalenceProperties {
    /// First element in the EquivalenceProperties
    head: Column,
    /// Other equal columns
    others: HashSet<Column>,
}
```
It does not allow duplicates or intersections: two EquivalenceProperties are
combined/merged if they share any column.
See this method:
```
pub fn combine_equivalence_properties(
eq_properties: &mut Vec<Vec<Column>>,
new_condition: (&Column, &Column),
)
```
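As a rough illustration of the merge rule, here is a minimal sketch. It is not the code in this PR: `Column` is a hypothetical stand-in for the real column type, and `add_equal_condition` and `EquivalenceClass` are names made up for the example. It assumes the existing classes are already pairwise disjoint, so any class that overlaps the new condition must contain one of its two columns.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the physical-plan `Column`; only the name matters here.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Column(pub String);

/// One equivalence class: columns known to hold equal values in every row.
pub type EquivalenceClass = BTreeSet<Column>;

/// Record the condition `left = right` while keeping the classes pairwise disjoint.
pub fn add_equal_condition(
    classes: &mut Vec<EquivalenceClass>,
    left: &Column,
    right: &Column,
) {
    // Start a new class from the condition itself.
    let mut merged: EquivalenceClass = BTreeSet::new();
    merged.insert(left.clone());
    merged.insert(right.clone());

    // Absorb every existing class that intersects the new one; keep the rest.
    let mut kept = Vec::with_capacity(classes.len() + 1);
    for class in classes.drain(..) {
        if class.contains(left) || class.contains(right) {
            merged.extend(class);
        } else {
            kept.push(class);
        }
    }
    kept.push(merged);
    *classes = kept;
}
```

With this sketch, adding `a = b` and then `c = d` leaves two classes, and adding `b = c` afterwards collapses them into the single class `{a, b, c, d}`.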
If an operator does not return any equivalence properties, correctness is not
affected, but the optimizer might choose a suboptimal plan (with additional
unnecessary SortExec/RepartitionExec).
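To make the "unnecessary RepartitionExec" point concrete, here is a hedged sketch of how a planner could compare partitioning keys modulo equivalences. The names `normalize` and `partitioning_satisfied` and the `Column` type are assumptions for illustration, and the position-by-position comparison is a simplification of whatever check the real planner performs.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the physical-plan `Column`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Column(pub String);

/// Map a column to a canonical representative of its class (the smallest
/// member), or to itself when it belongs to no class.
pub fn normalize(col: &Column, classes: &[BTreeSet<Column>]) -> Column {
    classes
        .iter()
        .find(|class| class.contains(col))
        .and_then(|class| class.iter().next().cloned())
        .unwrap_or_else(|| col.clone())
}

/// True when data hash-partitioned on `existing` already meets a requirement
/// to be hash-partitioned on `required`, comparing keys position by position
/// after normalization.
pub fn partitioning_satisfied(
    existing: &[Column],
    required: &[Column],
    classes: &[BTreeSet<Column>],
) -> bool {
    existing.len() == required.len()
        && existing
            .iter()
            .zip(required)
            .all(|(e, r)| normalize(e, classes) == normalize(r, classes))
}
```

If the input is partitioned on `a`, the parent requires partitioning on `b`, and the plan knows `a = b`, the check passes and no extra repartition is needed. With an empty list of classes the same check fails, which is the suboptimal but still correct case described above.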
Equivalence properties are a very useful concept. They are also described in the
"Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer" paper:
A set of columns that are known to have the same value in all tuples of a
relation belong to a column equivalence class. An equivalence class may also
contain a constant c, which implies that all columns in the class have the value
c. Equivalence classes are generated by equality predicates, typically equijoin
conditions and equality comparisons with a constant.
http://www.cs.albany.edu/~jhh/courses/readings/zhou10.pdf
The PostgreSQL optimizer also implements EquivalenceClasses, and its
join reordering can leverage them.
See the **EquivalenceClasses** section:
https://github.com/postgres/postgres/tree/f4c7c410ee4a7baa06f51ebb8d5333c169691dd3/src/backend/optimizer
In the future, I would suggest introducing equivalence properties to the
logical plan as well. A couple of logical optimization
rules can benefit from them, like FilterPushDown, EliminateFilter, etc.
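One way a logical rule could use this, sketched under assumptions: if a class also carries a constant (as in the SCOPE definition quoted above), every column in the class implies a `column = constant` filter that can be pushed to whichever input owns that column. The `EquivalenceClass` struct and `implied_filters` function below are hypothetical and not part of any existing API.

```rust
use std::collections::BTreeSet;

/// Hypothetical logical-plan equivalence class: a set of equal columns plus an
/// optional constant that all of them are known to equal.
#[derive(Debug, Clone, Default)]
pub struct EquivalenceClass {
    pub columns: BTreeSet<String>,
    pub constant: Option<i64>,
}

/// Derive the `column = constant` filters implied by a class. A class without
/// a constant implies none.
pub fn implied_filters(class: &EquivalenceClass) -> Vec<(String, i64)> {
    match class.constant {
        Some(value) => class
            .columns
            .iter()
            .map(|column| (column.clone(), value))
            .collect(),
        None => Vec::new(),
    }
}
```

For a query with `t1.id = t2.id AND t1.id = 5`, the class `{t1.id, t2.id}` with constant `5` yields one filter per column, so `t2.id = 5` can be pushed below the join even though the query never states it.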
A negative example is SparkSQL, which does not support equivalence properties
explicitly. This causes performance issues when it tries to infer additional
filter constraints, because the set of filter constraints can grow
exponentially. The talk below is an interesting read on this.
See the section on rewriting Spark's constraint propagation mechanism:
https://www.databricks.com/session_na21/optimizing-the-catalyst-optimizer-for-complex-plans