[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-------------------------
       Shepherd: Wenchen Fan  (was: Arnaud Doucet)
    Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

Proposing a new algorithm to create, store and use constraints for removing 
redundant filters & inferring new filters.
The current algorithm performs poorly on complex expressions involving aliases 
(for certain use cases the compilation time can run into hours), has the 
potential to cause OOM, may miss removing redundant filters, may miss creating 
IsNotNull constraints, and does not push compound predicates in Join.
 # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
compilation times. A test, "plan equivalence with case statements and 
performance comparison with benefit of more than 10x conservatively", has been 
added in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
*With this PR the compilation time is 247 ms vs 13958 ms without the change.*
 # It is more effective in filter pruning, as is evident in several tests in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite where 
the current code is unable to identify the redundant filter.
 # It generates a better-optimized plan for join queries, as it can push 
compound predicates.
 # The current logic can miss many opportunities to remove redundant 
predicates, as it fails to take into account that the same attribute or its 
aliases may be repeated multiple times in a complex expression (see the sketch 
after this list).
 # There are cases where the optimizer rules that remove redundant predicates 
fail to do so on the basis of constraint data alone. In some cases a rule works 
only by virtue of earlier rules compensating for the inaccuracy. That the 
ConstraintPropagation rule, whose job is to remove redundant filters and add 
newly inferred ones, depends on the behavior of unrelated preceding optimizer 
rules is itself indicative of a problem.
 # It does away with all the EqualNullSafe constraints, as this logic does not 
need them.
 # There is at least one test in the existing ConstraintPropagationSuite which 
is missing an IsNotNull constraint because the existing Constraints code 
incorrectly generates an EqualNullSafe constraint instead of an EqualTo 
constraint. With these changes, the test correctly creates an EqualTo 
constraint, resulting in an inferred IsNotNull constraint.
 # It does away with the current combinatorial logic of evaluating all possible 
constraints, which can cause compilation to run into hours or cause OOM. The 
number of constraints stored is exactly the same as the number of filters 
encountered.
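
To make the alias scenario concrete, here is a minimal, hypothetical 
illustration using the DataFrame API (table and column names are made up): the 
second filter is implied by the first one through the alias, so constraint 
propagation should prune it.
{code:scala}
// Hypothetical sketch; assumes a SparkSession `spark` and an existing table `t`.
import org.apache.spark.sql.functions.col

val df = spark.table("t")
  .where(col("a") > 10)                 // original filter on `a`
  .select(col("a").as("a1"), col("b"))  // `a1` is an alias of `a`
  .where(col("a1") > 10)                // redundant: implied through the alias
{code}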

h2. Q2. What problem is this proposal NOT designed to solve?

It mainly focuses on compile-time performance, but in some cases it can benefit 
runtime characteristics too, for example by inferring IsNotNull filters (which 
the present code may miss) or by pushing compound predicates down on the join 
(which the present code does not do).
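
As a hypothetical illustration of the compound-predicate case (table and column 
names are made up, assuming a SparkSession `spark`): when a filter references 
both join keys, an equivalent filter can be inferred for the other side of the 
join.
{code:scala}
import org.apache.spark.sql.functions.col

// Equi-join on two keys, with a compound filter on the `big` side.
val joined = spark.table("big").as("b")
  .join(spark.table("small").as("s"),
    col("b.x") === col("s.x") && col("b.y") === col("s.y"))
  .where(col("b.x") + col("b.y") > 100)
// Since both join keys appear in the predicate, the equivalent constraint
// (s.x + s.y > 100) is inferable on the `small` side; the present code only
// pushes single-column predicates such as (b.x > 5) => (s.x > 5).
{code}
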
h2. Q3. How is it done today, and what are the limits of current practice?

The current ConstraintsPropagation code pessimistically tries to generate all 
the possible combinations of constraints based on the aliases (and even then it 
may miss many combinations if the expression is complex, with the same 
attribute repeated multiple times within the expression and many aliases for 
that column). There are query plans in our production environment for which the 
intermediate number of constraints runs into hundreds of thousands, causing OOM 
or compilation times running into hours. There are also cases where it 
incorrectly generates an EqualNullSafe constraint instead of an EqualTo 
constraint, thus missing a possible IsNotNull constraint on a column.
It also pushes only single-column predicates to the other side of the join.
The constraints generated are, in some cases, missing the required ones, and 
the plan only appears to behave correctly because of a preceding, unrelated 
optimizer rule. There is a test which shows that with a bare-minimum set of 
rules containing RemoveRedundantPredicate, the removal of a redundant predicate 
is missed.
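
To see why the combinatorial expansion blows up, here is a back-of-the-envelope 
sketch (my own estimate, not the actual Catalyst code): if an attribute occurs 
k times in a filter expression and has n aliases, substituting each occurrence 
independently yields (n + 1)^k candidate constraints.
{code:scala}
// Back-of-the-envelope estimate, not the actual Catalyst logic.
def constraintVariants(occurrences: Int, aliases: Int): BigInt =
  BigInt(aliases + 1).pow(occurrences)

// e.g. a CASE expression referencing a column 10 times, with 4 aliases for it:
println(constraintVariants(10, 4)) // 9765625 variants from a single filter
{code}
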
h2. Q4. What is new in your approach and why do you think it will be successful?

It solves all the above-mentioned issues.
 # The number of constraints created is the same as the number of filters: no 
combinatorial creation of constraints, and no need for EqualNullSafe 
constraints on aliases.
 # It can remove redundant predicates on any expression involving aliases, 
irrespective of the number of repeated occurrences, in all possible 
combinations (see the sketch after this list).
 # It brings query compilation time down from hours to a few minutes.
 # It can push compound predicates on Joins and infer the right number of 
IsNotNull constraints, which can also positively impact query runtime.
 # The proposed algorithm has been running successfully in our (Workday) 
environment for months and has solved all the above issues.

h2. Q5. Who cares? If you are successful, what difference will it make?

For my company, Workday, it has fixed plans that previously failed due to OOM 
or compilation times running into 10 hours or so. I believe there have been 
previous attempts to fix this issue as well, but they did not make progress due 
to the complexity of the change.
The PR for this SPIP: [PR|https://github.com/apache/spark/pull/37870]
h2. Q6. What are the risks?

Well, the changes are fairly extensive, but thoroughly tested (old tests plus 
many new ones). A lot of tests have been added for the Union node, as the 
current constraints tests turned out to be insufficient for the Union case.
In that sense, given that all existing tests as well as the new tests pass 
cleanly, this is a safe PR.
h2. Q7. How long will it take?

The PR is already there and the implementation is done; whatever time is needed 
is for review and discussion.
[PR|https://github.com/apache/spark/pull/37870]
h2. Q8. What are the mid-term and final “exams” to check for success?

All tests should pass.
The perf benefit should justify the changes.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33152
>                 URL: https://issues.apache.org/jira/browse/SPARK-33152
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>


