[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603663#comment-17603663
 ] 

Apache Spark commented on SPARK-33152:
--------------------------------------

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/37870

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33152
>                 URL: https://issues.apache.org/jira/browse/SPARK-33152
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases( with certain use cases the compilation time can go into 
> hours), potential to cause OOM,  may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios,  does not push compound predicates in Join.
> # This issue if not fixed can cause OutOfMemory issue or unacceptable query 
> compilation times.
> Have added  a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
> # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
> # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
> # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account if same attribute or its aliases 
> are repeated multiple times in a complex expression.
> # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove on the basis of constraint data. In some 
> cases the rule works, just by the virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removal of redundant filters & addition of new inferred filters is dependent 
> on the working of some of the other unrelated previous optimizer rules is 
> behaving, is indicative of issues.
> # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
> # There is at least one test in existing ConstraintPropagationSuite which is 
> missing a IsNotNull constraints because the code incorrectly generated a 
> EqualsNullSafeConstraint instead of EqualTo constraint, when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint
> # It does away with the current combinatorial logic of evaluation all the 
> constraints can cause compilation to run into hours or cause OOM. The number 
> of constraints stored is exactly the same as the number of filters encountered
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases can benefit 
> run time characteristics too, like inferring IsNotNull filter or pushing down 
> compound predicates on the join, which currently may get missed/ does not 
> happen , respectively, by the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Current ConstraintsPropagation code,  pessimistically tries to generates all 
> the possible combinations of constraints , based on the aliases ( even then 
> it may miss a lot of  combinations if the expression is a complex expression 
> involving same attribute repeated multiple times within the expression and 
> there are many aliases to that column).  There are query plans in our 
> production env, which can result in intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours.  
> Also there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of EqualTo constraint , thus missing a possible IsNull 
> constraint on column. 
> Also it only pushes single column predicate on the other side of the join.
> The constraints generated , in some cases, are missing the required ones, and 
> the plan apparently is behaving correctly only due to the preceding unrelated 
> optimizer rule. Have Test which show that with the bare mnimum rules 
> containing RemoveRedundantPredicate, it misses the removal of redundant 
> predicate.
> h2. Q4. What is new in your approach and why do you think it will be 
> successful?
> It solves all the above mentioned issues. 
> # The number of constraints created are same as the number of filters.  No 
> combinatorial creation of constraints. No need for EqualsNullSafe constraint 
> on aliases.
> #  Can remove redundant predicates on any expression involving aliases 
> irrespective of the number of repeat occurences in all possible combination.
> # Brings down query compilation time to few minutes from hours.
> # Can push compound predicates on Joins & infer right number of IsNotNull 
> constraints which can impact query runtime also positively.
> # The proposed algorithm has been running successfully in our env.  (WorkDay) 
> for months & has solved all the above issues.
> h2. Q5. Who cares? If you are successful, what difference will it make?
> For My company WorkDay, it has solved the previously failing plans due to OOM 
> & compilation time running into 10 hrs or so. I suppose there have been 
> previous attempts too, to fix this issue, but did not make progress due to 
> complexity of change.
> The PR for the same is 
> [*PR for this SPIP*|https://github.com/apache/spark/pull/33983]
> h2. Q6. What are the risks?
> Well the changes are little extensive, but thoroughly tested ( old & many new 
> tests added). Have added a lot of tests for Union node, as found that current 
> constraints tests were not sufficient for Union case.
> So in that sense , given that all existing tests as well as new tests are 
> clean, this is a safe PR.
> h2. Q7. How long will it take?
> The PR is already there. Implementation already done.  whatever time needed 
> is for review and discussion.
> [*PR for this SPIP*|https://github.com/apache/spark/pull/33983]
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
> The perf benefit should justify the changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to