[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603663#comment-17603663 ]
Apache Spark commented on SPARK-33152:
--------------------------------------

User 'ahshahid' has created a pull request for this issue: https://github.com/apache/spark/pull/37870

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33152
>                 URL: https://issues.apache.org/jira/browse/SPARK-33152
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
> This proposes a new algorithm to create, store and use constraints for removing redundant filters and inferring new filters.
> The current algorithm performs poorly on complex expressions involving aliases (in certain use cases compilation time can run into hours), has the potential to cause OOM, may miss removing redundant filters in various scenarios, may miss creating IsNotNull constraints in various scenarios, and does not push compound predicates into joins.
> # If not fixed, this issue can cause OutOfMemory errors or unacceptable query compilation times. A test, "plan equivalence with case statements and performance comparison with benefit of more than 10x conservatively", has been added in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With this PR the compilation time is 247 ms vs 13958 ms without the change.*
> # It is more effective at filter pruning, as is evident in some of the tests in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where the current code fails to identify the redundant filter in some cases.
> # It generates a better optimized plan for join queries, as it can push compound predicates.
> # The current logic can miss many opportunities to remove redundant predicates, because it fails to take into account that the same attribute, or its aliases, may be repeated multiple times in a complex expression.
> # There are cases where optimizer rules that remove redundant predicates fail to do so on the basis of constraint data alone; where such a rule works, it is only because earlier rules happen to compensate for the inaccuracy. That the ConstraintPropagation rule, and its job of removing redundant filters and adding new inferred filters, depends on the behaviour of unrelated earlier optimizer rules is itself indicative of a problem.
> # It does away with all EqualNullSafe constraints, as the new logic does not need those constraints to be created.
> # There is at least one test in the existing ConstraintPropagationSuite which is missing an IsNotNull constraint because the existing Constraints code incorrectly generated an EqualNullSafe constraint instead of an EqualTo constraint. With these changes, the test correctly creates an EqualTo constraint, resulting in an inferred IsNotNull constraint.
> # It does away with the current combinatorial logic of evaluating all constraints, which can cause compilation to run for hours or cause OOM. The number of constraints stored is exactly the same as the number of filters encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can benefit runtime characteristics too, for example by inferring an IsNotNull filter (which the present code may miss) or by pushing down compound predicates on a join (which the present code does not do).
> h2. Q3. How is it done today, and what are the limits of current practice?
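The combinatorial behaviour called out in the list above can be illustrated with a small, purely illustrative Python sketch (plain Python, not Spark's Catalyst code): if an attribute with k aliases occurs m times in a filter expression, substituting the attribute or any alias independently at each occurrence yields (k + 1)^m equivalent constraint variants for a single filter, whereas the proposed scheme stores exactly one constraint per filter.

```python
# Illustrative sketch only (not Spark's actual constraint classes):
# count the constraint variants produced by substituting aliases
# independently at every occurrence of an attribute in an expression.
from itertools import product

def alias_variants(occurrences, names):
    """All rewrites of an expression in which each of `occurrences`
    attribute slots is filled by the base attribute or one of its aliases."""
    return list(product(names, repeat=occurrences))

names = ["x", "a1", "a2", "a3"]   # attribute x plus 3 aliases
m = 5                             # x is referenced 5 times in the filter
variants = alias_variants(m, names)
print(len(variants))              # (3 + 1) ** 5 = 1024 variants for ONE filter
```

With a few more aliases or occurrences this count reaches the hundreds of thousands of intermediate constraints described below, which is the source of the OOM and multi-hour compilation times.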
> The current ConstraintsPropagation code pessimistically tries to generate all possible combinations of constraints based on the aliases (and even then it may miss many combinations when the expression is complex, involving the same attribute repeated multiple times with many aliases to that column). There are query plans in our production environment in which the intermediate number of constraints runs into hundreds of thousands, causing OOM or compilation times running into hours.
> There are also cases where it incorrectly generates an EqualNullSafe constraint instead of an EqualTo constraint, thus missing a possible IsNotNull constraint on a column.
> Furthermore, it only pushes single-column predicates to the other side of a join.
> The constraints generated are, in some cases, missing the required ones, and the plan only appears to behave correctly because of a preceding, unrelated optimizer rule. A test shows that with a bare minimum set of rules containing RemoveRedundantPredicate, the removal of a redundant predicate is missed.
> h2. Q4. What is new in your approach and why do you think it will be successful?
> It solves all the issues mentioned above.
> # The number of constraints created is the same as the number of filters: no combinatorial creation of constraints, and no need for EqualNullSafe constraints on aliases.
> # It can remove redundant predicates on any expression involving aliases, irrespective of the number of repeated occurrences, in all possible combinations.
> # It brings query compilation time down from hours to a few minutes.
> # It can push compound predicates into joins and infer the right number of IsNotNull constraints, which can also positively impact query runtime.
> # The proposed algorithm has been running successfully in our environment (Workday) for months and has solved all the above issues.
> h2. Q5. Who cares? If you are successful, what difference will it make?
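One way to see why a one-constraint-per-filter scheme avoids the blow-up is the following simplified Python sketch. This is a hypothetical model, not the actual PR's implementation: it assumes every alias is rewritten back to its base attribute before predicates are compared, so a filter written over aliases is recognized as redundant no matter how many times the attribute repeats in the expression.

```python
# Hypothetical model of alias canonicalization (not the actual Spark PR):
# store one canonical constraint per filter, and test a new predicate for
# redundancy by canonicalizing it first.

def canonicalize(tokens, alias_to_base):
    """Rewrite every alias token in a predicate to its base attribute."""
    return tuple(alias_to_base.get(t, t) for t in tokens)

alias_to_base = {"a1": "x", "a2": "x"}

# One stored constraint for the filter `x + x > 10`.
stored = {canonicalize(("x", "+", "x", ">", "10"), alias_to_base)}

# A later filter written over aliases: `a1 + a2 > 10`.
new_filter = ("a1", "+", "a2", ">", "10")
print(canonicalize(new_filter, alias_to_base) in stored)  # True: redundant
```

Under this model the constraint store never grows beyond the number of filters seen, regardless of how many aliases exist, which is the property claimed in point 1 above.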
> For my company, Workday, it has solved the previously failing plans caused by OOM and by compilation times running into 10 hours or so. I believe there have been previous attempts to fix this issue, but they did not make progress due to the complexity of the change.
> The PR for the same is [*PR for this SPIP*|https://github.com/apache/spark/pull/33983]
> h2. Q6. What are the risks?
> The changes are fairly extensive, but thoroughly tested (old and many new tests added). A lot of tests have been added for the Union node, as the current constraints tests were found to be insufficient for the Union case.
> In that sense, given that all existing tests as well as the new tests are clean, this is a safe PR.
> h2. Q7. How long will it take?
> The PR is already there and the implementation is already done; whatever time is needed is for review and discussion.
> [*PR for this SPIP*|https://github.com/apache/spark/pull/33983]
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
> The perf benefit should justify the changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)