[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823510#comment-17823510 ]

Asif edited comment on SPARK-33152 at 3/5/24 6:43 PM:
------------------------------------------------------

[~tedjenks] .. Unfortunately I am not a committer. As part of Workday, I had 
opened this Jira and opened a PR to fix this issue completely, which required a 
different logic. The changes are extensive and they were never reviewed or 
discussed by the OS community. This PR has been in production at Workday for 
the past 3 years. 

As to why a check is not added, etc.:

That would be unclean and is not easy to implement in the current codebase, 
because it would result in various other issues, like new redundant filters 
being inferred and other messy bugs: the constraint code is sensitive to the 
constraints coming from each node below and the constraints available at the 
current node when deciding whether to create new filters or not.

Constraints are created per operator node (Project, Filter, etc.), and 
arbitrarily putting a limit on the constraints at a given operator will impact 
the new filters being created.
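To make that concrete, here is a deliberately simplified Python sketch of per-operator constraint propagation. This is not Spark's actual Catalyst code; the function names and the tuple encoding of constraints are made up for illustration. It shows how a Filter contributes its own predicate plus an inferred IsNotNull, and how a Project fans each child constraint out across every alias, so an arbitrary cap on the constraint set would silently drop the very constraints that downstream filter inference relies on:

```python
# Hypothetical, simplified model of per-operator constraint propagation
# (NOT Spark's actual Catalyst internals): each plan node derives its
# constraint set from its child's constraints plus its own predicate.

def filter_constraints(child_constraints, attr, op, value):
    """A Filter node adds its own predicate and an inferred IsNotNull
    on the filtered attribute to the constraints flowing up from below."""
    return child_constraints | {(op, attr, value), ("IsNotNull", attr)}

def project_constraints(child_constraints, aliases):
    """A Project re-expresses every child constraint for every alias of
    an attribute -- this is where the combinatorial growth comes from:
    n constraints times m names per attribute."""
    out = set()
    for c in child_constraints:
        op, attr = c[0], c[1]
        for name in [attr] + aliases.get(attr, []):
            out.add((op, name) + c[2:])
    return out

# Relation -> Filter(a > 0) -> Project(a, a AS a1, a AS a2)
after_filter = filter_constraints(set(), "a", ">", 0)          # 2 constraints
after_project = project_constraints(after_filter, {"a": ["a1", "a2"]})
print(len(after_project))  # 2 constraints x 3 names = 6
```

In this toy model, one filter plus two aliases already yields six constraints; capping the set at, say, four would make it unpredictable which alias's constraints survive, which is exactly the sensitivity described above.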


was (Author: ashahid7):
[~tedjenks] .. Unfortunately I am not a committer. As part of Workday, I had 
opened this Jira and opened a PR to fix this issue completely, which required a 
different logic. The changes are extensive and they were never reviewed or 
discussed by the OS community. This PR has been in production at Workday for 
the past 3 years. 

As to why a check is not added, etc.:

That would be unclean and is not easy to implement in the current codebase, 
because it would result in various other issues, like new/wrong filters being 
inferred and other messy bugs: the constraint code is sensitive to the 
constraints coming from each node below and the constraints available at the 
current node when deciding whether to create new filters or not.

Constraints are created per operator node (Project, Filter, etc.), and 
arbitrarily putting a limit on the constraints at a given operator will impact 
the new filters being created.

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33152
>                 URL: https://issues.apache.org/jira/browse/SPARK-33152
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters 
> in different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates in Join.
>  # This issue, if not fixed, can cause OutOfMemory issues or unacceptable 
> query compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or 
> its aliases are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works only by virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule & its function of 
> removing redundant filters & adding new inferred filters depends on the 
> behavior of other, unrelated, earlier optimizer rules is indicative of 
> issues.
>  # It does away with all the EqualNullSafe constraints, as this logic does 
> not need those constraints to be created.
>  # There is at least one test in the existing ConstraintPropagationSuite 
> which is missing an IsNotNull constraint because the code incorrectly 
> generated an EqualNullSafe constraint instead of an EqualTo constraint when 
> using the existing Constraints code. With these changes, the test correctly 
> creates an EqualTo constraint, resulting in an inferred IsNotNull constraint.
>  # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can 
> benefit run-time characteristics too, like inferring an IsNotNull filter or 
> pushing down compound predicates on the join, which the present code may 
> currently miss or not do, respectively.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate 
> all the possible combinations of constraints, based on the aliases (even 
> then it may miss a lot of combinations if the expression is a complex 
> expression involving the same attribute repeated multiple times within the 
> expression and there are many aliases to that column). There are query plans 
> in our production env which can result in the intermediate number of 
> constraints going into hundreds of thousands, causing OOM or taking time 
> running into hours. 
> Also, there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of an EqualTo constraint, thus missing a possible 
> IsNotNull constraint on the column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated are, in some cases, missing the required ones, and 
> the plan apparently behaves correctly only due to the preceding unrelated 
> optimizer rules. There is a test which shows that with the bare minimum 
> rules containing RemoveRedundantPredicate, it misses the removal of the 
> redundant predicate.
> h2. Q4. What is new in your approach and why do you think it will be 
> successful?
> It solves all the above-mentioned issues.
>  # The number of constraints created is the same as the number of filters. 
> No combinatorial creation of constraints. No need for EqualNullSafe 
> constraints on aliases.
>  # Can remove redundant predicates on any expression involving aliases, 
> irrespective of the number of repeat occurrences, in all possible 
> combinations.
>  # Brings down query compilation time to a few minutes from hours.
>  # Can push compound predicates on Joins & infer the right number of 
> IsNotNull constraints, which can also positively impact query runtime.
>  # The proposed algorithm has been running successfully in our env. 
> (Workday) for months & has solved all the above issues.
> h2. Q5. Who cares? If you are successful, what difference will it make?
> For my company, Workday, it has solved the previously failing plans due to 
> OOM & compilation time running into 10 hrs or so. I suppose there have been 
> previous attempts to fix this issue too, but they did not make progress due 
> to the complexity of the change.
> The PR for the same is 
> [PR|https://github.com/apache/spark/pull/37870]
> h2. Q6. What are the risks?
> Well, the changes are a little extensive, but thoroughly tested (old & many 
> new tests added). Have added a lot of tests for the Union node, as I found 
> that the current constraints tests were not sufficient for the Union case.
> So in that sense, given that all existing tests as well as new tests are 
> clean, this is a safe PR.
> h2. Q7. How long will it take?
> The PR is already there. The implementation is already done. Whatever time 
> is needed is for review and discussion.
> [PR|https://github.com/apache/spark/pull/37870]
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
> The perf benefit should justify the changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
