[ https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781976#comment-17781976 ]
Asif commented on SPARK-36786:
------------------------------

I had put this on the back burner since my changes were on 3.2, so I have to merge onto the latest branch. The optimizations I did on 3.2 are still applicable, as the drawback still exists, but the changes are going to be a little extensive. If there is interest in it I can pick it up after some days; right now I am occupied with another SPIP, which proposes changes for improving the performance of broadcast hash joins on non-partition-column joins.

> SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36786
>                 URL: https://issues.apache.org/jira/browse/SPARK-36786
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.
> The aim is to improve the compile-time performance of a query which, in WorkDay's use case, takes > 24 hrs (and eventually fails), down to < 8 min.
> To explain the problem, I will provide the context.
> The query plan in our production system is huge, with nested *case when* expressions (the level of nesting can be > 8), where each *case when* can sometimes have > 1000 branches.
> The plan can look like
> {quote}Project1
> |
> Filter1
> |
> Project2
> |
> Filter2
> |
> Project3
> |
> Filter3
> |
> Join
> {quote}
> Now the optimizer has a Batch of Rules, intended to run at most 100 times.
> *Also note that the batch will continue to run till one of these conditions is satisfied,*
> *i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is achieved).*
> One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.
>
> The first issue is the *PushDownPredicate* rule.
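As an illustration of the batch semantics just described, here is a minimal sketch of the fixed-point loop. This is toy Python, not Spark's Scala implementation; the nested-tuple plan representation and the simplified `collapse_project` rule are invented stand-ins.

```python
# Toy model of a Catalyst-style rule batch: re-apply every rule in order
# until numIter hits the cap or inputPlan == outputPlan (idempotency).

def run_batch(plan, rules, max_iterations=100):
    """Run the batch until a fixed point or the iteration cap."""
    for iteration in range(1, max_iterations + 1):
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:          # fixed point reached
            return new_plan, iteration
        plan = new_plan
    return plan, max_iterations

def collapse_project(plan):
    """Invented stand-in for CollapseProject: merge directly nested Projects."""
    if (isinstance(plan, tuple) and plan[0] == "Project"
            and isinstance(plan[1], tuple) and plan[1][0] == "Project"):
        return collapse_project(("Project", plan[1][1]))
    return plan

optimized, iters = run_batch(("Project", ("Project", ("Project", "Join"))),
                             [collapse_project])
# One pass collapses the Projects; a second pass confirms the fixed point.
```

The cost problem described in this SPIP comes from rules that keep changing the plan a little on every pass, forcing the loop to run many times before `inputPlan == outputPlan` holds.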
> It picks one filter at a time and pushes it to the lowest level (I understand that in 3.1 it pushes through a Join, while in 2.4 it stops at the Join), but in either case it picks 1 filter at a time, starting from the top, in each iteration.
> *The above comment is no longer true in the 3.1 release, as it now combines filters, so it does push all the encountered filters in a single pass. But it still materializes the filter on each push by re-aliasing.*
> So if there are, say, 50 Projects interspersed with Filters, idempotency is guaranteed not to be achieved for around 49 iterations.
> Moreover, CollapseProject will also be modifying the tree on each iteration, as a Filter will get removed from within a Project.
> Also, on each movement of a filter through the Project tree, the filter is re-aliased using a transformUp rule. transformUp is very expensive compared to transformDown, and as the filter keeps getting pushed down, its size increases.
> To optimize this rule, 2 things are needed:
> # Instead of pushing one filter at a time, collect all the filters as we traverse the tree in that iteration itself.
> # Do not re-alias the filters on each push. Collect the sequence of Projects they have passed through, and once the filters have reached their resting place, do the re-alias by processing the collected Projects in a down-to-up manner.
> This will result in achieving idempotency in a couple of iterations.
> *How reducing the number of iterations helps performance:*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals (there are around 6 more such rules)* which traverse the tree using transformUp, and they run unnecessarily in each iteration, even when the expressions in an operator have not changed since the previous runs.
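Both proposed optimizations can be sketched with a toy model. This is plain Python, not Catalyst: the list-of-operators plan, the token-level alias maps, and every function name here are invented for illustration (real Catalyst aliases map attributes to expression trees, and the rewrite is a transformUp over the filter's expression). The first half contrasts one-filter-per-pass with collecting all filters in a single traversal; the second half contrasts re-aliasing at every Project with one deferred rewrite built by folding the collected alias maps down-to-up.

```python
def passes_to_fixpoint(rule, plan):
    """Re-apply `rule` until the plan stops changing; count the passes."""
    count = 0
    while True:
        count += 1
        new_plan = rule(plan)
        if new_plan == plan:
            return plan, count
        plan = new_plan

# --- Optimization 1: collect all filters in one traversal ---------------

def push_one_filter(plan):
    """Old behaviour: per pass, move one Filter below the Projects under it."""
    for i, op in enumerate(plan):
        if op.startswith("Filter"):
            below = plan[i + 1:]
            j = 0
            while j < len(below) and below[j].startswith("Project"):
                j += 1
            if j > 0:                      # Projects to push through
                return plan[:i] + below[:j] + [op] + below[j:]
    return plan

def push_all_filters(plan):
    """Proposed behaviour: collect every Filter during a single traversal
    and re-emit them just above the Join (the last operator)."""
    filters = [op for op in plan if op.startswith("Filter")]
    others = [op for op in plan if not op.startswith("Filter")]
    return others[:-1] + filters + others[-1:]

plan = ["Filter1", "Project1", "Filter2", "Project2",
        "Filter3", "Project3", "Join"]
plan_a, slow_passes = passes_to_fixpoint(push_one_filter, plan)
plan_b, fast_passes = passes_to_fixpoint(push_all_filters, plan)

# --- Optimization 2: defer the re-alias until the resting place ---------

def realias_eagerly(expr, alias_maps):
    """Status quo: rewrite the whole filter expression at every Project."""
    rewrites = 0
    for aliases in alias_maps:             # Projects, top-down
        expr = tuple(aliases.get(t, t) for t in expr)
        rewrites += 1
    return expr, rewrites

def realias_deferred(expr, alias_maps):
    """Proposal: fold the collected Project alias maps in down-to-up order
    into one substitution, then rewrite the filter exactly once."""
    composed = {}
    for aliases in reversed(alias_maps):   # bottom-most Project first
        folded = {k: composed.get(v, v) for k, v in aliases.items()}
        for k, v in composed.items():
            folded.setdefault(k, v)
        composed = folded
    return tuple(composed.get(t, t) for t in expr), 1

maps = [{"a": "b"}, {"b": "c"}, {"c": "col0"}]  # three Projects, top-down
eager = realias_eagerly(("a", "x"), maps)       # three full rewrites
deferred = realias_deferred(("a", "x"), maps)   # same result, one rewrite
```

On this tiny plan the one-at-a-time rule needs several passes to reach the fixed point while the collect-all rule needs two (one to move the filters, one to confirm nothing changed), mirroring the ~49 iterations for 50 Projects noted above. Both re-alias strategies produce the same final expression, but the deferred one touches the (in practice, ever-growing) filter expression only once.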
> *I have a different proposal, which I will share later, as to how to avoid running the above rules unnecessarily, if it can be guaranteed that the expressions in the operator are not going to mutate.*
> The cause of our huge compilation time has been identified as the above.
>
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> As mentioned above, currently PushDownPredicate pushes one filter at a time and, at each Project, materializes the re-aliased filter. This results in a large number of iterations to achieve idempotency. The immediate materialization of the Filter after each Project pass also results in unnecessary traversals of the filter's expression tree, using transformUp at that, and the expression tree of the filter is bound to keep growing as it is pushed down.
> h2. Q4. What is new in your approach and why do you think it will be successful?
> In the new approach we push all the filters down in a single pass, and we do not materialize a filter as it passes through a Project. Instead we keep collecting the Projects in sequential order and materialize the final filter once its final position is reached (above a Join in the case of 2.4, or above the base relation, etc.).
> This approach, when coupled with the logic of identifying those Project operators whose expressions will not mutate (which I will share later), so that rules like
> NullPropagation,
> OptimizeIn,
> LikeSimplification,
> BooleanSimplification,
> SimplifyConditionals,
> RemoveDispensableExpressions,
> SimplifyBinaryComparison,
> SimplifyCaseConversionExpressions,
> SimplifyExtractValueOps
> are applied only in the first pass on the expressions of that Project operator, has reduced the compilation time of the offending queries to under 8 mins from 24 hrs or more.
>
> h2. Q5. Who cares?
> If you are successful, what difference will it make?
> For my company, WorkDay, it will fix the currently failing plans, which fail due to OOM and compilation times running into 24 hrs or so. I have a PR for this locally and will publish it in some time.
>
> h2. Q6. What are the risks?
> The risk in the change to PushDownPredicate is very low.
> The next proposal, identifying Project operators whose expressions will be immutable so that the above set of rules runs only once, has some relatively complex logic, but with extra test coverage it should be safe.
> h2. Q7. How long will it take?
> The basic changes are already in place. Tests will take time: around 10-15 days.
> h2. Q8. What are the mid-term and final "exams" to check for success?
> All tests should pass.
> The perf benefit should justify the changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)