[ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781976#comment-17781976
 ] 

Asif commented on SPARK-36786:
------------------------------

I had put this on back burner as my changes were on 3.2, so I have to do a 
merge . on latest. Though whatever optimizations I did on 3.2 are still 
applicable as the drawback still exist. But chnages are going to be a a little 
extensive.
If there is interest on it I can pick up , after some days as right now 
occupied with another spip which proposes chnages for improving perf of 
broadcast hash joins on non partition column joins.
 

> SPIP: Improving the compile time performance, by improving  a couple of 
> rules, from 24 hrs to under 8 minutes
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36786
>                 URL: https://issues.apache.org/jira/browse/SPARK-36786
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> The aim is to improve the compile time performance of query which in 
> WorkDay's use case takes > 24 hrs ( & eventually fails) , to  < 8 min.
> To explain the problem, I will provide the context.
> The query plan in our production system, is huge, with nested *case when* 
> expressions ( level of nesting could be >  8) , where each *case when* can 
> have branches sometimes > 1000.
> The plan could look like
> {quote}Project1
>     |
>    Filter 1
>     |
> Project2
>     |
>  Filter2
>     |
>  Project3
>     |
>  Filter3
>   |
> Join
> {quote}
> Now the optimizer has a Batch of Rules , intended to run at max 100 times.
> *Also note that the, the batch will continue to run till one of the condition 
> is satisfied*
> *i.e  either numIter == 100 || inputPlan == outputPlan (idempotency is 
> achieved)*
> One of the early  Rule is   *PushDownPredicateRule.*
> **Followed by **CollapseProject**.
>  
> The first issue is *PushDownPredicate* rule.
> It picks  one filter at a time & pushes it at lowest level ( I understand 
> that in 3.1 it pushes through join, while in 2.4 it stops at Join) , but 
> either case it picks 1 filter at time starting from top, in each iteration.
> *The above comment is no longer true in 3.1 release as it now combines 
> filters. so it does push now all the encountered filters in a single pass. 
> But it still materializes the filter on each push by realiasing.*
> So if there are say  50 projects interspersed with Filters , the idempotency 
> is guaranteedly not going to get achieved till around 49 iterations. 
> Moreover, CollapseProject will also be modifying tree on each iteration as a 
> filter will get removed within Project.
> Moreover, on each movement of filter through project tree, the filter is 
> re-aliased using transformUp rule.  transformUp is very expensive compared to 
> transformDown. As the filter keeps getting pushed down , its size increases.
> To optimize this rule , 2 things are needed
>  # Instead of pushing one filter at a time,  collect all the filters as we 
> traverse the tree in that iteration itself.
>  # Do not re-alias the filters on each push. Collect the sequence of projects 
> it has passed through, and  when the filters have reached their resting 
> place, do the re-alias by processing the projects collected in down to up 
> manner.
> This will result in achieving idempotency in a couple of iterations. 
> *How reducing the number of iterations help in performance*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
> ( ... there are around 6 more such rules)*  which traverse the tree using 
> transformUp, and they run unnecessarily in each iteration , even when the 
> expressions in an operator have not changed since the previous runs.
> *I have a different proposal which I will share later, as to how to avoid the 
> above rules from running unnecessarily, if it can be guaranteed that the 
> expression is not going to mutate in the operator.* 
> The cause of our huge compilation time has been identified as the above.
>   
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Like mentioned above , currently PushDownPredicate pushes one filter at a 
> time  & at each Project , it materialized the re-aliased filter.  This 
> results in large number of iterations to achieve idempotency as well as 
> immediate materialization of Filter after each Project pass,, results in 
> unnecessary tree traversals of filter expression that too using transformUp. 
> and the expression tree of filter is bound to keep increasing as it is pushed 
> down.
> h2. Q4. What is new in your approach and why do you think it will be 
> successful?
> In the new approach we push all the filters down in a single pass. And do not 
> materialize filters as it pass through Project. Instead keep collecting 
> projects in sequential order and materialize the final filter once its final 
> position is achieved ( above a join , in case of 2.1 , or above the base 
> relation etc).
> This approach when coupled with the logic of identifying those Project 
> operator whose expressions will not mutate ( which I will share later) , so 
> that rules like 
> NullPropagation,
>  OptimizeIn.,
>  LikeSimplification.,
>  BooleanSimplification.,
>  SimplifyConditionals.,
>  RemoveDispensableExpressions.,
>  SimplifyBinaryComparison.,
>  SimplifyCaseConversionExpressions.,
>  SimplifyExtractValueOps
>   
> are applied only in first pass on the expressions of that Project operator,  
> the compilation time of offending queries have been reduced to under 8 mins 
> from 24 hrs or more.
>  
> h2. Q5. Who cares? If you are successful, what difference will it make?
> For My company WorkDay, it will solve the currently failing plans due to OOM 
> & compilation time running into 24 hrs or so. I have a PR for this locally, 
> will publish it in some  time.
>  
> h2. Q6. What are the risks?
> The risk in the change of PushDownPredicate is very low. 
>  For the next proposal of identifying Project operator whose expressions will 
> be immutable such that the above set of rules run only once has some 
> relatively complex logic but with extra tests coverage it should be safe.
> h2. Q7. How long will it take?
> The basic changes are already in place. tests will take time. around 10 -15 
> days.
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
>  The perf benefit should justify the changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to