Re: [PR] [SPARK-45959][SQL] Improving performance when addition of 1 column at a time causes increase in the LogicalPlan tree depth [spark]

via GitHub Wed, 29 Nov 2023 14:00:39 -0800


ahshahid commented on PR #43854:
URL: https://github.com/apache/spark/pull/43854#issuecomment-1832769219


   @JoshRosen: Sorry, I did not notice that you had left comments. For some 
reason I do not receive any emails for PR review comments.
   Yes, I completely agree that these huge plan trees cause extreme performance 
degradation, especially in Spark 3, where the plans are cloned and the analyzer 
and optimizer rules have been modified in ways that cause multiple tree 
traversals.
   In fact, as part of WorkDay, I have modified the optimizer rules in multiple 
ways: PushDownPredicateThroughNonJoin (it pushes one predicate at a time and 
re-aliases immediately; it should push all filters together as it traverses 
down, re-alias only at the end once it has reached the right location, and 
re-alias expressions from bottom to top), CollapseProject (same: re-alias at the 
end, keep collecting projects and then collapse from bottom to top), combining 
some of the optimizer rules so that they apply together in a single pass, and 
lastly controlled expansion of collapsed projects to identify repeated 
expressions so that they are evaluated once.
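
   The "keep collecting projects and then collapse from bottom to top" idea can 
be sketched with a toy model. This is not Spark code: the plan representation, 
`substitute`, and `collapse_projects` are hypothetical names used only for 
illustration, and real CollapseProject also has to deal with aliases, 
non-deterministic expressions, and so on.

```python
# Toy model (an assumption for illustration, not Catalyst): a stack of
# projections is a list of dicts, innermost (bottom) first. Each dict maps
# an output column name to an expression. An expression is a column name
# (str), a literal (any non-str, non-tuple value), or a tuple
# (operator, arg1, arg2, ...).

def substitute(expr, bindings):
    """Replace column references in expr with their current definitions."""
    if isinstance(expr, str):
        return bindings.get(expr, expr)
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(substitute(a, bindings) for a in args))
    return expr  # literal, left untouched


def collapse_projects(projections):
    """Collapse all stacked projections in one bottom-to-top walk."""
    bindings = {}
    for proj in projections:  # innermost first
        bindings = {name: substitute(e, bindings) for name, e in proj.items()}
    return bindings


stack = [
    {"a": "a", "b": ("+", "a", 1)},             # bottom: adds b = a + 1
    {"a": "a", "b": "b", "c": ("*", "b", 2)},   # top: adds c = b * 2
]
collapsed = collapse_projects(stack)
```

   The whole stack is visited once, instead of merging one pair of projections 
per rule invocation.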
   
   Anyway, in this change, as you rightly mentioned, the problem is the cached 
plan lookup.
   I have provided a small explanation of the logic in the CacheManager lookup 
code.
   I am handling it in a two-pass manner.
   There would be no user-visible change.
   For any given subplan, the first pass looks for an exact match.
   If that does not work, then, given that adding or renaming columns results 
in a modification of an existing Project that has previously been cached, the 
child plans of the two Projects (incoming and cached) would definitely match.
   So in the second pass, the cache lookup tries to find those cached plans 
whose child matches the incoming plan's child.
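
   A minimal sketch of the two-pass lookup described above, using a toy plan 
model. The `Relation` and `Project` classes and the `lookup` function are 
hypothetical stand-ins, and plain equality stands in for whatever canonicalized 
comparison the real lookup uses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Project:
    exprs: Tuple[str, ...]
    child: object


def lookup(cache, plan):
    """Return (kind, cached_plan), or None on a cache miss."""
    # Pass 1: look for an exact match of the whole plan.
    for cached in cache:
        if cached == plan:
            return ("exact", cached)
    # Pass 2: for a Project, look for a cached Project over the same child.
    if isinstance(plan, Project):
        for cached in cache:
            if isinstance(cached, Project) and cached.child == plan.child:
                return ("child_match", cached)
    return None


base = Relation("t")
cached_plan = Project(("a", "b"), base)
cache = [cached_plan]

exact = lookup(cache, Project(("a", "b"), base))
partial = lookup(cache, Project(("a", "b", "a + b AS c"), base))
miss = lookup(cache, Project(("a",), Relation("u")))
```

   A child match is only a candidate. Whether it can actually be reused is 
decided by the attribute checks described next.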
   Once that is done, the code uses the canonicalized versions of the incoming 
Project and the cached plan's Project to ascertain that:
   1) All the output attributes of the cached plan's Project match the incoming 
Project's NamedExpressions that are AttributeReferences (these are the 
pass-through attributes). The code also identifies the index mapping from each 
incoming attribute's index to the cached plan's output.
   2) For the remaining Aliases of the incoming Project, the child expressions 
are all functions of the pass-through attributes. This means that they are 
added columns, which can be computed from the cached plan Project's output.
   Based on that, a new Project is created whose expressions are defined in 
terms of the output of the cached plan's Project.
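
   The two checks can be sketched as follows. This is a toy model under stated 
assumptions: `Attr`, `Alias`, and `rewrite_over_cache` are hypothetical names, 
attributes are compared by name rather than by canonicalized expression IDs, 
and an alias expression is kept as opaque text plus the set of attribute names 
it reads.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Alias:
    name: str
    expr: str                    # expression text, opaque in this sketch
    references: FrozenSet[str]   # attribute names the expression reads


def rewrite_over_cache(cached_output, incoming):
    """Try to express `incoming` on top of the cached Project's output.

    Returns (index_mapping, new_project_exprs), or None if the cached
    Project cannot be reused.
    """
    position = {name: i for i, name in enumerate(cached_output)}
    pass_through = {e.name for e in incoming if isinstance(e, Attr)}
    # Check 1: the cached output must be exactly the pass-through attributes
    # (a dropped column therefore fails here).
    if pass_through != set(cached_output):
        return None
    # Index mapping: incoming attribute index -> cached output index.
    mapping = {
        i: position[e.name]
        for i, e in enumerate(incoming)
        if isinstance(e, Attr)
    }
    # Check 2: every added column reads only pass-through attributes.
    for e in incoming:
        if isinstance(e, Alias) and not e.references <= pass_through:
            return None
    # The new Project evaluates the same expressions over the cached output.
    return mapping, list(incoming)


cached_output = ["a", "b"]
incoming = [Attr("b"), Attr("a"), Alias("c", "a + b", frozenset({"a", "b"}))]
result = rewrite_over_cache(cached_output, incoming)
```

   Here `result[0]` is `{0: 1, 1: 0}`: the incoming Project lists `b` before 
`a`, so its attribute positions are swapped relative to the cached output.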
   The cached representation is changed to an EitherOr. If the value is an 
InMemoryRelation, the lookup code creates a new InMemoryRelation with the new 
output. But if it is a LogicalPlan, that means it was a partial match, and 
hence the appropriate projection is applied and returned for use.
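
   A rough sketch of how a caller might branch on the two cases. The classes 
and the `use_cached` function are hypothetical and much simpler than the real 
ones. In particular, how the projection is built for a partial match is an 
assumption here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InMemoryRelation:
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    exprs: Tuple[str, ...]
    child: object


def use_cached(cached, requested_output):
    """Turn a cache hit into the plan fragment to use."""
    if isinstance(cached, InMemoryRelation):
        # Exact match: reuse the relation, exposing the requested output.
        return InMemoryRelation(tuple(requested_output))
    # Partial match: the cached value is a plan, so project on top of it.
    return Project(tuple(requested_output), cached)


exact_hit = use_cached(InMemoryRelation(("a", "b")), ["a", "b"])
partial_hit = use_cached(
    Project(("a", "b"), InMemoryRelation(("a", "b"))),
    ["a", "b", "a + b AS c"],
)
```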
   
   Similar logic is applied to renames as well.
   
   Dropping a column is not handled.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

