[ 
https://issues.apache.org/jira/browse/SPARK-45959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45959:
-------------------------
    Description: 
Though the documentation clearly recommends adding all columns in a single 
shot, in reality it is difficult to expect customers to modify their code: in 
Spark 2 the analyzer rules did not do deep tree traversals, and moreover in 
Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
etc., which was not the case in Spark 2.
All of this has resulted in query time increasing from 5 minutes to 2-3 hours.
Many times the columns are added to the plan via some for-loop logic which 
just keeps adding new computations based on some rule.
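
For illustration only, a minimal sketch of the pattern being described; the 
column names, loop bound, and data are made up, and withColumns is available 
from Spark 3.3:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val base = Seq((1, 2), (3, 4)).toDF("a", "b")

// Anti-pattern: every withColumn call wraps the plan in one more Project,
// so N iterations yield an analyzed plan that is N Projects deep.
var looped = base
for (i <- 1 to 200) {
  looped = looped.withColumn(s"c$i", col("a") + i)
}

// Recommended: build the new columns first and add them in a single shot
// (withColumns, Spark 3.3+), which produces a single Project node.
val allAtOnce =
  base.withColumns((1 to 200).map(i => s"c$i" -> (col("a") + i)).toMap)

looped.explain(true)
allAtOnce.explain(true)
{code}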

So my suggestion is to collapse the Projects early, once analysis of the 
logical plan is done but before the plan gets assigned to the field variable 
in QueryExecution.
The PR for the above is ready for review.
The major change is in the way the lookup is performed in CacheManager.
I have described the logic in the PR and have added multiple tests.
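
For context, a rough sketch of what the collapse amounts to, expressed with 
the existing CollapseProject optimizer rule; the actual change in the PR 
(including the CacheManager lookup) is not reproduced here:
{code:scala}
import org.apache.spark.sql.catalyst.optimizer.CollapseProject
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Conceptual illustration only: apply the existing CollapseProject rule to an
// already-analyzed plan, merging adjacent Project nodes where it is safe.
def collapseEarly(analyzed: LogicalPlan): LogicalPlan = CollapseProject(analyzed)

// Counting Project nodes before and after makes the effect visible, e.g.
// projectCount(df.queryExecution.analyzed) vs
// projectCount(collapseEarly(df.queryExecution.analyzed)).
def projectCount(plan: LogicalPlan): Int =
  plan.collect { case p: Project => p }.size
{code}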

  was:
Though the documentation clearly recommends adding all columns in a single 
shot, in reality it is difficult to expect customers to modify their code: in 
Spark 2 the analyzer rules did not do deep tree traversals, and moreover in 
Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
etc., which was not the case in Spark 2.
All of this has resulted in query time increasing from 5 minutes to 2-3 hours.
Many times the columns are added to the plan via some for-loop logic which 
just keeps adding new computations based on some rule.

So my suggestion is to do some initial check in the withColumn API before 
creating a new projection: if all the existing columns are still being 
projected, and the new column's expression does not depend on the output of 
the top node but only on its child, then instead of adding a new Project the 
column can be added to the existing node.
For starters, maybe we can just handle the Project node.
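
A hypothetical sketch of that check; the helper name and its simplifications 
(ignoring column replacement, metadata, etc.) are mine, not the actual 
withColumn implementation:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Hypothetical helper: if the plan is already a Project and the new column
// only references the Project's child, fold the column into the existing
// projection list instead of stacking another Project on top.
def addColumn(plan: LogicalPlan, newCol: NamedExpression): LogicalPlan = plan match {
  case p @ Project(projectList, child) if newCol.references.subsetOf(child.outputSet) =>
    p.copy(projectList = projectList :+ newCol)
  case other =>
    Project(other.output :+ newCol, other)
}
{code}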


> SPIP: Abusing DataSet.withColumn can cause huge tree with severe perf 
> degradation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-45959
>                 URL: https://issues.apache.org/jira/browse/SPARK-45959
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Asif
>            Priority: Minor
>              Labels: pull-request-available
>
> Though the documentation clearly recommends adding all columns in a single 
> shot, in reality it is difficult to expect customers to modify their code: 
> in Spark 2 the analyzer rules did not do deep tree traversals, and moreover 
> in Spark 3 the plans are cloned before being handed to the analyzer, 
> optimizer, etc., which was not the case in Spark 2.
> All of this has resulted in query time increasing from 5 minutes to 2-3 
> hours.
> Many times the columns are added to the plan via some for-loop logic which 
> just keeps adding new computations based on some rule.
> So my suggestion is to collapse the Projects early, once analysis of the 
> logical plan is done but before the plan gets assigned to the field variable 
> in QueryExecution.
> The PR for the above is ready for review.
> The major change is in the way the lookup is performed in CacheManager.
> I have described the logic in the PR and have added multiple tests.



