Asif created SPARK-45959:
----------------------------

             Summary: Abusing DataSet.withColumn can cause huge tree with 
severe perf degradation
                 Key: SPARK-45959
                 URL: https://issues.apache.org/jira/browse/SPARK-45959
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Asif


Though the documentation clearly recommends adding all columns in a single 
shot, in reality it is difficult to expect customers to modify their code: in 
Spark 2 the analyzer rules did not do deep tree traversals, and moreover in 
Spark 3 the plans are cloned before being handed to the analyzer, optimizer, 
etc., which was not the case in Spark 2.
Together these changes have increased query times from 5 min to 2 - 3 hrs.
Many times the columns are added to the plan via some for-loop logic which 
just keeps adding new computation based on some rule, as in the sketch below.
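
For illustration, a minimal sketch of the two patterns (the helper names are 
hypothetical). Each withColumn call in the loop wraps the plan in another 
Project node, so adding n columns leaves n nested Projects for the analyzer to 
traverse, whereas a single withColumns call (public since Spark 3.3) produces 
one Project:

{code:scala}
import org.apache.spark.sql.{DataFrame, functions => F}

// Anti-pattern: every iteration stacks another Project on top of the plan,
// so adding n columns leaves n nested Project nodes in the logical plan.
def addColsInLoop(df: DataFrame, n: Int): DataFrame =
  (0 until n).foldLeft(df)((d, i) => d.withColumn(s"c$i", F.col("id") + i))

// Recommended: add all columns in a single shot; only one Project is added.
def addColsAtOnce(df: DataFrame, n: Int): DataFrame =
  df.withColumns((0 until n).map(i => s"c$i" -> (F.col("id") + i)).toMap)
{code}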

So my suggestion is to do an initial check in the withColumn API before 
creating a new projection: if all the existing columns are still being 
projected, and the new column's expression depends not on the output of the 
top node but only on its child, then instead of adding a new Project the 
column can be added to the existing node.
For a start, maybe we can handle just the Project node; a rough sketch 
follows.
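
A minimal sketch of that check, assuming we only fold into a top-level Project 
and ignoring details the real withColumn handles (such as replacing an 
existing column of the same name):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Hypothetical helper: fold the new column into an existing Project when
// its expression only reads the child's output, instead of stacking a new
// Project on every withColumn call.
def addColumnFlattened(plan: LogicalPlan, newCol: NamedExpression): LogicalPlan =
  plan match {
    case p @ Project(projectList, child)
        if newCol.references.subsetOf(child.outputSet) =>
      // All existing columns stay projected; just append the new one.
      p.copy(projectList = projectList :+ newCol)
    case _ =>
      // Fall back to today's behaviour: wrap the plan in a new Project.
      Project(plan.output :+ newCol, plan)
  }
{code}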


