Asif created SPARK-45959:
----------------------------

             Summary: Abusing DataSet.withColumn can cause huge tree with severe perf degradation
                 Key: SPARK-45959
                 URL: https://issues.apache.org/jira/browse/SPARK-45959
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Asif
Though the documentation clearly recommends adding all columns in a single call, in practice it is difficult to expect customers to modify their code. In Spark 2 the analyzer rules did not perform deep tree traversals; moreover, Spark 3 clones the plans before handing them to the analyzer, optimizer, etc., which was not the case in Spark 2. Together these changes have increased query time from 5 min to 2 - 3 hrs. In many cases the columns are added to the plan via some for-loop logic that keeps appending new computations based on some rule.

So my suggestion is to do an initial check in the withColumn API before creating a new projection: if all the existing columns are still being projected, and the new column's expression depends not on the output of the top node but on that of its child, then instead of adding a new Project node, the column can be added to the existing one. For a start, maybe we can handle just the Project node.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
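To illustrate the problem and the proposed check, here is a minimal sketch in plain Python (no Spark dependency). `Relation`, `Project`, and the `with_column_*` helpers are simplified stand-ins for Catalyst concepts, not Spark's actual API; expressions are modeled only by the set of column names they reference, which is all the collapsing condition needs.

```python
class Relation:
    """A leaf plan node (e.g. a scan) producing a fixed set of columns."""
    def __init__(self, cols):
        self.cols = list(cols)

    def depth(self):
        return 1


class Project:
    """A projection node: a mapping of output name -> referenced input names."""
    def __init__(self, columns, child):
        self.columns = dict(columns)
        self.child = child

    def depth(self):
        return 1 + self.child.depth()


def output(plan):
    """Names produced by a plan node."""
    return set(plan.columns) if isinstance(plan, Project) else set(plan.cols)


def with_column_naive(plan, name, deps):
    # Current behaviour: always wrap the plan in a new Project,
    # so N withColumn calls build a tree of depth N.
    cols = {c: {c} for c in output(plan)}
    cols[name] = set(deps)
    return Project(cols, plan)


def with_column_collapsing(plan, name, deps):
    # Proposed behaviour: if the top node is already a Project that still
    # projects all of its child's columns, and the new expression only
    # references the child's output (not the top node's), extend the
    # existing Project instead of stacking a new one.
    if (isinstance(plan, Project)
            and output(plan.child) <= set(plan.columns)
            and set(deps) <= output(plan.child)):
        cols = dict(plan.columns)
        cols[name] = set(deps)
        return Project(cols, plan.child)
    return with_column_naive(plan, name, deps)


base = Relation(["a", "b"])
naive = collapsed = base
for i in range(100):
    naive = with_column_naive(naive, f"c{i}", {"a"})
    collapsed = with_column_collapsing(collapsed, f"c{i}", {"a"})

print(naive.depth())      # 101: one Project per withColumn call
print(collapsed.depth())  # 2: one Project over the base relation
```

The loop mirrors the "for loop adding columns" pattern from the description: the naive path produces a tree whose depth grows linearly with the number of calls, which is what makes repeated deep traversals in the analyzer expensive, while the collapsing path keeps the plan at constant depth.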