[ https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717024#comment-16717024 ]
ASF GitHub Bot commented on SPARK-26224:
----------------------------------------

viirya commented on a change in pull request #23285: [SPARK-26224][SQL] Avoid creating many project on subsequent calls to withColumn
URL: https://github.com/apache/spark/pull/23285#discussion_r240592731

File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

{code:java}
@@ -2146,7 +2146,7 @@ class Dataset[T] private[sql](
    * Returns a new Dataset by adding columns or replacing the existing columns that has
    * the same names.
    */
-  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = withPlan {
{code}

Review comment: As stated on the JIRA ticket, the problem is a deep query plan, and there are many ways to create one besides `withColumns`; for example, calling `select` many times does the same. This change makes `withColumns` a special case.

> Results in StackOverflowError when trying to add 3000 new columns using
> withColumn function of dataframe.
> ---------------------------------------------------------------------
>
>                  Key: SPARK-26224
>                  URL: https://issues.apache.org/jira/browse/SPARK-26224
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.3.0
>          Environment: On a MacBook, using the IntelliJ editor. Ran the sample code below as a unit test.
>             Reporter: Dorjee Tsering
>             Priority: Minor
>
> Reproduction steps:
> Run this sample code on your laptop. It tries to add 3000 new columns to a base dataframe with one column.
>
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
>
> val newColumnsToBeAdded: Seq[StructField] =
>   for (i <- 1 to 3000) yield new StructField("field_" + i, DataTypes.LongType)
>
> val baseDataFrame: DataFrame = Seq((1)).toDF("employee_id")
>
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) =>
>   df.withColumn(newColumn.name, lit(0)))
>
> result.show(false)
> {code}
>
> Ends up with the following stack trace:
> {noformat}
> java.lang.StackOverflowError
>     at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>     at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>     at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>     at scala.collection.immutable.List.map(List.scala:296)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> {noformat}

This message was sent by Atlassian JIRA
(v7.6.3#76005)
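As the review comment notes, the overflow comes from the depth of the query plan rather than from `withColumn` specifically: each `withColumn` call wraps the plan in another Project node, and analysis then recurses through all 3000 of them. A common user-side workaround is to build every new column up front and add them in a single `select`, which produces one Project node regardless of the column count. This is a sketch, not verified here, reusing the `baseDataFrame` from the reproduction above:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Sketch of a workaround: add all 3000 columns in one select instead of
// folding withColumn 3000 times. A single Project node is created rather
// than 3000 nested ones, so plan traversal does not recurse 3000 levels.
val newColumnNames: Seq[String] = (1 to 3000).map("field_" + _)

val result: DataFrame = baseDataFrame.select(
  (baseDataFrame.columns.map(col) ++
    newColumnNames.map(name => lit(0L).as(name))): _*
)
```

The same idea is what the batched `withColumns(colNames, cols)` method in the diff above exposes internally; the trade-off discussed in the review is whether to special-case that method when any repeated plan-wrapping call (such as chained `select`) can build an equally deep plan.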