[ https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716892#comment-16716892 ]

ASF GitHub Bot commented on SPARK-26224:
----------------------------------------

cloud-fan commented on a change in pull request #23285: [SPARK-26224][SQL] 
Avoid creating many project on subsequent calls to withColumn
URL: https://github.com/apache/spark/pull/23285#discussion_r240575095
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -2164,16 +2164,16 @@ class Dataset[T] private[sql](
       columnMap.find { case (colName, _) =>
         resolver(field.name, colName)
       } match {
-        case Some((colName: String, col: Column)) => col.as(colName)
-        case _ => Column(field)
+        case Some((colName: String, col: Column)) => col.as(colName).named
+        case _ => field
       }
     }
 
-    val newColumns = columnMap.filter { case (colName, col) =>
+    val newColumns = columnMap.filter { case (colName, _) =>
       !output.exists(f => resolver(f.name, colName))
-    }.map { case (colName, col) => col.as(colName) }
+    }.map { case (colName, col) => col.as(colName).named }
 
-    select(replacedAndExistingColumns ++ newColumns : _*)
 +    CollapseProject(Project(replacedAndExistingColumns ++ newColumns, logicalPlan))
 
 Review comment:
   Can we reduce the scope of this optimization? e.g. if the root node of this 
query is `Project`, update its project list to include `withColumns`, otherwise 
add a new Project.
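
The suggestion above can be sketched with a toy plan tree. This is only an illustration of the idea, not Spark's Catalyst API: `Plan`, `Project`, `Relation`, `output`, and `addColumns` here are hypothetical stand-ins for `LogicalPlan`, `Project`, and the `withColumns` logic in Dataset.scala.

```scala
// Toy logical-plan ADT (stand-in for Catalyst's LogicalPlan hierarchy).
sealed trait Plan
case class Project(projectList: Seq[String], child: Plan) extends Plan
case class Relation(cols: Seq[String]) extends Plan

// Output columns of a plan node.
def output(plan: Plan): Seq[String] = plan match {
  case Project(list, _) => list
  case Relation(cols)   => cols
}

// Narrow-scope optimization: if the root is already a Project, extend its
// project list in place; otherwise add exactly one new Project on top.
// Repeated calls therefore never grow the plan deeper than one Project,
// instead of nesting one Project per withColumn call.
def addColumns(plan: Plan, newCols: Seq[String]): Plan = plan match {
  case Project(list, child) => Project(list ++ newCols, child)
  case other                => Project(output(other) ++ newCols, other)
}
```

With this shape, the reporter's foldLeft of 3000 columns yields a single Project over the base relation rather than 3000 nested ones, so a recursive tree traversal no longer overflows the stack:

```scala
val base: Plan = Relation(Seq("employee_id"))
val grown = (1 to 3000).foldLeft(base)((p, i) => addColumns(p, Seq(s"field_$i")))

def depth(p: Plan): Int = p match {
  case Project(_, c) => 1 + depth(c)
  case _             => 1
}
// depth(grown) == 2, output(grown).length == 3001
```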

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Results in stackOverFlowError when trying to add 3000 new columns using 
> withColumn function of dataframe.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26224
>                 URL: https://issues.apache.org/jira/browse/SPARK-26224
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>         Environment: On macbook, used Intellij editor. Ran the above sample 
> code as unit test.
>            Reporter: Dorjee Tsering
>            Priority: Minor
>
> Reproduction step:
> Run this sample code on your laptop. I am trying to add 3000 new columns to a 
> base dataframe with 1 column.
>  
>  
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
> val newColumnsToBeAdded: Seq[StructField] = for (i <- 1 to 3000) yield 
> StructField("field_" + i, DataTypes.LongType)
> val baseDataFrame: DataFrame = Seq(1).toDF("employee_id")
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) => 
> df.withColumn(newColumn.name, lit(0)))
> result.show(false)
> {code}
> Ends up with following stacktrace:
> java.lang.StackOverflowError
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>  at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>  at scala.collection.immutable.List.map(List.scala:296)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
