[ https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300724#comment-17300724 ]
Abdeali Kothari commented on SPARK-7276:
----------------------------------------

Confirming that even with Spark 2.4 there is a significant slowdown for .withColumn() compared to .select():

||Num Cols||.withColumn() (sec)||.select() (sec)||Ratio||
|10|0.113|0.016|6|
|100|0.733|0.061|11|
|1000|23.123|0.862|26|
|2000|140.796|1.564|90|

PFA code used: [^test.py]

> withColumn is very slow on dataframe with large number of columns
> -----------------------------------------------------------------
>
>                 Key: SPARK-7276
>                 URL: https://issues.apache.org/jira/browse/SPARK-7276
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Alexandre CLEMENT
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 1.4.0
>
>         Attachments: test.py
>
>
> The code snippet demonstrates the problem.
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val sparkConf = new SparkConf().setAppName("Spark Test").setMaster(System.getProperty("spark.master", "local[4]"))
> val sc = new SparkContext(sparkConf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> val custs = Seq(
>   Row(1, "Bob", 21, 80.5),
>   Row(2, "Bobby", 21, 80.5),
>   Row(3, "Jean", 21, 80.5),
>   Row(4, "Fatime", 21, 80.5)
> )
>
> // Field order matches the Row values above: Int, String, Int, Double.
> val fields = List(
>   StructField("id", IntegerType, true),
>   StructField("b", StringType, true),
>   StructField("a", IntegerType, true),
>   StructField("target", DoubleType, false))
> val schema = StructType(fields)
>
> val rdd = sc.parallelize(custs)
> var df = sqlContext.createDataFrame(rdd, schema)
>
> // Each iteration gets progressively slower as the plan grows.
> for (i <- 1 to 200) {
>   val now = System.currentTimeMillis
>   df = df.withColumn("a_new_col_" + i, df("a") + i)
>   println(s"$i -> " + (System.currentTimeMillis - now))
> }
> df.show()
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
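The usual workaround for the pattern benchmarked above is to build all derived columns up front and add them in a single .select(), so the logical plan is constructed and analyzed once instead of once per column. A minimal Scala sketch under the assumptions of the quoted snippet (a DataFrame `df` with an integer column "a"; the `a_new_col_` names mirror the repro) — this illustrates the workaround, not the fix that shipped in 1.4.0:

```scala
import org.apache.spark.sql.functions.col

// Build every derived column expression first (cheap: no plan is created yet),
// then append them all to the existing columns in one select call.
val newCols = (1 to 200).map(i => (col("a") + i).as("a_new_col_" + i))
val dfFast = df.select(col("*") +: newCols: _*)
dfFast.show()
```

This produces the same schema as the 200-iteration withColumn loop, but only one projection is added to the plan, which is consistent with the roughly linear .select() timings in the table above.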