[ https://issues.apache.org/jira/browse/SPARK-20479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-20479:
---------------------------------
    Labels: bulk-closed  (was: )

> Performance degradation for large number of hash-aggregated columns
> --------------------------------------------------------------------
>
>                 Key: SPARK-20479
>                 URL: https://issues.apache.org/jira/browse/SPARK-20479
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Kazuaki Ishizaki
>            Priority: Major
>              Labels: bulk-closed
>
> In a comment on SPARK-20184, [~maropu] showed that performance degrades when the number of aggregated columns grows large with whole-stage codegen enabled.
> {code}
> ./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v
>
> def timer[R](f: => R): Unit = {
>   val count = 9
>   val iters = (0 until count).map { i =>
>     val t0 = System.nanoTime()
>     f
>     val t1 = System.nanoTime()
>     val elapsed = t1 - t0 + 0.0
>     println(s"#$i: ${elapsed / 1000000000.0}")
>     elapsed
>   }
>   println("Elapsed time: " + ((iters.sum / count) / 1000000000.0) + "s")
> }
>
> val numCols = 80
> val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 100000, 1, 1))"
> val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"
>
> // Elapsed time: 2.3084404905555553s
> sql("SET spark.sql.codegen.wholeStage=true")
> timer { sql(sqlStr).collect }
>
> // Elapsed time: 0.527486733s
> sql("SET spark.sql.codegen.wholeStage=false")
> timer { sql(sqlStr).collect }
> {code}
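One way to investigate where the time goes (not part of the original report) is to dump the Java source that whole-stage codegen produces for this query; with 80 aggregated columns the generated HashAggregate methods become very large, which can push them past the JIT compiler's method-size limits. A minimal sketch, assuming the same spark-shell session and the sqlStr defined in the snippet above:

{code}
// Sketch only: inspect the generated code for the wide aggregation above.
// debugCodegen() is provided by the org.apache.spark.sql.execution.debug package.
import org.apache.spark.sql.execution.debug._

sql("SET spark.sql.codegen.wholeStage=true")
val df = sql(sqlStr)

// Prints each whole-stage codegen subtree together with its generated Java source;
// with numCols = 80 the HashAggregate stage emits very long methods.
df.debugCodegen()

// EXPLAIN CODEGEN exposes the same generated source through plain SQL.
sql(s"EXPLAIN CODEGEN $sqlStr").collect().foreach(r => println(r.getString(0)))
{code}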