Kazuaki Ishizaki created SPARK-20479: ----------------------------------------
Summary: Performance degradation for large number of hash-aggregated columns
Key: SPARK-20479
URL: https://issues.apache.org/jira/browse/SPARK-20479
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki

In a comment on SPARK-20184, [~maropu] showed that performance degrades when the number of aggregated columns grows large under whole-stage codegen.

{code}
./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v

// Run f `count` times and report per-iteration and average elapsed time in seconds.
def timer[R](f: => R): Unit = {
  val count = 9
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    println(s"#$i: ${elapsed / 1000000000.0}")
    elapsed
  }
  println("Elapsed time: " + ((iters.sum / count) / 1000000000.0) + "s")
}

val numCols = 80
val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 100000, 1, 1))"
val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"

// Elapsed time: 2.3084404905555553s
sql("SET spark.sql.codegen.wholeStage=true")
timer { sql(sqlStr).collect }

// Elapsed time: 0.527486733s
sql("SET spark.sql.codegen.wholeStage=false")
timer { sql(sqlStr).collect }
{code}
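A minimal sketch for narrowing this down (not part of the original report): dump the Java source that whole-stage codegen emits for the slow query and see how it grows with numCols. It assumes the same spark-shell session as above; debugCodegen comes from Spark's org.apache.spark.sql.execution.debug package, and the val name df is only illustrative.

{code}
// Sketch only: inspect the generated code for the wide aggregate defined above.
import org.apache.spark.sql.execution.debug._

sql("SET spark.sql.codegen.wholeStage=true")
val df = sql(sqlStr)   // sqlStr from the benchmark above
df.explain()           // operators covered by whole-stage codegen are marked with '*' in the plan
df.debugCodegen()      // prints the generated Java source for each whole-stage subtree
{code}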