[ https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998799#comment-14998799 ]
Piotr Niemcunowicz commented on SPARK-4243:
-------------------------------------------

Same happens when one uses HiveContext.

> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
>                 Key: SPARK-4243
>                 URL: https://issues.apache.org/jira/browse/SPARK-4243
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Bojan Kostić
>
> Spark SQL runs slow when using this code:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
> parquetFile.registerTempTable("parquetFile")
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
> count.map(t => t(0)).collect().foreach(println)
> {code}
> But with this query it runs much faster:
> {code}
> SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
> {code}
> Old queries stats by phases:
> 3.2min
> 17s
> New query stats by phases:
> 0.3 s
> 16 s
> 20 s
> Maybe you should also see this query for optimization:
> {code}
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
> FROM parquetFile
> {code}
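
For reference, a minimal sketch of the reported workaround applied through HiveContext. It assumes a Spark 1.x spark-shell session built with Hive support, where `sc` is the predefined SparkContext; the variable name `hiveContext` is only illustrative, and the path, table name, and column `f2` are the ones from the issue description. No timings have been measured for it here.

```scala
// Assumption: run inside a Spark 1.x spark-shell with Hive support,
// where `sc` (the SparkContext) is already defined.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Same Parquet directory and temp table name as in the issue description.
val parquetFile = hiveContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")

// Workaround from the report: deduplicate f2 in a subquery,
// then count the resulting rows in the outer query.
val count = hiveContext.sql(
  "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a")
count.collect().foreach(println)
```

Note that the two forms are not strictly equivalent: COUNT(DISTINCT f2) ignores NULLs, while COUNT(*) over the DISTINCT subquery counts a NULL value of f2 as one row, so the results can differ by one if f2 contains NULLs.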