You might try running:

    SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a

If this works and you open a JIRA, we can try to do this kind of optimization by default pretty easily.

On Fri, Oct 31, 2014 at 10:20 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> The only thing in your code that cannot be parallelized is the collect(),
> because -- by definition -- it collects all the results to the driver node.
> This has nothing to do with the DISTINCT in your query.
>
> What do you want to do with the results after you collect them? How many
> results are in the output of collect()?
>
> Perhaps it makes more sense to continue operating on the RDDs you have, or
> to save them using one of the RDD methods, because that preserves the
> cluster's ability to parallelize work.
>
> Nick
>
> On Friday, October 31, 2014, Bojan Kostic <blood9ra...@gmail.com> wrote:
>
>> While I was testing Spark SQL, I noticed that COUNT DISTINCT runs really
>> slowly. The map-partitions phase finishes fast, but the collect phase is
>> slow: it only runs on a single executor. Should it run this way?
>>
>> Here is the simple code I used for testing:
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
>> parquetFile.registerTempTable("parquetFile")
>> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
>> count.map(t => t(0)).collect().foreach(println)
>>
>> I guess that is because the distinct processing must happen on a single
>> node, but I wonder whether I can add some parallelism to the collect
>> process.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
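Putting the suggestion at the top of the thread into the same form as the original snippet, a minimal sketch might look like the following. It assumes the same `sc`, Parquet path, and column `f2` as in Bojan's message, and is not guaranteed to be the exact plan Spark produces; the subquery alias `a` is arbitrary. The idea is that the inner `SELECT DISTINCT` can be de-duplicated per partition across executors before the shuffle, so only the final `COUNT(*)` is reduced to a single value:

```scala
// Sketch only: requires a running SparkContext (sc) and the same
// Parquet data as in the original message.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")

// Rewritten query: distinct first (parallelizable), then count the
// distinct rows, instead of COUNT(DISTINCT f2) in one step.
val count = sqlContext.sql(
  "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a")

// collect() itself still brings the (single-row) result to the driver,
// which is fine here because the result is just one count.
count.map(t => t(0)).collect().foreach(println)
```

The same effect can presumably be had at the RDD level with `distinct()` followed by `count()`, both of which run in parallel across partitions.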