Re: SQL COUNT DISTINCT
Here is the link on jira: https://issues.apache.org/jira/browse/SPARK-4243
Re: SQL COUNT DISTINCT
Hi Michael,

Thanks for the response. I tested with the query you sent me, and it really does run faster. Old query stats by phases: 3.2 min, 17 s. Your query stats by phases: 0.3 s, 16 s, 20 s.

But will this improvement also apply when you want to count distinct on 2 or more fields:

SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile

Should I still create a Jira issue/improvement for this?

@Nick That also makes sense. But should I just fetch the count of my data to the driver node? I just started learning Spark (and it is great), so sorry if I ask stupid questions or anything like that.

Best regards,
Bojan
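Michael's rewritten query isn't quoted in this thread, but a common rewrite at the time was to nest the distinct, e.g. SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) t, which lets the distinct run as a partial aggregation across partitions; that exact query shape is an assumption here. A plain-Scala sketch (no Spark needed, data made up) of why the two forms agree:

```scala
// COUNT(DISTINCT f2) versus COUNT(*) over a nested SELECT DISTINCT f2:
// both reduce to "number of unique f2 values".
val f2Values = Seq("a", "b", "a", "c", "b", "a")

val distinctValues = f2Values.distinct   // the inner SELECT DISTINCT f2
val nestedCount = distinctValues.size    // the outer COUNT(*)
val directCount = f2Values.toSet.size    // COUNT(DISTINCT f2)

println(nestedCount == directCount) // true: both are 3
```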
Re: SQL COUNT DISTINCT
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic <blood9ra...@gmail.com> wrote:

> But will this improvement also affect when you want to count distinct on 2 or more fields: SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile

Unfortunately, I think this case may be harder for us to optimize, though it could be possible with some work.

> Should i still create Jira issue/improvement for this?

Yes, please.
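A plain-Scala sketch of why the multi-column case is harder: the query above has to carry one distinct-set per column through a single pass over the rows, so the per-partition state is no longer one small set. The Row fields and data below are made up for illustration:

```scala
// One pass over the rows, tracking COUNT(f1) plus a Set per DISTINCT column.
// This is roughly the state the multi-column query requires, which is why
// it is harder to optimize than a single COUNT(DISTINCT).
case class Row(f1: String, f2: String, f3: String, f4: String)
val rows = Seq(
  Row("x", "a", "p", "1"),
  Row("y", "a", "q", "2"),
  Row("x", "b", "p", "1")
)
val init = (0L, Set.empty[String], Set.empty[String], Set.empty[String])
val (countF1, d2, d3, d4) = rows.foldLeft(init) {
  case ((n, s2, s3, s4), r) => (n + 1, s2 + r.f2, s3 + r.f3, s4 + r.f4)
}
// COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
println((countF1, d2.size, d3.size, d4.size)) // (3, 2, 2, 2)
```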
SQL COUNT DISTINCT
While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly. The map-partitions phase finishes fast, but the collect phase is slow: it runs on only a single executor. Should it run this way?

Here is the simple code I use for testing:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)

I guess this is because the distinct processing must happen on a single node. But I wonder whether I can add some parallelism to the collect process.
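The single-executor behavior described above matches all distinct values being merged on one node. A plain-Scala sketch of the parallel-friendly alternative, where each partition builds its own small distinct set before a final merge (the partition contents are illustrative, not from the actual data):

```scala
// Per-partition partial distinct, then a cheap merge of the small sets.
// Only the deduplicated sets cross partition boundaries, not every row.
val partitions = Seq(
  Seq("a", "b", "a"),
  Seq("b", "c"),
  Seq("c", "a", "d")
)
val partials = partitions.map(_.toSet)              // map side: local distinct
val distinctCount = partials.reduce(_ union _).size // reduce side: merge + count
println(distinctCount) // 4 (a, b, c, d)
```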
Re: SQL COUNT DISTINCT
The only thing in your code that cannot be parallelized is the collect(), because -- by definition -- it collects all the results to the driver node. This has nothing to do with the DISTINCT in your query.

What do you want to do with the results after you collect them? How many results are in the output of collect()? Perhaps it makes more sense to continue operating on the RDDs you have, or to save them using one of the RDD methods, because that preserves the cluster's ability to parallelize work.

Nick