Re: SQL COUNT DISTINCT

2014-11-05 Thread Bojan Kostic
Here is the link to the JIRA issue: https://issues.apache.org/jira/browse/SPARK-4243

Re: SQL COUNT DISTINCT

2014-11-03 Thread Bojan Kostic
Hi Michael, thanks for the response. I tested with the query that you sent me, and it runs much faster. Old query stats by phase: 3.2 min, 17 s. Your query stats by phase: 0.3 s, 16 s, 20 s. But will this improvement also apply when you want to count distinct on 2 or more fields: SELECT COUNT(f1),

Re: SQL COUNT DISTINCT

2014-11-03 Thread Michael Armbrust
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic blood9ra...@gmail.com wrote: But will this improvement also apply when you want to count distinct on 2 or more fields: SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile Unfortunately I think this

Re: SQL COUNT DISTINCT

2014-10-31 Thread Nicholas Chammas
The only thing in your code that cannot be parallelized is the collect() because -- by definition -- it collects all the results to the driver node. This has nothing to do with the DISTINCT in your query. What do you want to do with the results after you collect them? How many results do you have