Re: SQL COUNT DISTINCT

2014-11-05 Thread Bojan Kostic
Here is the link to the JIRA issue:
https://issues.apache.org/jira/browse/SPARK-4243




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818p18166.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SQL COUNT DISTINCT

2014-11-03 Thread Bojan Kostic
Hi Michael,
Thanks for the response. I tested the query that you sent me, and it runs
much faster.
Old query, time per phase:
3.2 min
17 s
Your query, time per phase:
0.3 s
16 s
20 s

But will this improvement also apply when you want to count distinct values
on two or more fields?
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
FROM parquetFile

Should I still create a JIRA issue/improvement for this?

@Nick
That also makes sense. But shouldn't I be getting just the count of my data
on the driver node?

I have just started learning Spark (and it is great), so sorry if I ask
stupid questions or anything like that.

Best regards
Bojan







Re: SQL COUNT DISTINCT

2014-11-03 Thread Michael Armbrust
On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic blood9ra...@gmail.com wrote:

 But will this improvement also apply when you want to count distinct values
 on two or more fields?
 SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
 FROM parquetFile


Unfortunately, I think this case may be harder for us to optimize, though it
could be possible with some work.
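
To make the multi-column case concrete, here is a minimal plain-Scala sketch of one way such a query could be evaluated in a single pass: each partition builds a partial aggregate (a counter for f1 plus one set of seen values per DISTINCT column), and the partials are then merged. Everything here is an assumption for illustration: the Row and Partial types, the MultiDistinctSketch object, and the use of nested Seqs in place of real partitions are hypothetical, and this is not a description of how Spark SQL plans the query.

```scala
object MultiDistinctSketch {
  // Hypothetical row type with the four fields from the query; f1 may be NULL.
  case class Row(f1: Option[Int], f2: String, f3: String, f4: String)

  // Partial aggregate: a non-NULL count for f1 and one value set per DISTINCT column.
  case class Partial(f1Count: Long, f2: Set[String], f3: Set[String], f4: Set[String]) {
    // Fold one row into this partial aggregate.
    def add(r: Row): Partial =
      Partial(f1Count + (if (r.f1.isDefined) 1 else 0), f2 + r.f2, f3 + r.f3, f4 + r.f4)

    // Combine the partial aggregates of two partitions.
    def merge(o: Partial): Partial =
      Partial(f1Count + o.f1Count, f2 ++ o.f2, f3 ++ o.f3, f4 ++ o.f4)
  }

  val empty: Partial = Partial(0L, Set.empty, Set.empty, Set.empty)

  // Returns (COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)).
  def aggregate(partitions: Seq[Seq[Row]]): (Long, Int, Int, Int) = {
    // Step 1: aggregate each partition independently (this part can run in parallel).
    val partials = partitions.map(_.foldLeft(empty)(_.add(_)))
    // Step 2: merge the per-partition results.
    val merged = partials.foldLeft(empty)(_.merge(_))
    (merged.f1Count, merged.f2.size, merged.f3.size, merged.f4.size)
  }
}
```

Note that the final merge in this sketch still gathers every distinct value of all three columns in one place, so on its own it does not remove the single-node step.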


 Should I still create a JIRA issue/improvement for this?


Yes please.


Re: SQL COUNT DISTINCT

2014-10-31 Thread Nicholas Chammas
The only thing in your code that cannot be parallelized is the collect()
because -- by definition -- it collects all the results to the driver node.
This has nothing to do with the DISTINCT in your query.

What do you want to do with the results after you collect them? How many
results do you have in the output of collect?

Perhaps it makes more sense to continue operating on the RDDs you have or
saving them using one of the RDD methods, because that preserves the
cluster's ability to parallelize work.
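
The idea of keeping the work distributed and sending only a small result to the driver can be illustrated with a plain-Scala simulation of a two-phase distinct count. This is a sketch under stated assumptions: nested Seqs stand in for RDD partitions, the names TwoPhaseDistinctCount, shuffleByHash, and countDistinct are made up for this example, and it does not claim to show what Spark SQL does internally.

```scala
object TwoPhaseDistinctCount {
  // Phase 1: drop duplicates inside each input partition, then route every
  // surviving value to exactly one reducer chosen by its hash code.
  def shuffleByHash(partitions: Seq[Seq[String]], numReducers: Int): Seq[Set[String]] = {
    val localDistinct = partitions.flatMap(_.distinct)
    (0 until numReducers).map { r =>
      localDistinct.filter(v => Math.floorMod(v.hashCode, numReducers) == r).toSet
    }
  }

  // Phase 2: each reducer counts its own distinct values. Because equal values
  // always land on the same reducer, the per-reducer counts can simply be summed,
  // and the driver only ever sees numReducers small numbers.
  def countDistinct(partitions: Seq[Seq[String]], numReducers: Int): Long =
    shuffleByHash(partitions, numReducers).map(_.size.toLong).sum
}
```

With this layout the only data that reaches the driver is one number per reducer, which is the kind of small result that is cheap to collect.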

Nick

On Friday, October 31, 2014, Bojan Kostic blood9ra...@gmail.com wrote:

 While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly.
 The map partitions phase finishes fast, but the collect phase is slow.
 It runs on only a single executor.
 Is it supposed to run this way?

 Here is the simple code that I use for testing:
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
 parquetFile.registerTempTable("parquetFile")
 val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
 count.map(t => t(0)).collect().foreach(println)

 I guess that, because of the DISTINCT, the processing must happen on a
 single node. But I wonder whether I can add some parallelism to the
 collect phase.


