You might try running:

"SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a"

If this works and you open a JIRA, we can try to do this kind of
optimization by default pretty easily.
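For reference, a minimal sketch of how the suggested rewrite would slot into the snippet quoted below (the table name and the `f2` field come from the original message; the surrounding `sqlContext` setup is assumed and the Spark calls are commented out since they need a running cluster):

```scala
// A sketch of the suggested rewrite: push the DISTINCT into a subquery so the
// de-duplication can run as a parallel aggregation before the final count.
// "parquetFile" is the temp table registered in the quoted message below.
val rewritten = "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a"

// val count = sqlContext.sql(rewritten)           // distinct runs across executors
// count.map(t => t(0)).collect().foreach(println) // only the final count reaches the driver
```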

On Fri, Oct 31, 2014 at 10:20 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> The only thing in your code that cannot be parallelized is the collect()
> because -- by definition -- it collects all the results to the driver node.
> This has nothing to do with the DISTINCT in your query.
>
> What do you want to do with the results after you collect them? How many
> results do you have in the output of collect?
>
> Perhaps it makes more sense to continue operating on the RDDs you have or
> saving them using one of the RDD methods, because that preserves the
> cluster's ability to parallelize work.
>
> Nick
>
> On Friday, October 31, 2014, Bojan Kostic <blood9ra...@gmail.com> wrote:
>
>> While testing Spark SQL, I noticed that COUNT DISTINCT runs really slowly.
>> The map-partitions phase finishes fast, but the collect phase is slow:
>> it runs on only a single executor.
>> Should it run this way?
>>
>> And here is the simple code I use for testing:
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
>> parquetFile.registerTempTable("parquetFile")
>> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
>> count.map(t => t(0)).collect().foreach(println)
>>
>> I guess that is because the distinct step has to run on a single node. But I
>> wonder: can I add some parallelism to the collect process?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
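The two-phase shape that the nested-DISTINCT rewrite enables can be illustrated with plain Scala collections (no Spark involved; the data is made up): each "partition" computes its local distinct independently, and only the shrunken partial results are merged for the final count.

```scala
// Plain-Scala illustration (not Spark API) of partial-distinct aggregation.
// Each inner list stands in for one partition's values of the f2 column.
val partitions = List(List("a", "b", "a"), List("b", "c"), List("c", "a"))

// Phase 1: distinct within each partition -- parallelizable, shrinks data early.
val partials = partitions.map(_.distinct)

// Phase 2: merge the small partials and count the global distinct values.
val distinctCount = partials.flatten.distinct.size
// distinctCount == 3 ("a", "b", "c")
```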
