The only thing in your code that cannot be parallelized is the collect()
because -- by definition -- it collects all the results to the driver node.
This has nothing to do with the DISTINCT in your query.

What do you want to do with the results after you collect them? How many
results do you have in the output of collect?

Perhaps it makes more sense to continue operating on the RDDs you have, or
to save them using one of the RDD methods, because that preserves the
cluster's ability to parallelize work.
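For example, here is a sketch of what that could look like (the input path is
taken from your snippet; the output path is hypothetical): instead of calling
collect() and printing on the driver, write the result out from the executors.

```scala
// Sketch only: assumes a running SparkContext `sc` and the Spark 1.x API
// used in your snippet. The output path is illustrative.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")

// saveAsTextFile writes one part-file per partition, in parallel on the
// executors, so the full result never has to fit on the driver node.
count.map(t => t(0)).saveAsTextFile("/bojan/test/distinct-count-output/")
```

For a single aggregate value like this COUNT the difference is small, but for
large result sets it is the difference between a driver OOM and a normal job.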

Nick

On Friday, October 31, 2014, Bojan Kostic <blood9ra...@gmail.com> wrote:

> While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly.
> The map-partitions phase finishes fast, but the collect phase is slow.
> It only runs on a single executor.
> Should it run this way?
>
> And here is the simple code which i use for testing:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
> parquetFile.registerTempTable("parquetFile")
> val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
> count.map(t => t(0)).collect().foreach(println)
>
> I guess that is because the distinct step must run on a single node. But I
> wonder whether I can add some parallelism to the collect process.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
