Hi Xiangrui,

Thanks for the guidance. I read the log carefully and found the root cause. 

MLlib's KMeans uses k-means|| (the parallel variant of k-means++) as the default
initialization mode. According to the log file, the roughly 70-minute hang is
actually the time spent in this initialization step, as pasted below:

14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at
KMeans.scala:293) finished in 2.233 s
14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at
KMeans.scala:293, took 85.590020124 s
14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle
5 for deleting
14/10/14 *14:48:18* INFO ContextCleaner: Cleaned shuffle 5
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
*14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11
iterations.
14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913
seconds.*
14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at
KMeans.scala:190
14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at
KMeans.scala:190)
14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at
KMeans.scala:190) with 100 output partitions (allowLocal=false)
14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at
KMeans.scala:190)
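
For context, here is roughly how I was invoking KMeans (a minimal sketch, not my
exact code; k, the iteration count, and the input path are placeholders). Since I
never set an initialization mode, it falls back to the default, k-means||:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder path and parsing; each line is a space-separated feature vector.
val data = sc.textFile("hdfs:///path/to/vectors")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// No setInitializationMode(...) call, so the default "k-means||" is used;
// that is the step that took ~4400 seconds in the log above.
val model = new KMeans()
  .setK(500)             // placeholder k
  .setMaxIterations(20)  // placeholder
  .run(data)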



I switched the KMeans initialization mode to "random", with all other
configurations unchanged, and this time the job finished quickly.
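
The only change was the initialization mode; a sketch with the same placeholder
parameters as above:

val model = new KMeans()
  .setK(500)             // placeholder k, same as above
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM)  // "random" instead of the default "k-means||"
  .run(data)

(I haven't tried it, but setInitializationSteps looks like another knob if one
wants to keep k-means|| and just make the initialization cheaper.)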

In your test on mnist8m, did you use k-means|| as the initialization mode? How
long did it take?

Thanks again for your help.

Ray
