Hi Xiangrui,

Thanks for the guidance. I read the log carefully and found the root cause.
KMeans uses k-means|| (a parallelized variant of k-means++) as its default initialization mode. According to the log file, the ~70-minute "hang" is actually the k-means|| initialization time. Note the gap between the ContextCleaner line at 14:48:18 and the BLAS warnings at 15:50:41, and the initialization timing reported at 15:54:36:

14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at KMeans.scala:293) finished in 2.233 s
14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at KMeans.scala:293, took 85.590020124 s
14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle 5 for deleting
14/10/14 14:48:18 INFO ContextCleaner: Cleaned shuffle 5
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11 iterations.
14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913 seconds.
14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at KMeans.scala:190
14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at KMeans.scala:190)
14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at KMeans.scala:190) with 100 output partitions (allowLocal=false)
14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at KMeans.scala:190)

I now use "random" as the KMeans initialization mode, with all other configuration unchanged, and this time the job finished quickly.

In your test on mnist8m, did you use k-means|| as the initialization mode? How long did it take?

Thanks again for your help.

Ray
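P.S. In case it helps anyone else who lands on this thread: below is roughly how the initialization mode is switched with the RDD-based MLlib API. This is only a sketch, not my exact job; the input path, parsing, and parameter values (k, max iterations) are placeholders, and sc is the usual SparkContext from the shell.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one whitespace-separated feature vector per line.
val data = sc.textFile("hdfs:///path/to/data")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// KMeans.RANDOM ("random") skips the expensive k-means|| initialization;
// the default is KMeans.K_MEANS_PARALLEL ("k-means||").
val model = new KMeans()
  .setK(100)                 // placeholder k
  .setMaxIterations(20)      // placeholder iteration cap
  .setInitializationMode(KMeans.RANDOM)
  .run(data)

println(s"Training cost: ${model.computeCost(data)}")

With random initialization the centers are just sampled points, so the first iterations may converge more slowly, but it avoids the long k-means|| init phase seen in the log above.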