I have put more detail about my problem at http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
I would really appreciate it if you could help me take a look at this problem. I have tried various settings and ways to load/partition my data, but I just cannot get rid of that long pause.

Thanks,
David

Xi Shen
about.me/davidshen <http://about.me/davidshen>

On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshe...@gmail.com> wrote:

> Yes, I have done repartition.
>
> I tried to repartition to the number of cores in my cluster. Not helping...
> I tried to repartition to the number of centroids (k value). Not helping...
>
> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com> wrote:
>
>> Can you try specifying the number of partitions when you load the data to
>> equal the number of executors? If your ETL changes the number of
>> partitions, you can also repartition before calling KMeans.
>>
>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a large data set, and I expect to get 5000 clusters.
>>>
>>> I load the raw data and convert it into DenseVector; then I repartition
>>> and cache; finally I give the RDD[Vector] to KMeans.train().
>>>
>>> The job is now running and the data is loaded, but according to the
>>> Spark UI, all of the data is loaded onto one executor. I checked that
>>> executor: its CPU workload is very low, so I think it is using only 1 of
>>> its 8 cores. The other 3 executors are idle.
>>>
>>> Did I miss something? Is it possible to distribute the workload to all 4
>>> executors?
>>>
>>> Thanks,
>>> David
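
For other readers of this thread: Joseph's suggestion (set the partition count at load time, or repartition after ETL, before training) can be sketched roughly as below. This is a minimal sketch against the spark.mllib RDD API, assuming an existing SparkContext `sc`; the input path, the parsing logic, and the 4-executor / 8-core counts are illustrative placeholders taken from the cluster described in this thread, not confirmed code from the original poster.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val numExecutors = 4
val coresPerExecutor = 8
val numPartitions = numExecutors * coresPerExecutor

// Option 1: ask for enough partitions when loading the raw data.
val raw = sc.textFile("hdfs:///path/to/data", numPartitions)

val vectors = raw
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  // Option 2: repartition after ETL, in case a shuffle or coalesce
  // upstream collapsed the data onto fewer partitions.
  .repartition(numPartitions)
  .cache()

// k = 5000 clusters, 20 iterations (maxIterations is a placeholder choice).
val model = KMeans.train(vectors, 5000, 20)
```

With only one partition, all of KMeans' per-partition work lands on a single task on a single executor core, which matches the symptom described above; spreading the RDD across one partition per core lets every executor participate.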