Hi,

I have a large data set, and I expect to get about 5000 clusters.

I load the raw data, convert each record into a DenseVector, then repartition
and cache the RDD; finally I pass the RDD[Vector] to KMeans.train().
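For reference, the pipeline looks roughly like this (a simplified sketch;
the input path, the CSV parsing, and the partition count are placeholders,
and sc is the SparkContext):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Load the raw text data and parse each line into a dense vector.
    val raw = sc.textFile("hdfs:///path/to/data")   // placeholder path
    val vectors = raw.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    // Repartition to spread the data across the cluster, then cache it.
    val data = vectors.repartition(32).cache()      // placeholder partition count

    // Train k-means with 5000 clusters.
    val model = KMeans.train(data, k = 5000, maxIterations = 20)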

The job is now running and the data has been loaded, but according to the
Spark UI, all of the data is cached on a single executor. I checked that
executor and its CPU load is very low; it seems to be using only 1 of its
8 cores. The other 3 executors are idle.

Did I miss something? Is it possible to distribute the workload across all 4
executors?


Thanks,
David
