Hi,

I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
clustering with Strings.
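
For reference, the general shape of what I'm doing (a simplified sketch, not my actual job — sample data and feature size are made up): MLlib's KMeans only accepts numeric Vectors, so I hash each string's terms into a fixed-size vector first.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

object StringKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("string-kmeans-sketch"))
    // Toy stand-in for the real training data.
    val strings = sc.parallelize(Seq("spark kmeans", "spark cluster", "hello world"))
    // Hash each string's terms into a 1000-dimensional term-frequency vector.
    val tf = new HashingTF(1000)
    val vectors = strings.map(s => tf.transform(s.split(" ").toSeq)).cache()
    // Train with k = 2 clusters and at most 10 iterations.
    val model = KMeans.train(vectors, 2, 10)
    println(model.clusterCenters.length)
    sc.stop()
  }
}
```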

My code works great when I use tens of thousands of training elements.
However, with, for example, 2 million elements, it gets extremely slow: a
single stage may take up to 30 minutes.

From the Web UI, I can see that it does these three things repeatedly:


All of these tasks use only one executor, and on that executor only one
core. I can also see a scheduler delay of about 25 seconds.

I tried to use broadcast variables to speed this up, but maybe I'm using
them wrong. The relevant code (where it gets slow) is this:




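To make the question concrete, this is a simplified version of the broadcast pattern I'm attempting (placeholder names and toy data, not my actual job): a small lookup table is broadcast once and then read on the workers while building feature vectors.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch"))
    // Small shared lookup table: term -> feature index.
    val dictionary = Map("a" -> 0, "b" -> 1, "c" -> 2)
    // Broadcast ships it once per executor instead of once per task.
    val bcDict = sc.broadcast(dictionary)
    val data = sc.parallelize(Seq("a b", "b c"))
    val vectors = data.map { line =>
      val dict = bcDict.value // read-only access on the workers
      val indices = line.split(" ").flatMap(dict.get)
      Vectors.sparse(dict.size, indices.map(i => (i, 1.0)).toSeq)
    }
    vectors.collect().foreach(println)
    sc.stop()
  }
}
```

Is that the intended way to use a broadcast variable inside the map that feeds KMeans?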
What could I do to use more executors, and generally speed this up?
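
One thing I've been wondering about, for what it's worth, is whether I need to repartition the training RDD explicitly so the work isn't stuck on a single partition. Roughly this (a sketch with assumed names, and the multiplier of 3 is a guess):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("repartition-sketch"))
    val data = sc.parallelize(1 to 1000000)
    // Spread the records across more partitions so every core gets a slice,
    // and cache so KMeans' repeated passes don't recompute the input.
    val spread = data.repartition(sc.defaultParallelism * 3).cache()
    println(spread.partitions.length)
    sc.stop()
  }
}
```

Would that be the right lever here, or is the bottleneck elsewhere?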



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
