Hi, I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic clustering with Strings.
My code works great with a five-figure number of training elements. However, with, for example, 2 million elements, it gets extremely slow: a single stage may take up to 30 minutes. From the Web UI, I can see that it repeats the same three steps over and over. All of these tasks use only one executor, and only one core on that executor, and I can see a scheduler delay of about 25 seconds. I tried to use broadcast variables to speed this up, but maybe I'm using them wrong.

The relevant code (where it gets slow) is this:

What could I do to use more executors, and generally speed this up?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
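The poster's own snippet did not survive the archive, so as a hedged illustration of one commonly suggested direction for the question above, here is a minimal, hypothetical sketch (invented data and object name; assumes Spark with MLlib on the classpath) of calling `KMeans.train` after repartitioning and caching the input RDD, which is the usual first step toward spreading the iterative work over more tasks and executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical example, not the poster's code.
object KMeansRepartitionSketch {
  def run(): Int = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))
    try {
      // Toy numeric feature vectors; how the post vectorizes its Strings is not shown.
      val data = sc.parallelize(Seq(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
        Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

      // Repartition so each iteration runs as many parallel tasks as there are
      // available cores, and cache because k-means re-reads the input every iteration.
      val training = data.repartition(sc.defaultParallelism).cache()

      val model = KMeans.train(training, 2, 20) // k = 2, up to 20 iterations
      model.k
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the input RDD sits in a single partition (e.g. it came from a non-splittable source), every stage of the iterative k-means job runs as one task on one core, which matches the single-executor behavior described above.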