Thanks, setting the number of partitions to match the number of executors helped a lot; training with 20k entries is now much faster.
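For reference, a minimal sketch of what that looks like with MLlib's `KMeans.train` (the file path, executor count, `k`, and iteration count below are placeholders I've assumed, not values from this thread):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("kmeans-partitioning"))

// Assumed input: one comma-separated feature vector per line.
val raw = sc.textFile("hdfs:///path/to/training-data")
val points = raw.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// Match the partition count to the number of executors, as described above.
val numExecutors = 8 // assumption; substitute your cluster's executor count
val partitioned = points.repartition(numExecutors).cache()

// k and maxIterations are illustrative values only.
val model = KMeans.train(partitioned, 10, 20)
```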
However, when I tried training with 1M entries, after about 45 minutes of calculations I get this:

It's stuck at this point. The CPU load on the master is at 100% (so 1 of 8 cores is used), but the WebUI shows no active task, and after 30 more minutes with no visible change I had to leave for an appointment. I've never seen an error referring to this library before. Could that be caused by the new partitioning?

Edit: Just before sending, I noticed in a new test that this error also appears when the amount of test data is very low (here, 500 items). This time it includes a Java stacktrace, though, instead of just stopping:

So, to sum it up: KMeans.train works somewhere in between 10k and 200k items, but not outside that range. Can you think of an explanation for this behavior?

Best regards,
Simon

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407p9508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.