Hi Xiangru, thanks for the explanation.
1. You said we have to broadcast m * k centers (with m = number of rows). I thought there are only k centers at any given time, which would have size n * k (with n = number of columns) and would need to be broadcast. Is that a typo, or did I misunderstand something? Also, the collection of the averages is partition-wise, so more partitions means more overhead, but essentially the same number of operations?

2. I have 5 executors with 8 CPU cores and 25 GB of memory each, and I usually split the input RDD into 80 partitions for a few gigabytes of input data. Is there a rule of thumb for the number of partitions in relation to the input size?

3. Suppose I don't use numeric data but instead convert text data into a numeric representation using a dictionary and a featurization function (rough sketch below). The number of columns would then be the number of entries in my dictionary (i.e. the number of distinct words in my case). I would of course use a sparse vector representation, but even so, with a few hundred thousand entries and therefore columns, the broadcasting overhead gets very large, since the centers are still in a dense representation. Do you know of any way to improve performance in that case?

Best regards,
Simon
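P.S. In case it helps to be concrete, this is roughly what I mean by "dictionary and featurization function" in question 3. It is only a minimal sketch with placeholder names (featurize, toFeatures, docs and dictionary are not my actual code), and in my real job the dictionary would be a broadcast variable rather than a closure:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// dictionary: each distinct word -> column index; the vector dimension is the
// dictionary size (a few hundred thousand columns in my case).
def featurize(doc: Seq[String], dictionary: Map[String, Int]): Vector = {
  val termCounts = doc
    .flatMap(dictionary.get)                  // drop words not in the dictionary
    .groupBy(identity)
    .map { case (idx, occurrences) => (idx, occurrences.size.toDouble) }
    .toSeq
  Vectors.sparse(dictionary.size, termCounts) // only non-zero counts are stored
}

// docs: tokenized documents; the resulting RDD[Vector] is what I feed to KMeans
def toFeatures(docs: RDD[Seq[String]], dictionary: Map[String, Int]): RDD[Vector] =
  docs.map(featurize(_, dictionary))

The resulting RDD[Vector] goes into KMeans.train; as I understand it, the k centers broadcast at each iteration are dense vectors of that same dimension, which is where the overhead I am worried about comes from.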