As a source, I have a text file with n rows, each containing m comma-separated integers. Each row is then converted into a feature vector with m features.
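In case the setup matters, the conversion and training look roughly like this (a simplified sketch using the RDD-based MLlib API; the file path, k and maxIterations are just placeholder values, not the ones I actually use):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansFromCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansFromCsv"))

    // One line per row: m comma-separated integers -> one dense vector of length m
    val vectors = sc.textFile("features.csv") // placeholder path
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // k and maxIterations are arbitrary example values
    val model = KMeans.train(vectors, 10, 20)
    println(s"Training cost: ${model.computeCost(vectors)}")

    sc.stop()
  }
}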
I've noticed that, given the same total file size and total number of feature values, a larger number of columns is much more expensive for training a KMeans model than a larger number of rows. To give an example: 10k rows x 1k columns took 21 seconds on my cluster, whereas 1k rows x 10k columns took 1 min 47 s. Both files were 238 MB.

Can someone explain what in the implementation of KMeans causes large vectors to be so much more expensive than having many of these vectors? A pointer to the exact part of the source would be fantastic, but even a general explanation would help me.

Best regards,
Simon