If you have a dataset with n rows and m columns (i.e., m features per
row, as in your setup) and train a k-means model with k clusters, the
compute cost for each iteration is

O(m * n * k) (assuming dense data)
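
For example, with the numbers from your test below, the raw
distance-computation count per iteration is the same in both cases:

  10,000 rows * 1,000 cols * k = 10^7 * k
   1,000 rows * 10,000 cols * k = 10^7 * k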

Since m * n * k == n * m * k, ideally you would expect the same run
time. However:

1. Communication. We need to broadcast the current centers (k vectors
of dimension m, i.e., m * k values), do the computation, and then
collect the partial center sums from each partition (again m * k).
Having a large m (#cols) therefore increases the communication cost.

2. Load. How many partitions did you use? With 10k rows the rows may
be distributed fairly evenly, but with only 1000 rows (each of them
large) you may end up with, for example, 400 rows on one partition and
200 rows on each of the other three. The run time of an iteration is
then determined by the largest partition; see the sketch after this
list for a quick way to check and rebalance that.
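
Here is a minimal Scala sketch of how you could inspect the row
distribution and rebalance it before training. It assumes the
comma-separated text layout from your message; the path "data.csv",
the 8 partitions, and k = 10 / 20 iterations are just placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansPartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansPartitionCheck"))

    // Parse each line of comma-separated integers into a dense feature
    // vector ("data.csv" is a placeholder path).
    val data = sc.textFile("data.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    // Count rows per partition. A skewed distribution means the largest
    // partition dominates every iteration (point 2 above).
    val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
    println("rows per partition: " + sizes.mkString(", "))

    // Rebalance and cache before training. Note that every iteration
    // still broadcasts the k current centers (k * m doubles) and
    // aggregates partial sums of the same size, so wide rows (large m)
    // keep the communication cost high (point 1 above).
    val balanced = data.repartition(8).cache()
    val model = KMeans.train(balanced, 10, 20) // k = 10, maxIterations = 20
    println("training cost: " + model.computeCost(balanced))

    sc.stop()
  }
}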

Hopefully this explains your observation.

Best,
Xiangru

On Thu, Jul 24, 2014 at 2:30 PM, durin <m...@simon-schaefer.net> wrote:
> As a source, I have a textfile with n rows that each contain m
> comma-separated integers.
> Each row is then converted into a feature vector with m features.
>
> I've noticed that, given the same total file size and total number of
> values, a larger number of columns is much more expensive for training a
> KMeans model than a larger number of rows.
>
> To give an example:
> 10k rows X 1k columns took 21 seconds on my cluster, whereas 1k rows X 10k
> columns took 1min47s. Both files had a size of 238M.
>
> Can someone explain what in the implementation of KMeans causes large
> vectors to be so much more expensive than having many of these vectors?
> A pointer to the exact part of the source would be fantastic, but even a
> general explanation would help me.
>
>
> Best regards,
> Simon
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-vectors-tp10614.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
