Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default broadcast mechanism. The driver holds the data and the executors request it via HTTP, which makes the driver the bottleneck when the data is large. -Xiangrui
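As a footnote for readers on 1.0.x: the broadcast implementation is pluggable through configuration, so you can opt into the torrent implementation without moving to master. A minimal sketch, assuming the 1.x config key spark.broadcast.factory (check the docs for your exact version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kmeans")  // illustrative app name
      // HttpBroadcastFactory is the 1.0.x default; opt into torrent:
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.TorrentBroadcastFactory")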

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon
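For anyone following along, the high-level MLlib API being exercised here looks roughly like this (the path and the parameter values are placeholders, not Simon's actual setup):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse one dense feature vector per line; placeholder input path.
    val data = sc.textFile("hdfs:///path/to/features")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()  // k-means makes many passes over the data

    val model = KMeans.train(data, 100, 20)  // k = 100, maxIterations = 20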

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant that in the n (1k) by m (10k) case, we need to broadcast k centers of dimension m, and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2. …
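A minimal sketch of the pattern being described (findClosest is a hypothetical distance helper, not an MLlib internal): each iteration broadcasts the current centers once, and with torrent broadcast the executors fetch the blocks from each other instead of all pulling from the driver.

    import org.apache.spark.mllib.linalg.Vector

    // `data: RDD[Vector]` and `centers: Array[Vector]` assumed in scope.
    val bcCenters = data.sparkContext.broadcast(centers)  // once per iteration
    val assignments = data.mapPartitions { points =>
      val cs = bcCenters.value             // fetched per executor, not per record
      points.map(p => findClosest(cs, p))  // hypothetical helper
    }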

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, thanks for the explanation. 1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any time, which would then have a size of n * k and would need to be broadcast. Is that a typo or did I misunderstand something? And the collecti…

Re: KMeans: expensiveness of large vectors

2014-07-27 Thread Xiangrui Meng
If you have an m-by-n dataset and train a k-means model with k clusters, the cost of each iteration is O(m * n * k) (assuming dense data). Since m * n * k == n * m * k, ideally you would expect the same run time. However: 1. Communication. We need to broadcast the current centers (m * k), do the computati…
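A back-of-the-envelope sketch of the two shapes discussed in this thread (the values of k and the matrix sizes are illustrative only): the per-iteration FLOP count is identical, but the payload shipped to every worker each iteration scales with the dimensionality of the centers.

    // Same compute cost, very different broadcast payloads.
    val k = 100                       // number of clusters (illustrative)
    // Case A: 10,000 rows x 1,000 features -> centers are 1,000-dimensional.
    val flopsA = 10000L * 1000L * k   // O(rows * cols * k)
    val bytesA = 1000L * k * 8L       // k centers of 1k doubles ~ 0.8 MB
    // Case B: 1,000 rows x 10,000 features -> centers are 10,000-dimensional.
    val flopsB = 1000L * 10000L * k   // identical product
    val bytesB = 10000L * k * 8L      // k centers of 10k doubles ~ 8 MB
    assert(flopsA == flopsB)          // equal compute per iteration,
                                      // but 10x the data to broadcast in case B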