Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here; that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first?

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request the data via HTTP, making the driver the bottleneck if the data is large. -Xiangrui
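[Editor's note: a minimal sketch of the mechanism discussed above, not code from the thread. In the Spark 1.x era the broadcast implementation could be selected via the spark.broadcast.factory setting; the center dimensions (k = 10, m = 10,000) are illustrative values only.]

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastFactoryExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("broadcast-example")
          // Torrent broadcast: executors exchange blocks with each other,
          // so the driver is not the single source of the data.
          .set("spark.broadcast.factory",
               "org.apache.spark.broadcast.TorrentBroadcastFactory")
        val sc = new SparkContext(conf)

        // Broadcast k centers of dimension m once; every task reads the same copy.
        val centers: Array[Array[Double]] = Array.fill(10)(Array.fill(10000)(0.0))
        val bcCenters = sc.broadcast(centers)

        // Tasks access the broadcast value instead of pulling it from the driver.
        val data = sc.parallelize(1 to 1000000)
        val touched = data.map(_ => bcCenters.value.length).count()
        println(s"processed $touched records")
        sc.stop()
      }
    }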

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, thanks for the explanation. 1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any time, which would then have a size of n * k and would need to be broadcast. Is that a typo or did I misunderstand something? And the

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant that in the n (1k) by m (10k) case, we need to broadcast k centers, and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2.
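[Editor's note: a back-of-the-envelope sketch of the payload size described above, for k dense centers of dimension m stored as doubles (8 bytes each). The thread only fixes m = 10,000; the value of k is an example.]

    val m = 10000        // number of features per vector (from the thread)
    val k = 500          // number of clusters (illustrative value)
    val bytesPerDouble = 8L
    val payloadBytes = m.toLong * k * bytesPerDouble
    // ~40 MB shipped to every worker on each broadcast of the centers
    println(f"centers payload ~ ${payloadBytes / 1e6}%.1f MB per broadcast round")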

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon
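[Editor's note: a minimal sketch of the kind of job being described, i.e. training MLlib KMeans on dense vectors with a master build that uses torrent broadcast. The input path, k, and iteration count are placeholders, not values from the thread.]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansDenseExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-dense"))

        // Parse whitespace-separated dense feature vectors, one row per line.
        val data = sc.textFile("hdfs:///path/to/dense-features")
          .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
          .cache()

        // Each iteration broadcasts the current k centers to the workers;
        // with torrent broadcast this no longer bottlenecks on the driver.
        val model = KMeans.train(data, 100, 20) // k = 100, maxIterations = 20
        println(s"Within-set sum of squared errors: ${model.computeCost(data)}")
        sc.stop()
      }
    }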

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui