Before torrent, HTTP was the default broadcast mechanism. The driver
held the data and the executors requested it via HTTP, which made the
driver the bottleneck when the data was large. -Xiangrui
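
To make the bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not Spark code: the function names, the payload size, and the assumption that a torrent-style broadcast lets the driver serve the payload only about once (with executors fetching the remaining blocks from each other) are all illustrative simplifications.

```python
def driver_bytes_http(payload_bytes, num_executors):
    """Bytes the driver serves when every executor fetches the full payload from it."""
    return payload_bytes * num_executors


def driver_bytes_torrent(payload_bytes, num_executors):
    """Idealized torrent-style broadcast (assumption): the driver serves the
    payload roughly once and executors fetch the rest from their peers."""
    return payload_bytes if num_executors > 0 else 0


if __name__ == "__main__":
    # Hypothetical payload: 100 centers with 10,000 float64 features each.
    payload = 8 * 10_000 * 100
    for executors in (1, 10, 100):
        print(executors,
              driver_bytes_http(payload, executors),
              driver_bytes_torrent(payload, executors))
```

In this simplified model the driver's outbound traffic grows linearly with the number of executors under HTTP, while it stays roughly constant with torrent.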
On Tue, Jul 29, 2014 at 10:32 AM, durin wrote:
Development is really rapid here, that's a great thing.
Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-v
Great! Thanks for testing the new features! -Xiangrui
On Mon, Jul 28, 2014 at 8:58 PM, durin wrote:
Hi Xiangrui,
using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes in a reasonable time. I guess using torrent helps a lot in this
case.
Best regards,
Simon
1. I meant in the n (1k) by m (10k) case, we need to broadcast k
centers and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to workers, which
should be much faster.
2.
Hi Xiangrui,
thanks for the explanation.
1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at any time, which would then have a size of
n * k and would need to be broadcast. Is that a typo, or did I misunderstand
something?
And the collecti
If you have an m-by-n dataset and train a k-means model with k clusters, the
cost for each iteration is
O(m * n * k) (assuming dense data)
Since m * n * k == n * m * k, ideally you would expect the same run
time. However,
1. Communication. We need to broadcast current centers (m * k), do the
computati
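
The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: the helper names and k = 100 are assumptions, and the two shapes (10k rows by 1k features, and 1k rows by 10k features) are the ones discussed in this thread.

```python
def iteration_cost(rows, features, k):
    """Dense k-means work per iteration, proportional to rows * features * k."""
    return rows * features * k


def centers_size(features, k):
    """Number of values broadcast per iteration: k centers, one value per feature."""
    return features * k


if __name__ == "__main__":
    k = 100  # hypothetical number of clusters
    tall = (10_000, 1_000)   # 10k rows, 1k features
    wide = (1_000, 10_000)   # 1k rows, 10k features
    for rows, features in (tall, wide):
        print(rows, features,
              iteration_cost(rows, features, k),
              centers_size(features, k))
```

Both shapes have the same per-iteration compute cost, but the wide dataset broadcasts ten times as many center values, which is where the communication cost diverges.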