In Spark/MLlib, task serialization of large objects such as the cluster
centers in k-means was replaced by broadcast variables for performance
reasons. You can refer to this PR: https://github.com/apache/spark/pull/1427
Also, the current k-means implementation in MLlib benefits from sparse
vector computation.
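To make the sparse-vector point concrete, here is a small sketch in plain Python (not the MLlib code; the helper name and example data are made up). The idea is that with precomputed squared norms, the squared distance from a sparse point to a center only needs to touch the point's non-zero entries:

```python
# Hypothetical sketch: squared distance using the expansion
#   ||x - c||^2 = ||x||^2 - 2 * (x . c) + ||c||^2
# so the per-point cost scales with the number of non-zeros in x,
# not the full dimension.

def sparse_sq_dist(x, x_norm_sq, center, center_norm_sq):
    """x is a sparse point as a dict {index: value}; center is dense."""
    dot = sum(v * center[i] for i, v in x.items())  # non-zeros only
    return x_norm_sq - 2.0 * dot + center_norm_sq

# A 6-dimensional point with two non-zero entries, and a dense center.
x = {1: 3.0, 4: 4.0}
x_norm_sq = sum(v * v for v in x.values())    # 25.0
center = [0.0, 3.0, 0.0, 0.0, 0.0, 0.0]
center_norm_sq = sum(c * c for c in center)   # 9.0

print(sparse_sq_dist(x, x_norm_sq, center, center_norm_sq))  # 16.0
```

This is essentially why MLlib stores and reuses vector norms in its k-means distance computations.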
Thanks for the useful links.
Cheers,
Julien
2014-08-21 11:47 GMT+02:00 Yanbo Liang yanboha...@gmail.com:
Hi,
I have a question about broadcast. I'm working on a clustering algorithm
close to KMeans. It seems that KMeans broadcasts the cluster centers at each
step. For the moment I just use my centers as an Array that I reference
directly in my map at each step. Could it be more efficient to use broadcast?
For large objects, broadcasting will be more efficient. If your array
is small it won't really matter. How many centers do you have? Unless you
find that you have very large tasks (and Spark will print a warning
about this), it can be okay to just reference the array directly.
On Wed,