Re: Broadcast vs simple variable

2014-08-21 Thread Yanbo Liang
In Spark/MLlib, shipping data such as the k-means cluster centers through task serialization was replaced by broadcast variables for performance reasons. You can refer to this PR: https://github.com/apache/spark/pull/1427. The current k-means implementation in MLlib also benefits from sparse vector computation.

Re: Broadcast vs simple variable

2014-08-21 Thread Julien Naour
Thanks for the useful links. Cheers, Julien

Broadcast vs simple variable

2014-08-20 Thread Julien Naour
Hi, I have a question about broadcast. I'm working on a clustering algorithm close to KMeans. It seems that KMeans broadcasts the cluster centers at each step. For the moment I just keep my centers in an Array that I reference directly in my map at each step. Could it be more efficient to use broadcast

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast them. If your array is small, it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference the array directly.
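
To make the two options discussed in this thread concrete, here is a minimal sketch in PySpark (an assumption: the original poster is probably using the Scala API, where the pattern is the same). The function names `closest_center`, `assign_by_closure`, `assign_by_broadcast` and `demo`, as well as the sample points and centers, are hypothetical and only for illustration. In the first variant the centers are captured in the closure and serialized with every task; in the second they are wrapped in a broadcast variable and read through `.value`.

```python
import math


def closest_center(point, centers):
    """Return the index of the center nearest to `point` (squared Euclidean distance)."""
    best_index, best_dist = -1, math.inf
    for i, center in enumerate(centers):
        dist = sum((p - c) ** 2 for p, c in zip(point, center))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def assign_by_closure(points_rdd, centers):
    # `centers` is captured by the lambda, so it is serialized along with every task.
    return points_rdd.map(lambda p: closest_center(p, centers))


def assign_by_broadcast(sc, points_rdd, centers):
    # The centers are wrapped in a broadcast variable; tasks read them via bc.value.
    bc = sc.broadcast(centers)
    return points_rdd.map(lambda p: closest_center(p, bc.value))


def demo():
    # Requires a local PySpark installation; imported here so the helpers above
    # stay usable without Spark.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-sketch")
    try:
        points = sc.parallelize([[0.0, 0.1], [0.9, 1.1], [5.0, 5.0]])
        centers = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
        print(assign_by_closure(points, centers).collect())
        print(assign_by_broadcast(sc, points, centers).collect())
    finally:
        sc.stop()
```

Calling `demo()` runs both variants on a local context. With a handful of small centers the two should behave about the same, which matches the advice above; the broadcast variant is the one to reach for once the centers become large enough that Spark starts warning about task size.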