In Spark MLlib, serializing large objects into each task, such as the cluster centers in k-means, was replaced by broadcast variables for performance reasons. You can refer to this PR: https://github.com/apache/spark/pull/1427 The current k-means implementation in MLlib also benefits from sparse vector computation: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
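A minimal sketch of why that PR matters, using plain Python rather than actual Spark code (the variable names and task count are illustrative assumptions, not from MLlib): data captured in a task closure is serialized with every task, while a broadcast variable is serialized once and reused.

```python
import pickle

# Hypothetical illustration (not actual Spark code): 17 centers, each a
# list of 100 floats, standing in for the k-means cluster centers.
centers = [[float(i + j) for j in range(100)] for i in range(17)]

num_tasks = 200  # assumed number of tasks in one stage

# Closure capture: the centers are serialized into every single task.
per_task_bytes = len(pickle.dumps(centers)) * num_tasks

# Broadcast: the centers are serialized once and shipped to each executor.
broadcast_bytes = len(pickle.dumps(centers))

# Closure capture ships num_tasks times as many bytes for this payload.
print(per_task_bytes // broadcast_bytes)
```

With many tasks per iteration and an iteration per k-means step, the saving compounds quickly, which is the motivation behind the PR above.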
2014-08-21 15:40 GMT+08:00 Julien Naour <julna...@gmail.com>:

> My Arrays are in fact Array[Array[Long]] and about 17x150000 (17 centers
> with 150,000 modalities; I'm working on qualitative variables), so they are
> pretty large. I'm working on making them smaller; it's mostly a sparse
> matrix.
> Good to know nevertheless.
>
> Thanks,
>
> Julien Naour
>
>
> 2014-08-20 23:27 GMT+02:00 Patrick Wendell <pwend...@gmail.com>:
>
>> For large objects, it will be more efficient to broadcast them. If your
>> array is small, it won't really matter. How many centers do you have?
>> Unless you find that you have very large tasks (Spark will print a
>> warning about this), it can be okay to just reference it directly.
>>
>>
>> On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour <julna...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question about broadcast. I'm working on a clustering algorithm
>>> close to KMeans. It seems that KMeans broadcasts cluster centers at each
>>> step. For the moment I just use my centers as an Array that I reference
>>> directly in my map at each step. Would it be more efficient to use
>>> broadcast instead of a simple variable?
>>>
>>> Cheers,
>>>
>>> Julien Naour
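Since the 17x150000 matrix in the thread above is described as mostly sparse, here is a minimal sketch of the idea behind sparse storage, in plain Python rather than MLlib's actual SparseVector (the sizes and non-zero pattern are made-up assumptions for illustration):

```python
# Hypothetical sketch: one center over 150,000 modalities where only a
# small fraction of entries are non-zero. A dense list stores every slot;
# a sparse mapping stores only (index, value) pairs for non-zero entries.
NUM_MODALITIES = 150_000

dense = [0] * NUM_MODALITIES
for i in range(0, NUM_MODALITIES, 1000):  # assume 150 non-zero entries
    dense[i] = 1

# Sparse representation: keep only the non-zero entries.
sparse = {i: v for i, v in enumerate(dense) if v != 0}

print(len(dense), len(sparse))  # 150000 dense slots vs 150 stored entries
```

With 17 such centers, the same pattern shrinks the per-iteration payload (and hence the serialized broadcast) by orders of magnitude, which is what the sparse-data talk linked at the top describes for MLlib.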