In Spark MLlib, shipping data such as the k-means cluster centers with each
task via closure serialization was replaced by broadcast variables for
performance reasons.
You can refer to this PR: https://github.com/apache/spark/pull/1427
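A minimal sketch of the idea (illustrative names like `centers` and
`points`, not the actual MLlib internals): instead of referencing the
centers array directly in the closure, wrap it in a broadcast variable so
it is shipped once per executor rather than serialized with every task.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCentersSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-broadcast"))

    // Hypothetical centers; in practice these come from the previous iteration.
    val centers: Array[Array[Double]] =
      Array(Array(0.0, 0.0), Array(1.0, 1.0))

    // Without broadcast, `centers` would be serialized into every task's
    // closure. Broadcasting ships it once per executor and caches it there.
    val bcCenters = sc.broadcast(centers)

    val points = sc.parallelize(Seq(Array(0.1, 0.2), Array(0.9, 1.1)))

    // Assign each point to the index of its nearest center
    // (squared Euclidean distance).
    val assignments = points.map { p =>
      val cs = bcCenters.value
      cs.indices.minBy { i =>
        cs(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
      }
    }

    assignments.collect().foreach(println)
    sc.stop()
  }
}
```

For small arrays the difference is negligible, but with large centers (as
in the 17x150000 case below) broadcasting avoids re-serializing the array
for every task in every iteration.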
Also, the current k-means implementation in MLlib benefits from sparse
vector computation.
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
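As a hedged sketch of the sparse representation mentioned above: for data
with many modalities but few non-zero entries per row, MLlib's
`Vectors.sparse` stores only the active indices and values (the size and
index values here are illustrative).

```scala
import org.apache.spark.mllib.linalg.Vectors

// A vector over 150,000 modalities with only three non-zero entries:
// only the indices (3, 42, 977) and their values are stored, instead of
// a dense array of 150,000 doubles.
val v = Vectors.sparse(150000, Array(3, 42, 977), Array(1.0, 1.0, 1.0))

// Distance computations in MLlib's k-means exploit this sparsity,
// touching only the non-zero coordinates where possible.
```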



2014-08-21 15:40 GMT+08:00 Julien Naour <julna...@gmail.com>:

> My arrays are in fact Array[Array[Long]] and about 17x150000 (17 centers
> with 150,000 modalities; I'm working on qualitative variables), so they are
> pretty large. I'm working on making them smaller; it's mostly a sparse
> matrix.
> Good to know nevertheless.
>
> Thanks,
>
> Julien Naour
>
>
> 2014-08-20 23:27 GMT+02:00 Patrick Wendell <pwend...@gmail.com>:
>
> For large objects, it will be more efficient to broadcast them. If your
>> array is small it won't really matter. How many centers do you have? Unless
>> you are finding that you have very large tasks (and Spark will print a
>> warning about this), it could be okay to just reference it directly.
>>
>>
>> On Wed, Aug 20, 2014 at 1:18 AM, Julien Naour <julna...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question about broadcast. I'm working on a clustering algorithm
>>> close to k-means. It seems that KMeans broadcasts the cluster centers at each
>>> step. For the moment I just use my centers as an Array that I reference directly in
>>> my map at each step. Would it be more efficient to use a broadcast variable instead of
>>> a plain variable?
>>>
>>> Cheers,
>>>
>>> Julien Naour
>>>
>>
>>
>
