Actually, for the first step, the client could create a file with the
centers, put it on HDFS and use it with the distributed cache.
A single reducer might be enough, and in that case its only responsibility is
to create the file with the updated centers.
You can then use this new file in the distributed cache instead of
the first one.

Your real input will always be your set of points.
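To make this concrete, here is a rough sketch of what the iterative driver
could look like. It assumes the Hadoop 2.x mapreduce API; the class names
(KMeansDriver, KMeansMapper, KMeansReducer), the /kmeans/... paths and the
fixed iteration count are placeholders of mine, not anything from your setup,
and a real driver would also compare the old and new centers files to decide
when to stop.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The client wrote the k initial centers to this HDFS file beforehand.
    String centersFile = "/kmeans/centers_0.txt";
    int maxIterations = 20;   // placeholder; stop earlier once the centers stop moving

    for (int i = 0; i < maxIterations; i++) {
      Job job = Job.getInstance(conf, "k-means iteration " + i);
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      job.setReducerClass(KMeansReducer.class);
      job.setNumReduceTasks(1);                      // one reducer -> one new centers file
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(NullWritable.class);     // reducer emits one line per new center
      job.setOutputValueClass(Text.class);

      // Ship the current centers to every mapper via the distributed cache.
      job.addCacheFile(new URI(centersFile));

      // The real input is always the full set of points.
      FileInputFormat.addInputPath(job, new Path("/kmeans/points"));
      Path out = new Path("/kmeans/centers_" + (i + 1));
      FileOutputFormat.setOutputPath(job, out);

      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("k-means iteration " + i + " failed");
      }

      // The next iteration reads the centers file the single reducer just produced.
      centersFile = out + "/part-r-00000";
    }
  }
}

Only the cache file changes between iterations; the input path with the
points stays the same.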

Regards

Bertrand

PS : One reducer should be enough because it only needs to aggregate the
partial updates from each mapper. The volume of data sent to the reducer will
grow with the number of centers, not with the number of points.
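For instance, each mapper can keep partial sums and counts per center and
only emit them in cleanup(), so the single reducer receives one record per
center from each mapper. A rough sketch, assuming points and centers are
stored one per line as comma-separated doubles (the parsing, the distance
function and the value format are only illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private final List<double[]> centers = new ArrayList<double[]>();
  private double[][] partialSums;   // running sum of the points assigned to each center
  private long[] counts;            // number of points assigned to each center

  @Override
  protected void setup(Context context) throws IOException {
    // Read the current centers; the file was attached by the driver with job.addCacheFile().
    // (Reading it straight from HDFS keeps the sketch short; the localized copy could be used instead.)
    URI centersUri = context.getCacheFiles()[0];
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(new Path(centersUri))))) {
      String line;
      while ((line = in.readLine()) != null) {
        centers.add(parse(line));
      }
    }
    partialSums = new double[centers.size()][centers.get(0).length];
    counts = new long[centers.size()];
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    double[] point = parse(value.toString());
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = squaredDistance(point, centers.get(i));
      if (d < best) { best = d; nearest = i; }
    }
    for (int j = 0; j < point.length; j++) {
      partialSums[nearest][j] += point[j];
    }
    counts[nearest]++;
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // One record per center, no matter how many points this split held.
    for (int i = 0; i < centers.size(); i++) {
      context.write(new IntWritable(i), new Text(join(partialSums[i]) + "|" + counts[i]));
    }
  }

  private static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
    return v;
  }

  private static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }

  private static String join(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < v.length; i++) { if (i > 0) sb.append(','); sb.append(v[i]); }
    return sb.toString();
  }
}

The reducer then only has to add up the partial sums and counts for each
center, divide, and write each new center as one comma-separated line, so the
resulting file can go straight back into the distributed cache for the next
iteration.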


On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <yaron.go...@gmail.com> wrote:

> Hi,
> I'd like to implement k-means by myself, in the following naive way:
> Given a large set of vectors:
>
>    1. Generate k random centers from the set.
>    2. Mapper reads all the centers and a split of the vector set, and emits
>    for each vector the closest center as a key.
>    3. Reducer calculates the new centers and writes them.
>    4. Go to step 2 until there is no change in the centers.
>
> My question is very basic: how do I distribute all the new centers
> (produced by the reducers) to all the mappers? I can't use the distributed
> cache since it's read-only. I can't use context.write since it will
> create a file for each reduce task, and I need a single file. The more
> general issue here is how to distribute data produced by the reducers to
> all the mappers?
>
> Thanks.
>
