Actually, for the first step, the client could create a file with the centers, put it on HDFS, and make it available through the distributed cache. A single reducer might be enough, and in that case its only responsibility is to create the file with the updated centers. You can then use this new file in the distributed cache for the next iteration, instead of the first one.
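To make the loop concrete, here is a minimal sketch in plain Python that only simulates the idea; it does not use any Hadoop API. The function names (`load_centers`, `map_split`, `reduce_all`, `kmeans`) and the JSON file layout are my own assumptions for illustration: a local JSON file stands in for the centers file on HDFS, and each list in `splits` stands in for one mapper's input split. Each "map task" returns one partial sum and count per center, and the single "reducer" merges them and writes the next centers file.

```python
import json
import math
import os
import tempfile


def load_centers(path):
    """Read the current centers, as every map task would from its cached copy."""
    with open(path) as f:
        return [tuple(c) for c in json.load(f)]


def map_split(points, centers):
    """One simulated map task: assign each point of a split to its closest
    center and return per-center partial sums and counts."""
    partial = {}
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        sums, count = partial.get(idx, ((0.0,) * len(p), 0))
        partial[idx] = (tuple(s + x for s, x in zip(sums, p)), count + 1)
    return partial


def reduce_all(partials, old_centers):
    """The single simulated reducer: merge the partials and compute new centers."""
    new_centers = list(old_centers)  # a center with no points keeps its position
    merged = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            msums, mcount = merged.get(idx, ((0.0,) * len(sums), 0))
            merged[idx] = (tuple(a + b for a, b in zip(msums, sums)), mcount + count)
    for idx, (sums, count) in merged.items():
        new_centers[idx] = tuple(s / count for s in sums)
    return new_centers


def kmeans(splits, initial_centers, workdir, max_iter=20):
    """Driver loop: write a centers file, run one simulated job, and repeat
    until the centers stop changing (or max_iter is reached)."""
    path = os.path.join(workdir, "centers-0.json")
    with open(path, "w") as f:
        json.dump(initial_centers, f)
    for it in range(1, max_iter + 1):
        centers = load_centers(path)
        partials = [map_split(split, centers) for split in splits]
        new_centers = reduce_all(partials, centers)
        # The new file replaces the previous one as input to the next iteration.
        path = os.path.join(workdir, f"centers-{it}.json")
        with open(path, "w") as f:
            json.dump(new_centers, f)
        if new_centers == centers:
            break
    return new_centers
```

For example, calling `kmeans` inside a `tempfile.TemporaryDirectory()` with two splits of two points each and two initial centers runs two iterations and stops once the centers are unchanged. In a real job the driver would do the same thing between iterations: point the next job at the centers file the reducer just wrote.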
Your real input will always be your set of points.

Regards

Bertrand

PS: One reducer should be enough because it only needs to aggregate the partial updates from each mapper. The volume of data sent to the reducer depends on the number of centers, not on the number of points.

On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <yaron.go...@gmail.com> wrote:

> Hi,
> I'd like to implement k-means by myself, in the following naive way:
> Given a large set of vectors:
>
>    1. Generate k random centers from the set.
>    2. Mapper reads all centers and a split of the vector set and emits
>    for each vector the closest center as a key.
>    3. Reducer calculates the new center and writes it.
>    4. Go to step 2 until no change in the centers.
>
> My question is very basic: how do I distribute all the new centers
> (produced by the reducers) to all the mappers? I can't use the distributed
> cache since it's read-only. I can't use context.write since it will
> create a file for each reduce task, and I need a single file. The more
> general issue here is how to distribute data produced by a reducer to all
> the mappers?
>
> Thanks.
>