Implement a clustering algorithm on mapreduce
---------------------------------------------

                 Key: MAHOUT-357
                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
             Project: Mahout
          Issue Type: New Feature
            Reporter: new user


As I mentioned in my previous posts that I am interseted in implementing the 
clustering algorithm on mapreduce.So,now I am going to tell what I have thought 
to implement this. Thinking of the k-means algorithm for clustering,it appears 
that the whole set of data has to copied on each of the nodes of the hadoop 
framework to process the data in each iteration of the k-means clustering. But, 
this can be done without useless replication  of data on the clusters.First of 
all, we select a set of k elements as the initial clusters.This can be purely 
random or decided on the basis of some criteria.We maintain a file which stores 
the id of each cluster, the number of elements in each cluster, and the exact 
position of the cluster in terms of its co-ordinates.This file has to be shared 
by each of the nodes. During each iteration of the algorithm, the following 
steps are done:
1. As each node has a part of the initial data,during the map phase, it 
calculates the distance of each of the element from the k cluster centroids. 
For each row,the smallest distance is chosen and the id of the cluster and the 
position of that element is stored.
2.During the combine phase, for each of the cluster, the average of the 
co-ordinates for all the elements is calculated and the number of elements in 
that cluster. So, the combiner funnction outputs the cluster id and the average 
co-ordinates of the elements.
3. During the reduce phase, the cluster centroid is re-calculated using the 
weighted averages of the co-ordinates.
Thus , after these 3 steps, the new value of centorid for each cluster and the 
number of elemnts in each cluster is updated.

The above three steps can be performed iteratively as long as the condition set 
for the convergence is not satisfied, by applying the map-combine-reduce phases 
again.
I have proposed this as per my understanding of the probelem and my knowledge. 
If anybody have any doubts or want to add anything or suggest anything 
anything,then please respond as soon as possible. And, if you consider it a 
good idea, then please suggest how to proceed further in the Gsoc procedure.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to