Implement a clustering algorithm on mapreduce ---------------------------------------------
Key: MAHOUT-357 URL: https://issues.apache.org/jira/browse/MAHOUT-357 Project: Mahout Issue Type: New Feature Reporter: new user As I mentioned in my previous posts that I am interseted in implementing the clustering algorithm on mapreduce.So,now I am going to tell what I have thought to implement this. Thinking of the k-means algorithm for clustering,it appears that the whole set of data has to copied on each of the nodes of the hadoop framework to process the data in each iteration of the k-means clustering. But, this can be done without useless replication of data on the clusters.First of all, we select a set of k elements as the initial clusters.This can be purely random or decided on the basis of some criteria.We maintain a file which stores the id of each cluster, the number of elements in each cluster, and the exact position of the cluster in terms of its co-ordinates.This file has to be shared by each of the nodes. During each iteration of the algorithm, the following steps are done: 1. As each node has a part of the initial data,during the map phase, it calculates the distance of each of the element from the k cluster centroids. For each row,the smallest distance is chosen and the id of the cluster and the position of that element is stored. 2.During the combine phase, for each of the cluster, the average of the co-ordinates for all the elements is calculated and the number of elements in that cluster. So, the combiner funnction outputs the cluster id and the average co-ordinates of the elements. 3. During the reduce phase, the cluster centroid is re-calculated using the weighted averages of the co-ordinates. Thus , after these 3 steps, the new value of centorid for each cluster and the number of elemnts in each cluster is updated. The above three steps can be performed iteratively as long as the condition set for the convergence is not satisfied, by applying the map-combine-reduce phases again. I have proposed this as per my understanding of the probelem and my knowledge. If anybody have any doubts or want to add anything or suggest anything anything,then please respond as soon as possible. And, if you consider it a good idea, then please suggest how to proceed further in the Gsoc procedure. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.