[ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852716#action_12852716 ]

new user commented on MAHOUT-357:
---------------------------------

Yes, it is the k-means algorithm, but the difference lies in an efficient 
implementation of k-means on a very large data set on the MapReduce 
framework. By efficient, I mean that you don't have to copy the whole data 
set to each of the nodes. In my implementation, the data is distributed 
equally among the nodes, and each of them processes only its own part of the 
data. But the result is the same as if the whole data set were processed on 
a single machine. So the strength of the algorithm lies in this local 
processing without copying the whole data set.
I hope I have answered your question. In case you meant something else, 
please elaborate.
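The equivalence claimed here — per-node partial results merging to the same answer as a single-machine pass — can be illustrated with a small sketch. This is illustrative only; the class and method names are hypothetical, not Mahout code, and it shows a single dimension for brevity:

```java
// Illustrative sketch (hypothetical names, not Mahout code): each node reduces
// its local partition to a (count, sum) pair; merging the pairs yields exactly
// the mean you would get by processing all rows on one machine.
public class PartialSumDemo {

    // Local pass over one node's partition: emit (count, sum).
    static double[] localPartial(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return new double[] { values.length, sum };
    }

    // Merge the per-node partials into the global mean.
    static double globalMean(double[][] partials) {
        double count = 0, sum = 0;
        for (double[] p : partials) { count += p[0]; sum += p[1]; }
        return sum / count;
    }
}
```

For example, partitions {1, 2} and {3, 4, 5} produce partials (2, 3) and (3, 12); merging gives 15 / 5 = 3, the same as the mean of all five values computed in one place.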

> Implement a clustering algorithm on mapreduce
> ---------------------------------------------
>
>                 Key: MAHOUT-357
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: new user
>
> As I mentioned in my previous posts, I am interested in implementing a 
> clustering algorithm on MapReduce. So now I am going to explain how I plan 
> to implement it. Thinking of the k-means algorithm for clustering, it 
> appears that the whole data set has to be copied to each of the nodes of 
> the Hadoop framework to process the data in each iteration of the k-means 
> clustering. But this can be done without needless replication of data 
> across the cluster. First of all, we select a set of k elements as the 
> initial cluster centroids. This can be purely random or decided on the 
> basis of some criteria. We maintain a file which stores the id of each 
> cluster, the number of elements in each cluster, and the exact position of 
> the cluster centroid in terms of its coordinates. This file has to be 
> shared by each of the nodes. During each iteration of the algorithm, the 
> following steps are performed:
> 1. As each node has a part of the data, during the map phase it calculates 
> the distance of each of its elements from the k cluster centroids. For 
> each row, the smallest distance is chosen, and the id of the nearest 
> cluster and the position of that element are emitted.
> 2. During the combine phase, for each cluster, the average of the 
> coordinates of all the local elements and the number of elements in that 
> cluster are calculated. So the combiner function outputs the cluster id, 
> the average coordinates of the elements, and their count.
> 3. During the reduce phase, each cluster centroid is re-calculated as the 
> weighted average of the combiner outputs, using the counts as weights.
> Thus, after these 3 steps, the new centroid of each cluster and the number 
> of elements in each cluster are updated.
> The above three steps are repeated, by applying the map-combine-reduce 
> phases again, as long as the condition set for convergence is not 
> satisfied.
> I have proposed this as per my understanding of the problem and my 
> knowledge. If anybody has any doubts, or wants to add or suggest anything, 
> then please respond as soon as possible. And if you consider it a good 
> idea, then please suggest how to proceed further in the GSoC process.
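The three steps described in the issue, and the iteration around them, can be sketched in plain Java. This is a minimal single-process sketch of the logic only: the class and method names are hypothetical, the convergence threshold is one common choice (the issue leaves the exact criterion open), and a real implementation would run the phases as Hadoop map, combine, and reduce tasks over the distributed partitions:

```java
// Minimal single-process sketch of the proposed map-combine-reduce k-means step.
// Hypothetical names; a real version splits these phases across Hadoop tasks.
public class KMeansSketch {

    // Map phase: index of the centroid nearest to a point (squared Euclidean distance).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Combine and reduce phases collapsed: per-cluster coordinate sums and
    // counts, then the weighted average that becomes the new centroid.
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            counts[c]++;
            for (int i = 0; i < dim; i++) sums[c][i] += p[i];
        }
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int i = 0; i < dim; i++)
                next[c][i] = counts[c] == 0 ? centroids[c][i] : sums[c][i] / counts[c];
        return next;
    }

    // Driver: repeat the step until no centroid moves more than eps
    // (one common convergence condition) or maxIter passes have run.
    static double[][] run(double[][] points, double[][] centroids,
                          double eps, int maxIter) {
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] next = step(points, centroids);
            double shift = 0;
            for (int c = 0; c < centroids.length; c++)
                for (int i = 0; i < centroids[c].length; i++)
                    shift = Math.max(shift, Math.abs(next[c][i] - centroids[c][i]));
            centroids = next;
            if (shift < eps) break;
        }
        return centroids;
    }
}
```

Note that because `step` only ever needs per-cluster sums and counts, each node can compute its local sums independently, which is exactly why the data set never needs to be replicated to every node.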

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
