I am writing my own map and reduce methods to implement the K-Means algorithm in Java on Hadoop 1.0.1. I have found some example K-Means implementations for Hadoop on blogs, but I don't want to copy their code; as a learner I want to implement it myself. So I just need some ideas or clues. Below is the work I have already done.
I have Point and Cluster classes, both of which are Writable. The Point class has an x coordinate, a y coordinate, and the Cluster to which the Point belongs. The Cluster class has an ArrayList that stores all the Point objects belonging to that Cluster, and it also has a centroid variable. I hope I am going in the right direction (if not, please correct me).

First of all, my input (a file containing point coordinates) must be loaded into Point objects; that is, each record in the input file must be mapped to a Point. This should be done ONCE in the map class (but how?). After each Point has its values, some random Clusters must be chosen in the initial phase (this must also be done only ONCE, but how?).

Next, every Point must be mapped to every Cluster, together with the distance between that Point and the Cluster's centroid. In the reduce method, every Point will be checked and assigned to the Cluster nearest to it (by comparing the distances). Then a new centroid is calculated for each Cluster.

Should map and reduce be called recursively? If yes, where would the initialization part go? By initialization I mean providing input to the Point objects (which must be done ONCE initially) and choosing the random centroids (which must also be done ONCE initially).

One more question: should the value of the parameter K (which decides the total number of clusters) be assigned by the user, or will Hadoop decide it itself?

Could somebody please explain? I don't need the code; I want to write it myself. I just need an approach. Thank you. -Ravi
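To make the flow I am describing concrete, here is a minimal plain-Java sketch of a single iteration, with no Hadoop classes involved. Everything in it is an assumption for illustration only: the names `KMeansSketch`, `nearestCentroid`, `mean` and `iterate` are hypothetical, points are plain `double[]` pairs instead of my Writable classes, and K and the starting centroids are simply given by the caller. Note that it picks the nearest centroid in the map-like step rather than emitting the distance to every cluster, which differs from what I described above and may or may not be the right split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of one K-means iteration, split roughly the way
// map and reduce might split it. All names are hypothetical.
class KMeansSketch {

    // "Map"-like step: index of the centroid nearest to the point (x, y).
    static int nearestCentroid(double x, double y, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = x - centroids[i][0];
            double dy = y - centroids[i][1];
            double dist = dx * dx + dy * dy; // squared distance is enough for comparing
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // "Reduce"-like step: the new centroid is the mean of the points in one cluster.
    static double[] mean(List<double[]> points) {
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.size(), sumY / points.size() };
    }

    // One iteration: group points by nearest centroid, then recompute each centroid.
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<Integer, List<double[]>>();
        for (double[] p : points) {
            int id = nearestCentroid(p[0], p[1], centroids);
            List<double[]> group = groups.get(id);
            if (group == null) {
                group = new ArrayList<double[]>();
                groups.put(id, group);
            }
            group.add(p);
        }
        double[][] next = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            List<double[]> group = groups.get(i);
            // A centroid that received no points is kept unchanged.
            next[i] = (group == null) ? centroids[i] : mean(group);
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1, 3}, {9, 9}, {11, 9} };
        double[][] centroids = { {0, 0}, {10, 10} }; // K = 2, supplied by the caller
        // Driver loop: repeat the iteration a fixed number of times.
        for (int round = 0; round < 5; round++) {
            centroids = iterate(points, centroids);
        }
        for (double[] c : centroids) {
            System.out.println(c[0] + "," + c[1]);
        }
    }
}
```

In this sketch the initial centroids are created once, outside the loop, and only `iterate` is repeated. My guess is that a Hadoop version would need something similar: a driver that sets up the starting centroids once and then launches one job per iteration.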