I am trying to implement the DBSCAN algorithm. I am following the algorithm in "Data Mining - Concepts and Techniques (3rd Ed)", chapter 10, page 474. In this algorithm we need to find the distance between each pair of points. Say my sample input is:
5,6
8,2
4,5
4,6
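To make the distance step concrete, here is a minimal single-machine sketch in plain Java (no MapReduce) of the epsilon-neighbourhood lookup for the sample input. The class name EpsilonNeighbours, the method names, the use of Euclidean distance and the value eps = 1.5 are all assumptions made for illustration; they do not come from the book.

```java
import java.util.ArrayList;
import java.util.List;

class EpsilonNeighbours {
    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of all points within eps of points[index].
    // The point itself is included, since its distance to itself is 0.
    static List<Integer> neighbours(double[][] points, int index, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (distance(points[index], points[j]) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input from this post.
        double[][] points = { {5, 6}, {8, 2}, {4, 5}, {4, 6} };
        double eps = 1.5; // assumed radius, for illustration only
        System.out.println(neighbours(points, 0, eps));
    }
}
```

With these assumed values, the neighbourhood of (5,6) contains (5,6) itself, (4,5) and (4,6), but not (8,2), which is at distance 5. Each lookup compares one point against every other point, which is why the whole data set is needed.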
So in DBSCAN we have to pick one element and then find its distance to all the others. When implementing this in MapReduce, I cannot get the whole file into a single map call in order to compute those distances. I tried some approaches:

1. Used WholeFileInput and did the entire algorithm in the map itself. I don't think this is a good approach (and it ended up with a heap space error).

2. This one is not implemented, as I thought it was not feasible:
- Read one line of the input data set in the driver and write it to a new file (say, centroid).
- Read this centroid in setup, calculate the distance in the map, and emit the records that satisfy the DBSCAN condition as (id, epsilon neighbour).
- In the reducer, aggregate all the epsilon neighbours of (5,6), which come from different maps, and then find the neighbours of each epsilon neighbour.
- For the next iteration, do the same again: read the input file, find a node that has not been visited, and so on.

If the input is a 1 GB file, the MR job would run as many times as there are records.

Can anyone suggest a better way to do this? I hope the use case is understandable; if not, please tell me and I will explain further.

--
Thanks & Regards
Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Center for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/