Well, the default clusterFilter is 0, so that check is not what makes the implementations differ. When you talk about distributing similar vectors to each mapper, you are really moving into a hierarchical clustering method, where you cluster your input points into a few large clusters and then cluster each cluster subset again. This can be done by scripting any clustering algorithm and might be effective with canopy.
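A rough sketch of that two-level idea, in plain Java rather than Mahout code. Everything here is hypothetical and only for illustration: the class and method names are made up, the canopy pass is simplified to use only the tight threshold (T2) to pick centers, and each point is assigned to its nearest coarse center instead of using T1 membership.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
final class TwoLevelCanopySketch {

    // Euclidean distance between two points of equal dimension.
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Simplified canopy pass: take the first remaining point as a center,
    // then drop every remaining point closer than t2 to that center.
    public static List<double[]> canopyCenters(List<double[]> points, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center);
            remaining.removeIf(p -> distance(center, p) < t2);
        }
        return centers;
    }

    // Partition the points by nearest center (assumption: a hard assignment,
    // whereas real canopies may overlap).
    public static List<List<double[]>> assignToNearest(List<double[]> points,
                                                       List<double[]> centers) {
        List<List<double[]>> groups = new ArrayList<>();
        for (int i = 0; i < centers.size(); i++) {
            groups.add(new ArrayList<>());
        }
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < centers.size(); c++) {
                if (distance(p, centers.get(c)) < distance(p, centers.get(best))) {
                    best = c;
                }
            }
            groups.get(best).add(p);
        }
        return groups;
    }

    // Level 1: a few large clusters. Level 2: cluster each subset again.
    public static List<double[]> twoLevel(List<double[]> points,
                                          double coarseT2, double fineT2) {
        List<double[]> coarse = canopyCenters(points, coarseT2);
        List<double[]> fine = new ArrayList<>();
        for (List<double[]> subset : assignToNearest(points, coarse)) {
            fine.addAll(canopyCenters(subset, fineT2));
        }
        return fine;
    }

    public static void main(String[] args) {
        List<double[]> pts = Arrays.asList(
            new double[] {0.0, 0.0}, new double[] {0.5, 0.0}, new double[] {3.0, 0.0},
            new double[] {10.0, 0.0}, new double[] {10.5, 0.0}, new double[] {13.0, 0.0});
        System.out.println("coarse centers: " + canopyCenters(pts, 5.0).size());
        System.out.println("fine centers:   " + twoLevel(pts, 5.0, 1.0).size());
    }
}
```

In a real job each coarse subset could be handed to its own mapper, so that similar vectors end up together; the sketch only shows the control flow, not the Hadoop side.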
-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Sunday, October 02, 2011 10:56 PM
To: [email protected]
Subject: Re: Difference in results : Clustering : sequential and MapReduce

I found the reason for the difference. It is due to the check
if (canopy.getNumPoints() > clusterFilter) in CanopyMapper. Similar data is
not distributed evenly across the mappers, so canopies might come out with
too few points to pass this check, and those are not processed further.

But this check is a great performance enhancer; I have experienced that.
Maybe distributing similar vectors to the same mappers might help to attain
both quality and performance.

On 03-10-2011 09:29, Paritosh Ranjan wrote:
> The sequential algorithm finds more/better clusters than the
> mapreduce one.
> There's not a huge difference, but the standalone one is better for sure.
>
> Thanks and Regards,
> Paritosh
>
> On 03-10-2011 01:47, Konstantin Shmakov wrote:
>> I'd assume that distributed and sequential algorithms shouldn't produce
>> identical results. To start with, they differ in initial setup:
>> -- In the distributed algorithm each mapper deals with a subset of the
>> data and starts by picking a random point, so N random points are
>> picked by N mappers to start with.
>> -- In the sequential algorithm 1 mapper deals with all the data and
>> starts by picking 1 random point.
>> But for data with real clusters both algorithms should produce similar
>> results. How different are the results in your case?
>>
>> Thanks
>> --Konstantin
>>
>> On Sun, Oct 2, 2011 at 1:36 AM, Paritosh Ranjan<[email protected]> wrote:
>>
>>> Even run() of CanopyDriver, which takes only T1 and T2, is giving
>>> different results for sequential and mapreduce.
>>> This is preventing me from scaling up, as I need to run mapreduce on
>>> hadoop to scale.
>>>
>>> Does anyone have any idea about this problem?
>>>
>>> On 02-10-2011 00:27, Paritosh Ranjan wrote:
>>>
>>>> Hi,
>>>>
>>>> I am able to cluster correctly sequentially, using CanopyDriver.
>>>>
>>>> However, the same dataset, when processed as a MapReduce job, where
>>>> (t1 = t3 and t2 = t4 and t1 > t2), is not working. I am getting
>>>> errors like "Canopies are empty".
>>>>
>>>> I also tried to reduce the values of t3 and t4. But reducing them
>>>> either has no effect or gives meaningless results.
>>>>
>>>> Am I doing something wrong? Or is there a bug somewhere?
>>>>
>>>> I feel that both sequential and MapReduce should give similar
>>>> results. But it is not happening.
>>>>
>>>> Thanks and Regards,
>>>> Paritosh
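For anyone following the thread, here is a small sketch of how a check of the form numPoints > clusterFilter could lose a cluster when its points are spread over several mappers. This is not Mahout's code: the class and method names are hypothetical, the data is 1-D, only a tight threshold (T2) is used, and the reducer step is not modelled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; mimics a mapper-side size filter.
final class ClusterFilterSketch {

    // Simplified 1-D canopy pass over one mapper's share of the data.
    // Returns the size of every canopy that passes numPoints > clusterFilter.
    public static List<Integer> canopySizes(List<Double> points, double t2,
                                            int clusterFilter) {
        List<Double> remaining = new ArrayList<>(points);
        List<Integer> kept = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double center = remaining.remove(0);
            int numPoints = 1; // the center itself
            for (int i = remaining.size() - 1; i >= 0; i--) {
                if (Math.abs(remaining.get(i) - center) < t2) {
                    remaining.remove(i);
                    numPoints++;
                }
            }
            if (numPoints > clusterFilter) {
                kept.add(numPoints);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two real clusters of four points each, near 0 and near 10.
        List<Double> all = Arrays.asList(0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3);
        // The same data, with the second cluster split across two "mappers".
        List<Double> mapper1 = Arrays.asList(0.0, 0.1, 0.2, 0.3, 10.0, 10.1);
        List<Double> mapper2 = Arrays.asList(10.2, 10.3);

        System.out.println("sequential, filter 2: " + canopySizes(all, 1.0, 2));
        System.out.println("mapper 1,   filter 2: " + canopySizes(mapper1, 1.0, 2));
        System.out.println("mapper 2,   filter 2: " + canopySizes(mapper2, 1.0, 2));
        System.out.println("mapper 1,   filter 0: " + canopySizes(mapper1, 1.0, 0));
    }
}
```

With a filter of 2, the single pass keeps both clusters, while the split run keeps only the one near 0, because each half of the other cluster has just two points. With a filter of 0, every canopy passes, since a canopy always contains at least its center.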
