RE: Difference in results : Clustering : sequential and MapReduce

Jeff Eastman Mon, 03 Oct 2011 10:26:42 -0700

Well, the default clusterFilter is 0, so unless you have set it explicitly, 
that check is not the difference between the implementations. When you talk 
about distributing similar vectors to each mapper, you are really moving into 
a hierarchical clustering method, where you cluster your input points into a 
few large clusters and then cluster each cluster subset again. This can be 
done by scripting any clustering algorithm and might be effective with canopy.
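
A rough sketch of that two-level idea, in plain Java rather than Mahout code. 
Everything here is an assumption made for illustration: the class and method 
names are hypothetical, the points are 1-D, and the canopy pass is reduced to 
a single T2-style threshold (a point starts a new canopy unless it lies within 
the threshold of an existing center).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
public class TwoLevelCanopySketch {

    /** Simplified canopy pass: a point becomes a new center unless it is within t2 of an existing center. */
    static List<Double> canopyCenters(List<Double> points, double t2) {
        List<Double> centers = new ArrayList<>();
        for (double p : points) {
            boolean covered = false;
            for (double c : centers) {
                if (Math.abs(p - c) < t2) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }

    /** Groups every point under its nearest center. */
    static Map<Double, List<Double>> assignToNearest(List<Double> points, List<Double> centers) {
        Map<Double, List<Double>> groups = new LinkedHashMap<>();
        for (double c : centers) {
            groups.put(c, new ArrayList<>());
        }
        for (double p : points) {
            double best = centers.get(0);
            for (double c : centers) {
                if (Math.abs(p - c) < Math.abs(p - best)) {
                    best = c;
                }
            }
            groups.get(best).add(p);
        }
        return groups;
    }

    /** Coarse pass with a large threshold, then a fine pass inside each coarse group. */
    static Map<Double, List<Double>> twoLevel(List<Double> points, double coarseT2, double fineT2) {
        Map<Double, List<Double>> fineCentersByCoarseCenter = new LinkedHashMap<>();
        List<Double> coarse = canopyCenters(points, coarseT2);
        for (Map.Entry<Double, List<Double>> e : assignToNearest(points, coarse).entrySet()) {
            fineCentersByCoarseCenter.put(e.getKey(), canopyCenters(e.getValue(), fineT2));
        }
        return fineCentersByCoarseCenter;
    }
}
```

In a real job, each coarse group would be written out as its own input and the 
second pass run per group, so that similar vectors end up in the same mapper.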


-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Sunday, October 02, 2011 10:56 PM
To: [email protected]
Subject: Re: Difference in results : Clustering : sequential and MapReduce

I found the reason for the difference.
It is due to this check in CanopyMapper:

if (canopy.getNumPoints() > clusterFilter)

Similar data is not distributed evenly across the mappers. So some canopies 
may end up with no more than clusterFilter points in a given mapper, and 
those canopies are not processed further.
But this check is a great performance enhancer; I have seen that in practice.

Maybe distributing similar vectors to the same mapper would help to achieve 
both quality and performance.
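
To make the effect concrete, here is a minimal plain-Java sketch (not Mahout 
code). The class and method names, the 1-D points and the split layout are 
all hypothetical, and the canopy pass is a simplified T1/T2 scheme that may 
differ in detail from what CanopyMapper does. It only counts how many 
canopies pass a numPoints > clusterFilter check when the same data is seen 
as one split versus several.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
public class CanopyFilterSketch {

    /** A canopy over 1-D points: a fixed center and a count of points seen within T1. */
    static final class Canopy {
        final double center;
        int numPoints = 1; // the point that created the canopy

        Canopy(double center) {
            this.center = center;
        }
    }

    /** Single pass over one split: count points within t1, start a new canopy if none is within t2. */
    static List<Canopy> buildCanopies(double[] points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = Math.abs(p - c.center);
                if (d < t1) {
                    c.numPoints++;
                }
                if (d < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(p));
            }
        }
        return canopies;
    }

    /** Builds canopies per split and counts those with numPoints > clusterFilter. */
    static int survivingCanopies(double[][] splits, double t1, double t2, int clusterFilter) {
        int survivors = 0;
        for (double[] split : splits) {
            for (Canopy c : buildCanopies(split, t1, t2)) {
                if (c.numPoints > clusterFilter) {
                    survivors++;
                }
            }
        }
        return survivors;
    }
}
```

With the points {0.0, 0.1, 0.2, 0.3, 10.0, 10.1}, t1 = 1.0, t2 = 0.5 and 
clusterFilter = 1, a single split keeps both canopies. Splitting the same 
points as {0.0, 0.1, 0.2, 10.0} and {0.3, 10.1} keeps only one, because the 
canopy near 10 never collects more than one point in either split. With 
clusterFilter = 0 nothing is dropped in either case.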


On 03-10-2011 09:29, Paritosh Ranjan wrote:
> The sequential algorithm finds more/better clusters than the MapReduce one.
> There's not a huge difference, but the standalone one is better for sure.
>
> Thanks and Regards,
> Paritosh
>
> On 03-10-2011 01:47, Konstantin Shmakov wrote:
>> I'd assume that the distributed and sequential algorithms shouldn't
>> produce identical results. To start with, they differ in initial setup:
>> -- In the distributed algorithm, each mapper deals with a subset of the
>> data and starts by picking a random point, so N random points are picked
>> by N mappers to start with.
>> -- In the sequential algorithm, 1 mapper deals with all the data and
>> starts by picking 1 random point.
>> But for data with real clusters, both algorithms should produce similar
>> results. How different are the results in your case?
>>
>> Thanks
>> --Konstantin
>>
>> On Sun, Oct 2, 2011 at 1:36 AM, Paritosh Ranjan <[email protected]> wrote:
>>
>>> Even run() of CanopyDriver, which takes only T1 and T2, gives different
>>> results for sequential and MapReduce execution.
>>> This is preventing me from scaling up, as I need to run MapReduce on
>>> Hadoop to scale.
>>>
>>> Does anyone have any idea what causes this problem?
>>>
>>> On 02-10-2011 00:27, Paritosh Ranjan wrote:
>>>
>>>> Hi,
>>>>
>>>> I am able to cluster correctly in sequential mode, using CanopyDriver.
>>>>
>>>> However, the same dataset, when processed as a MapReduce job (where
>>>> t1 = t3, t2 = t4 and t1 > t2), is not working. I am getting errors
>>>> like Canopies are empty.
>>>>
>>>> I also tried reducing the values of t3 and t4, but reducing them
>>>> either has no effect or gives meaningless results.
>>>>
>>>> Am I doing something wrong, or is there a bug somewhere?
>>>>
>>>> I feel that sequential and MapReduce execution should give similar
>>>> results, but that is not happening.
>>>>
>>>> Thanks and Regards,
>>>> Paritosh
>>>>
>>>>
>>>
>>
>
