Re: Canopy implementation

Benson Margulies Sat, 30 May 2009 13:21:20 -0700

I'll look at the copy in DisplayKMeans again and see if it is missing that
last test.


On Sat, May 30, 2009 at 12:41 PM, Jeff Eastman
<j...@windwardsolutions.com>wrote:

> Canopy tests each point against the current set of canopies, adding the
> point to each canopy that is within t1 and finally stopping when it finds
> one within t2. If all canopies are tested and none are within t2 then a new
> canopy is added with the point as its center. So, even if you set t1 and t2
> badly, the worst you will see is one canopy for each point in the data set.
>
>
> Benson Margulies wrote:
>
>> I've looked at the implementation of Canopy in DisplayKMeans, and then
>> tried
>> to compare it to the MapReduce version.
>>
>> I'm sure that the simple version in DisplayKMeans has the potential to
>> loop
>> indefinitely. I can't prove that the problem exists in the real map/reduce
>> code one way or the other.
>>
>> The 'problem' is this. If you make bad choices for t1 and t2, the
>> algorithm
>> doesn't terminate. If no points make it inside of t2, the process never
>> stops. Since the range of distances depends entirely on the vectors and
>> the
>> distance metric, there's no way without pre-processing to ensure that the
>> values avoid a loop.
>>
>> Now, McCallum et. al. doesn't formally present the algorithm with t1 and
>> t2.
>> They describe it informally as one example of their main idea: initial
>> partition with a cheap distance metric followed by detailed clustering
>> using
>> an expensive one. It is possible that the 'cross-validate' that they
>> mention
>> and which is copied into the comments in Mahout avoids this problem by
>> estimating t1 and t2 from a survey of the data.
>>
>> It might be a good idea to implement a cross validation capability, or at
>> least a maximum iteration limit.
>>
>> All this is conditional on the notion that the MapReduce version in the
>> end
>> has the same 'iterate until cooked' . The code in     addPointToCanopies
>> looks to have the same potential, as it could fail to strongly bind enough
>> points to run out of thing to do. Or not. I expect to learn that I don't
>> understand this well enough yet.
>>
>>
>>
>
>

Re: Canopy implementation

Reply via email to