I'll look at the copy in DisplayKMeans again and see if it is missing that last test.
On Sat, May 30, 2009 at 12:41 PM, Jeff Eastman <j...@windwardsolutions.com>wrote: > Canopy tests each point against the current set of canopies, adding the > point to each canopy that is within t1 and finally stopping when it finds > one within t2. If all canopies are tested and none are within t2 then a new > canopy is added with the point as its center. So, even if you set t1 and t2 > badly, the worst you will see is one canopy for each point in the data set. > > > Benson Margulies wrote: > >> I've looked at the implementation of Canopy in DisplayKMeans, and then >> tried >> to compare it to the MapReduce version. >> >> I'm sure that the simple version in DisplayKMeans has the potential to >> loop >> indefinitely. I can't prove that the problem exists in the real map/reduce >> code one way or the other. >> >> The 'problem' is this. If you make bad choices for t1 and t2, the >> algorithm >> doesn't terminate. If no points make it inside of t2, the process never >> stops. Since the range of distances depends entirely on the vectors and >> the >> distance metric, there's no way without pre-processing to ensure that the >> values avoid a loop. >> >> Now, McCallum et. al. doesn't formally present the algorithm with t1 and >> t2. >> They describe it informally as one example of their main idea: initial >> partition with a cheap distance metric followed by detailed clustering >> using >> an expensive one. It is possible that the 'cross-validate' that they >> mention >> and which is copied into the comments in Mahout avoids this problem by >> estimating t1 and t2 from a survey of the data. >> >> It might be a good idea to implement a cross validation capability, or at >> least a maximum iteration limit. >> >> All this is conditional on the notion that the MapReduce version in the >> end >> has the same 'iterate until cooked' . The code in addPointToCanopies >> looks to have the same potential, as it could fail to strongly bind enough >> points to run out of thing to do. Or not. I expect to learn that I don't >> understand this well enough yet. >> >> >> > >