I added some test code to detect duplicate boundPoint entries and can
duplicate the issue in a unit test. I will see what is happening and let
you know.
Jeff
Jeff Eastman wrote:
Christoph Hermann wrote:
Am Montag, 25. Januar 2010 schrieben Sie:
Hello,
The Meanshift canopy keeps copies of all the input points it has
accreted. It does this for bookkeeping purposes, so that points can
be associated with each canopy when it is done, but this clearly
does not scale and is currently a showstopper for its utility in
large problems (despite the M/R implementation, a large number of
points will converge to a smaller number of very large cluster
descriptions). I've considered two ways to improve this situation:
1) associate identifiers with each point and just store the ids
instead of the whole point; 2) write out the accreted/merged
canopies to a separate log file so that final cluster membership can
be calculated after the fact. Option 1 would be the easiest to
implement but would only give a constant-factor improvement in
space. Option 2 would solve the cluster space problem but would
introduce another post-processing step to track the cluster merges.
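Option 1 might look something like the following sketch. This is plain Java with arrays, not the actual Mahout MeanShiftCanopy class; the class and method names here are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option 1: a canopy that records only the ids of accreted
// points rather than full copies. Illustrative names, not Mahout code.
public class IdCanopy {
    private final List<Integer> boundPointIds = new ArrayList<>();
    private final double[] center;
    private int nPoints;

    public IdCanopy(double[] seed, int seedId) {
        this.center = seed.clone();
        this.nPoints = 1;
        boundPointIds.add(seedId);
    }

    // Accrete a point: update the running mean but keep only its id.
    public void touch(double[] point, int pointId) {
        for (int i = 0; i < center.length; i++) {
            center[i] = (center[i] * nPoints + point[i]) / (nPoints + 1);
        }
        nPoints++;
        boundPointIds.add(pointId);
    }

    public List<Integer> getBoundPointIds() { return boundPointIds; }
    public double[] getCenter() { return center; }
}
```

The canopy still grows linearly in the number of accreted points, but an id is much smaller than a full vector copy, which is the constant-factor saving.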
Ok, that's good to know; thanks for the explanation.
I'm currently running some small experiments with the different
clustering implementations in Mahout, and to be honest, although I
think I understand how the algorithms work, I have some problems
with the code.
I chose MeanShift first because it allows me to start without
specifying the number of clusters, and because I easily found out how
to check which vector belongs to which canopy after running the
algorithm.
Yes, MeanShift is unique in that it does that bookkeeping as it goes.
It is, however, not scalable and needs another accounting mechanism.
Unlike the other clustering algorithms, which define symmetrical
regions of n-space for each cluster, Meanshift clusters are
asymmetric and so points cannot be clustered after the fact using
just the cluster centers and distance measure.
Ok. Since a 'Cluster' (using k-means) does not seem to have
references to the vectors it contains, how do I get all the vectors
belonging to one cluster?
Do I have to iterate over all points and calculate the nearest
cluster again? That looks very inefficient to me; maybe you can point
me in the right direction.
Often it is the parameters of the clusters which are of primary
interest, and not so much the actual point assignments. Making another
pass over the data to assign points to clusters - when needed - does
seem inefficient but is really not that onerous since it is highly
parallel.
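That extra pass is essentially a nearest-center scan over the points, which is why it parallelizes well: each point's assignment is independent. A minimal sketch in plain Java with arrays (not the Mahout driver API) might be:

```java
// Sketch of the post-clustering assignment pass: for each point, find
// the cluster center at minimal distance. Each point is independent,
// so this loop is trivially parallel (or map-reducible).
public class AssignPass {
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns, for each point, the index of its nearest center.
    static int[] assign(double[][] points, double[][] centers) {
        int[] result = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            int best = 0;
            double bestDist = distanceSquared(points[p], centers[0]);
            for (int c = 1; c < centers.length; c++) {
                double d = distanceSquared(points[p], centers[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            result[p] = best;
        }
        return result;
    }
}
```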
What I need to do is locate a vector in the list of clusters and
list all the other vectors belonging to the same cluster.
Canopy and k-means both embody the notion that a point 'belongs' to
only one cluster: the one at minimal distance. Each has
an optional, separate post-processing step to assign the points using
that metric. Fuzzy k-means and Dirichlet both allow points to 'belong'
to more than one cluster, making this determination more
probabilistic. The former does it by adding a fraction of each point's
value to each cluster center calculation depending upon the distance
measure. Dirichlet uses sampling methods to assign each point to a
likely cluster (model) during each iteration but usually not to the
same cluster in subsequent iterations.
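The fractional contributions in fuzzy k-means follow the standard fuzzy c-means membership formula, u_j = 1 / sum_c (d_j/d_c)^(2/(m-1)) with fuzziness m; this sketch is that textbook formula in plain Java, not the Mahout FuzzyKMeans code itself:

```java
// Sketch of fuzzy k-means membership weights: each point contributes a
// fraction of itself to every cluster, weighted by relative distance.
// Standard fuzzy c-means formula with fuzziness m > 1.
public class FuzzyMembership {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // u[j] = 1 / sum_c (d[j] / d[c])^(2/(m-1)); the weights sum to 1.
    static double[] weights(double[] point, double[][] centers, double m) {
        int k = centers.length;
        double[] d = new double[k];
        for (int j = 0; j < k; j++) {
            d[j] = distance(point, centers[j]);
            if (d[j] == 0.0) {            // point coincides with a center
                double[] u = new double[k];
                u[j] = 1.0;
                return u;
            }
        }
        double exp = 2.0 / (m - 1.0);
        double[] u = new double[k];
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int c = 0; c < k; c++) {
                sum += Math.pow(d[j] / d[c], exp);
            }
            u[j] = 1.0 / sum;
        }
        return u;
    }
}
```

A point midway between two centers gets weight 0.5 for each; hard k-means is the limit where all the weight goes to the nearest center.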
I'm not sure why you are getting duplicate copies of the same point
in your canopy. Your code looks like it was derived from the
testReferenceImplementation unit test but has some minor differences.
Why, since the code adds all the points to a new set of canopies
before iterating, are you passing in 'canopies' as an argument? Can
you say more about your input data set and the T1 & T2 values you
used? How many iterations occurred? What was your convergence test
value?
Actually, I took it from the "DisplayMeanShift" class.
I'm passing in "canopies" (which is currently just an empty list)
because I wanted to extend the code later on. For now it's unused.
My input data is a distribution of downloads of files, so for each
day I have the day, the id of a file, and the number of downloads.
I'm selecting the download values for a certain period of time
(e.g., 7 days, starting at day x) and trying to cluster similar
download patterns.
T1 and T2 are varying; I'm still trying to find the best values.
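From that description, each file presumably becomes one vector of daily download counts over the window. A hedged sketch of building such vectors (plain Java; the record layout and names here are assumptions, not Christoph's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of building the clustering input described above: for each
// file id, a windowDays-element vector of daily download counts.
// Record layout {day, fileId, downloads} is an illustrative assumption.
public class DownloadVectors {
    // records: {day (0..windowDays-1 within the window), fileId, downloads}
    static Map<Integer, double[]> build(int[][] records, int windowDays) {
        Map<Integer, double[]> vectors = new HashMap<>();
        for (int[] r : records) {
            int day = r[0], fileId = r[1], downloads = r[2];
            double[] v = vectors.computeIfAbsent(fileId,
                    id -> new double[windowDays]);
            v[day] += downloads;
        }
        return vectors;
    }
}
```

Days with no record stay at zero, so every file's vector has the same dimensionality, which the distance measure requires.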
Finally, our Vector library has improved its asFormatString in a
number of areas, but at the cost of readability. This makes debugging
terribly difficult, and some sort of debuggable formatter is needed.
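As a stopgap, a debug formatter can be very small. This sketch prints index:value pairs over a plain double[] and skips zeros so sparse vectors stay short; it is not tied to the Mahout Vector interface or to the existing asFormatString.

```java
// Sketch of a debug-friendly vector formatter: "index:value" pairs,
// skipping zero entries so sparse vectors print compactly.
public class DebugFormat {
    static String format(double[] v) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (int i = 0; i < v.length; i++) {
            if (v[i] == 0.0) continue;
            if (!first) sb.append(", ");
            sb.append(i).append(':').append(v[i]);
            first = false;
        }
        return sb.append(']').toString();
    }
}
```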
At least it shows me the ids and values of the vector; that's how I
found out the canopy contains duplicates. I'll improve that later on.
The mean shift canopy should never contain duplicates. If you can
figure out why this is occurring in your runs I'd be very interested
in fixing it. If you can send me a copy of your dataset I will look
into it too.
Jeff