Hi Christoph,
The Meanshift canopy keeps copies of all the input points it has
accreted. It does this for bookkeeping purposes, so that points can be
associated with each canopy when it is done, but this clearly does not
scale and is currently a showstopper for its utility in large problems
(despite the M/R implementation, a large number of points will converge
to a smaller number of very large cluster descriptions). I've considered
two ways to improve this situation: 1) associate identifiers with each
point and just store the ids instead of the whole point; 2) write out
the accreted/merged canopies to a separate log file so that final
cluster membership can be calculated after the fact. Option 1 would be
the easiest to implement but would only give a constant-factor
improvement in space. Option 2 would solve the cluster space problem but
would introduce another post-processing step to track the cluster merges.
Unlike the other clustering algorithms, which define symmetrical regions
of n-space for each cluster, Meanshift clusters are asymmetric and so
points cannot be clustered after the fact using just the cluster centers
and distance measure.
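To make the two options a bit more concrete, here is a rough sketch of
how they could combine: canopies carry only point ids, each merge is
logged, and final membership is resolved in a later pass. This is not
Mahout code. The MergeLog class, its method names, and the assumption
that every point starts in its own canopy with the same integer id are
all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Mahout: records canopy merges so that
// cluster membership can be computed after the iterations finish.
class MergeLog {
    // absorbed canopy id -> id of the canopy that absorbed it
    private final Map<Integer, Integer> mergedInto = new HashMap<>();

    // Record that canopy 'absorbed' was merged into canopy 'survivor'.
    // Assumes an absorbed canopy never absorbs another canopy afterwards,
    // so the merge chains cannot form a cycle.
    void recordMerge(int absorbed, int survivor) {
        mergedInto.put(absorbed, survivor);
    }

    // Follow the merge chain until reaching a canopy that was never absorbed.
    int finalCanopy(int canopyId) {
        int current = canopyId;
        while (mergedInto.containsKey(current)) {
            current = mergedInto.get(current);
        }
        return current;
    }

    // Group point ids by final canopy. Assumes (for illustration only) that
    // point i started out in its own canopy with id i.
    Map<Integer, List<Integer>> membership(int numPoints) {
        Map<Integer, List<Integer>> result = new HashMap<>();
        for (int pointId = 0; pointId < numPoints; pointId++) {
            result.computeIfAbsent(finalCanopy(pointId), k -> new ArrayList<>())
                  .add(pointId);
        }
        return result;
    }
}
```

In an M/R setting the log would presumably be written to a file rather
than held in memory, and the membership pass would be the extra
post-processing step mentioned above.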
I'm not sure why you are getting duplicate copies of the same point in
your canopy. Your code looks like it was derived from the
testReferenceImplementation unit test but has some minor differences.
Why, since the code adds all the points to a new set of canopies before
iterating, are you passing in 'canopies' as an argument? Can you say
more about your input data set and the T1 & T2 values you used? How many
iterations occurred? What was your convergence test value?
Finally, our Vector library has improved its asFormatString in a number
of areas but at the cost of readability. This makes debugging terribly
difficult, and some sort of debuggable formatter is needed.
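As a sketch of what I mean by a debuggable formatter, something along
these lines would already help. It is a hypothetical helper working on a
plain double[] rather than on our Vector classes. The class name, the
"index:value" layout, and the three-decimal rounding are assumptions for
illustration only.

```java
import java.util.Locale;

// Hypothetical debugging helper, not part of the Vector library.
class DebugVectorFormat {
    // Render the non-zero entries as "[index:value, ...]", rounded to
    // three decimal places so that vectors are easy to compare by eye.
    static String format(double[] values) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // skip zeros to keep sparse vectors short
            }
            if (!first) {
                sb.append(", ");
            }
            sb.append(i).append(':')
              .append(String.format(Locale.ROOT, "%.3f", values[i]));
            first = false;
        }
        return sb.append(']').toString();
    }
}
```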
Jeff
Christoph Hermann wrote:
Hello,
I'm running some clustering with Mean Shift, and in my final canopy I
get the same vector five times.
In the original input list I only had it once, so I'm wondering why
duplicates are allowed within the same canopy.
Attached is a file with the method I'm using to run Mean Shift, as well
as the output (I'm iterating over the getBoundPoints() list of the
canopy).
I'd be happy if someone could explain this.
regards
Christoph Hermann