[Senseclusters-users] cluster stopping, how to show that knees really exist

ted pedersen Sun, 07 Aug 2005 21:28:05 -0700

As you recall we have been looking at the value of criterion functions
as the value of k (the number of clusters) decreases down to 1. We pointed
out a few places where a "knee" might exist in the plot of such values,
and suggested that this might signal where a reasonable stopping point in
the clustering might be found.


Now, you might be wondering about this idea of finding a "knee" in a graph
of criterion scores versus number of clusters. You may think, for example,
that these sorts of knees will simply occur in any data as a consequence
of noise, or as a natural by-product of clustering.

But, in fact, you can convince yourself in a small way that they do not,
by clustering data that is itself all noise. If you found a knee in data
that has no discernible underlying pattern, then you would know that the
knee is just an artifact of data in general, and doesn't really signify
anything.

So, to test that hypothesis, I created a random vector file for cluto that
consisted of 321 rows and 1000 columns. The values in the "cells" of
this matrix were randomly generated, so our clustering algorithm should
not find anything meaningful (or at least we would hope that it
doesn't). When I ran the clustering for 10 down to 1 clusters using the
i1, i2, e1, h1 and h2 criterion functions, there was very little movement
at all in the criterion function scores. There are no knees here, for
sure. This suggests that the "knees" we saw in the Mexico-Brazil data
earlier might in fact correspond to something real, since no such knees
were found in data that was really just random noise.
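
For anyone who wants to repeat the experiment, here is a minimal sketch of
how such a random matrix could be generated. It is not the script I used.
It assumes a dense matrix layout (a header line giving the number of rows
and columns, then one row of values per line) and uniform random values;
the function name, the seed and the output file name are all hypothetical.

```python
import random

def write_random_dense_matrix(path, n_rows=321, n_cols=1000, seed=0):
    """Write an n_rows x n_cols matrix of uniform random values to `path`.

    Assumed layout: a header line "n_rows n_cols", then one row per line,
    with values separated by single spaces.
    """
    rng = random.Random(seed)  # fixed seed so the file can be regenerated
    with open(path, "w") as out:
        out.write(f"{n_rows} {n_cols}\n")
        for _ in range(n_rows):
            row = (f"{rng.random():.4f}" for _ in range(n_cols))
            out.write(" ".join(row) + "\n")

# Example (hypothetical file name):
#   write_random_dense_matrix("random.vectors")
```

Any other way of filling the cells with noise should serve the same
purpose, as long as the rows carry no underlying pattern.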

So, if you plot the criterion function values versus k as shown below,
you would get very gradual, nearly linear plots, with no sudden changes
in the values. This suggests that in this randomized data there is no
knee point, and there is no reasonable way to create clusters (which
makes sense intuitively, since the data was generated randomly).

ref.i1.output:1-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:2-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:3-way clustering: [I1=2.42e+02] [321 of 321]
ref.i1.output:4-way clustering: [I1=2.42e+02] [321 of 321]
ref.i1.output:5-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:6-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:7-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:8-way clustering: [I1=2.44e+02] [321 of 321]
ref.i1.output:9-way clustering: [I1=2.44e+02] [321 of 321]
ref.i1.output:10-way clustering: [I1=2.44e+02] [321 of 321]

ref.i2.output:1-way clustering: [I2=2.78e+02] [321 of 321]
ref.i2.output:2-way clustering: [I2=2.78e+02] [321 of 321]
ref.i2.output:3-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:4-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:5-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:6-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:7-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:8-way clustering: [I2=2.80e+02] [321 of 321]
ref.i2.output:9-way clustering: [I2=2.80e+02] [321 of 321]
ref.i2.output:10-way clustering: [I2=2.80e+02] [321 of 321]

ref.e1.output:1-way clustering: [E1=8.93e+04] [321 of 321]
ref.e1.output:2-way clustering: [E1=8.92e+04] [321 of 321]
ref.e1.output:3-way clustering: [E1=8.91e+04] [321 of 321]
ref.e1.output:4-way clustering: [E1=8.90e+04] [321 of 321]
ref.e1.output:5-way clustering: [E1=8.89e+04] [321 of 321]
ref.e1.output:6-way clustering: [E1=8.89e+04] [321 of 321]
ref.e1.output:7-way clustering: [E1=8.88e+04] [321 of 321]
ref.e1.output:8-way clustering: [E1=8.88e+04] [321 of 321]
ref.e1.output:9-way clustering: [E1=8.87e+04] [321 of 321]
ref.e1.output:10-way clustering: [E1=8.86e+04] [321 of 321]

ref.h1.output:1-way clustering: [H1=2.70e-03] [321 of 321]
ref.h1.output:2-way clustering: [H1=2.71e-03] [321 of 321]
ref.h1.output:3-way clustering: [H1=2.71e-03] [321 of 321]
ref.h1.output:4-way clustering: [H1=2.72e-03] [321 of 321]
ref.h1.output:5-way clustering: [H1=2.73e-03] [321 of 321]
ref.h1.output:6-way clustering: [H1=2.73e-03] [321 of 321]
ref.h1.output:7-way clustering: [H1=2.74e-03] [321 of 321]
ref.h1.output:8-way clustering: [H1=2.74e-03] [321 of 321]
ref.h1.output:9-way clustering: [H1=2.75e-03] [321 of 321]
ref.h1.output:10-way clustering: [H1=2.76e-03] [321 of 321]

ref.h2.output:1-way clustering: [H2=3.12e-03] [321 of 321]
ref.h2.output:2-way clustering: [H2=3.12e-03] [321 of 321]
ref.h2.output:3-way clustering: [H2=3.13e-03] [321 of 321]
ref.h2.output:4-way clustering: [H2=3.13e-03] [321 of 321]
ref.h2.output:5-way clustering: [H2=3.14e-03] [321 of 321]
ref.h2.output:6-way clustering: [H2=3.14e-03] [321 of 321]
ref.h2.output:7-way clustering: [H2=3.15e-03] [321 of 321]
ref.h2.output:8-way clustering: [H2=3.15e-03] [321 of 321]
ref.h2.output:9-way clustering: [H2=3.16e-03] [321 of 321]
ref.h2.output:10-way clustering: [H2=3.16e-03] [321 of 321]
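
If you would rather check the flatness numerically than by eye, the
following is a rough sketch that parses lines in the format shown above
and reports the largest relative change between consecutive values of k
for each criterion function. The helper names are hypothetical, and the
"largest relative step" measure is just one simple way to look for a
sudden jump, not a definitive knee test.

```python
import re
from collections import defaultdict

# Matches e.g. "3-way clustering: [I1=2.42e+02]"
LINE = re.compile(r"(\d+)-way clustering: \[(\w+)=([0-9.eE+-]+)\]")

def scores_by_criterion(text):
    """Return {criterion: {k: score}} parsed from lines like those above."""
    table = defaultdict(dict)
    for match in LINE.finditer(text):
        k, name, value = match.groups()
        table[name][int(k)] = float(value)
    return dict(table)

def max_relative_step(scores):
    """Largest |score(k+1) - score(k)| / |score(k)| over consecutive k."""
    ks = sorted(scores)
    return max(abs(scores[b] - scores[a]) / abs(scores[a])
               for a, b in zip(ks, ks[1:]))

sample = """\
ref.i1.output:1-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:2-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:3-way clustering: [I1=2.42e+02] [321 of 321]
"""
for name, scores in scores_by_criterion(sample).items():
    print(name, f"{max_relative_step(scores):.4f}")
```

On the printed values above, every step for every criterion function
works out to well under one percent, which is what "no knee" looks like
in numbers.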

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

