As you recall, we have been looking at how the criterion function values change as k (the number of clusters) decreases down to 1. We pointed out a few places where a "knee" might exist in the plot of those values, and suggested that a knee might signal a reasonable stopping point for the clustering.
Now, you might be wondering about this idea of finding a "knee" in a graph of criterion scores versus number of clusters. You may think, for example, that these sorts of knees will simply occur in any data as a consequence of noise, or as a natural by-product of clustering. But, in fact, you can convince yourself in a small way that they do not, by clustering data that is itself all noise. If you find a knee in data that has no discernible underlying pattern, then you know that the knee is just an artifact of data in general, and doesn't really signify anything.
So, to test that hypothesis, I created a random vector file for CLUTO that consisted of 321 rows and 1000 columns. The values in the "cells" of this matrix were randomly generated, so our clustering algorithm probably won't find anything meaningful (or at least we would hope that it doesn't). I then ran the clustering for 10 down to 1 clusters using the i1, i2, e1, h1 and h2 criterion functions. Notice in the output below that there is very little movement at all in the criterion function scores. There are no knees here, for sure.
If you plot the criterion function scores versus k from the output below, you get very gradual, nearly linear plots, with no sudden changes in the values. In this randomized data there is no knee point, and there is no reasonable way to create clusters (which makes sense intuitively, since the data was generated randomly). This suggests, then, that the "knees" we saw in the Mexico-Brazil data earlier might in fact correspond to something real, since no such knees were found in data that was really just random noise.
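If you would rather check the "no knee" reading numerically than by eye, here is a small Python sketch. It is not part of SenseClusters or CLUTO. It just takes the criterion scores reported in this message (typed in by hand) and reports, for each criterion function, the largest relative change between consecutive values of k. The helper name max_relative_step is my own invention for illustration, and using the largest relative step as a rough stand-in for a knee is an assumption, not an established test.

```python
# Criterion scores for k = 1..10, copied by hand from the output in this message.
scores = {
    "I1": [2.41e+02, 2.41e+02, 2.42e+02, 2.42e+02, 2.43e+02,
           2.43e+02, 2.43e+02, 2.44e+02, 2.44e+02, 2.44e+02],
    "I2": [2.78e+02, 2.78e+02, 2.79e+02, 2.79e+02, 2.79e+02,
           2.79e+02, 2.79e+02, 2.80e+02, 2.80e+02, 2.80e+02],
    "E1": [8.93e+04, 8.92e+04, 8.91e+04, 8.90e+04, 8.89e+04,
           8.89e+04, 8.88e+04, 8.88e+04, 8.87e+04, 8.86e+04],
    "H1": [2.70e-03, 2.71e-03, 2.71e-03, 2.72e-03, 2.73e-03,
           2.73e-03, 2.74e-03, 2.74e-03, 2.75e-03, 2.76e-03],
    "H2": [3.12e-03, 3.12e-03, 3.13e-03, 3.13e-03, 3.14e-03,
           3.14e-03, 3.15e-03, 3.15e-03, 3.16e-03, 3.16e-03],
}


def max_relative_step(values):
    """Return the largest relative change between scores at consecutive k.

    Hypothetical helper: a sharp knee would show up as one step that is
    much larger than the others; flat data gives uniformly tiny steps.
    """
    return max(abs(b - a) / abs(a) for a, b in zip(values, values[1:]))


for name, values in scores.items():
    print(f"{name}: largest step between consecutive k = "
          f"{max_relative_step(values):.2%}")
```

For the random data, every one of these steps comes out under half a percent, which matches the impression that the curves are essentially flat. To compare against real data, you would paste in the scores from that run and look for a single step that stands well apart from the rest.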
ref.i1.output:1-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:2-way clustering: [I1=2.41e+02] [321 of 321]
ref.i1.output:3-way clustering: [I1=2.42e+02] [321 of 321]
ref.i1.output:4-way clustering: [I1=2.42e+02] [321 of 321]
ref.i1.output:5-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:6-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:7-way clustering: [I1=2.43e+02] [321 of 321]
ref.i1.output:8-way clustering: [I1=2.44e+02] [321 of 321]
ref.i1.output:9-way clustering: [I1=2.44e+02] [321 of 321]
ref.i1.output:10-way clustering: [I1=2.44e+02] [321 of 321]

ref.i2.output:1-way clustering: [I2=2.78e+02] [321 of 321]
ref.i2.output:2-way clustering: [I2=2.78e+02] [321 of 321]
ref.i2.output:3-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:4-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:5-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:6-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:7-way clustering: [I2=2.79e+02] [321 of 321]
ref.i2.output:8-way clustering: [I2=2.80e+02] [321 of 321]
ref.i2.output:9-way clustering: [I2=2.80e+02] [321 of 321]
ref.i2.output:10-way clustering: [I2=2.80e+02] [321 of 321]

ref.e1.output:1-way clustering: [E1=8.93e+04] [321 of 321]
ref.e1.output:2-way clustering: [E1=8.92e+04] [321 of 321]
ref.e1.output:3-way clustering: [E1=8.91e+04] [321 of 321]
ref.e1.output:4-way clustering: [E1=8.90e+04] [321 of 321]
ref.e1.output:5-way clustering: [E1=8.89e+04] [321 of 321]
ref.e1.output:6-way clustering: [E1=8.89e+04] [321 of 321]
ref.e1.output:7-way clustering: [E1=8.88e+04] [321 of 321]
ref.e1.output:8-way clustering: [E1=8.88e+04] [321 of 321]
ref.e1.output:9-way clustering: [E1=8.87e+04] [321 of 321]
ref.e1.output:10-way clustering: [E1=8.86e+04] [321 of 321]

ref.h1.output:1-way clustering: [H1=2.70e-03] [321 of 321]
ref.h1.output:2-way clustering: [H1=2.71e-03] [321 of 321]
ref.h1.output:3-way clustering: [H1=2.71e-03] [321 of 321]
ref.h1.output:4-way clustering: [H1=2.72e-03] [321 of 321]
ref.h1.output:5-way clustering: [H1=2.73e-03] [321 of 321]
ref.h1.output:6-way clustering: [H1=2.73e-03] [321 of 321]
ref.h1.output:7-way clustering: [H1=2.74e-03] [321 of 321]
ref.h1.output:8-way clustering: [H1=2.74e-03] [321 of 321]
ref.h1.output:9-way clustering: [H1=2.75e-03] [321 of 321]
ref.h1.output:10-way clustering: [H1=2.76e-03] [321 of 321]

ref.h2.output:1-way clustering: [H2=3.12e-03] [321 of 321]
ref.h2.output:2-way clustering: [H2=3.12e-03] [321 of 321]
ref.h2.output:3-way clustering: [H2=3.13e-03] [321 of 321]
ref.h2.output:4-way clustering: [H2=3.13e-03] [321 of 321]
ref.h2.output:5-way clustering: [H2=3.14e-03] [321 of 321]
ref.h2.output:6-way clustering: [H2=3.14e-03] [321 of 321]
ref.h2.output:7-way clustering: [H2=3.15e-03] [321 of 321]
ref.h2.output:8-way clustering: [H2=3.15e-03] [321 of 321]
ref.h2.output:9-way clustering: [H2=3.16e-03] [321 of 321]
ref.h2.output:10-way clustering: [H2=3.16e-03] [321 of 321]

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
