Now that we have introduced some of the criterion functions that are
supported in Cluto, it is time to consider how they can be interpreted,
with a particular eye on the issue of automatically stopping the
clustering at some optimal number of clusters (rather than having to
specify the value ahead of time).
The basic idea is to look at clustering solutions for a range of k values,
and see if the value of the criterion function in use gives any hints as
to what would be a good stopping point. For example, we might look for a
k value at which the similarity scores drop sharply. This is sometimes
known as looking for the knee or elbow in the graph of the criterion
function versus the k value.
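As a minimal sketch of that idea (the scores below are invented for
illustration, not taken from any real Cluto run), one simple way to
locate a knee is to compare each solution's score with the solution that
has one fewer cluster, and flag the k where the change is largest:

```python
# Sketch of simple knee/elbow detection: find the k whose criterion
# score differs most from the solution with k-1 clusters.
# The scores used here are hypothetical, for illustration only.

def find_knee(scores):
    """scores maps a number of clusters k to its criterion value.
    Returns the k with the largest jump from the k-1 cluster solution."""
    ks = sorted(scores)
    # difference between each solution and the one with one fewer cluster
    jumps = {k: abs(scores[k] - scores[k - 1]) for k in ks[1:]}
    return max(jumps, key=jumps.get)

# hypothetical criterion scores for k = 1 .. 6
scores = {1: 100.0, 2: 130.0, 3: 140.0, 4: 143.0, 5: 145.0, 6: 146.0}
print(find_knee(scores))  # -> 2 (the largest jump is from k=1 to k=2)
```

This is only one of many ways to formalize "knee"; eyeballing the trend,
as we do below, is often just as informative.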
Now, agglomerative algorithms proceed naturally from N to 1, so it is
possible to look at the criterion function score at each step of a single
clustering run (and that is what we shall do, in due course). The
situation is different for partitional algorithms (like rb), where the
algorithm starts with the given number of clusters and iterates through
solutions with that number of clusters, trying to optimize the criterion
function. In that case, we don't naturally get a criterion function score
for each possible value of k, but we can easily obtain them by simply
running the partitional algorithm for each value of k that we might
consider reasonable. Or, if we are truly puzzled, we can run the
partitional algorithm for every k from N down to 1 (with an N of a few
hundred, Cluto is fast enough that this is indeed tractable).
What I show below are the I2, E1, and H2 criterion function values for
solutions to the Mexico-Brazil data that has been under consideration
of late. In that case there are 2 true clusters, although it must be
admitted the data is pretty noisy.
Remember that I2 is the flower criterion, and that we want to form
clusters where each context in a cluster is as close to the centroid as
possible. (The contexts are connected only to the centroid of their
cluster, forming what looks like a flower. If there are 3 clusters, then
there are 3 flowers.) The objective is to have as large a score as
possible (meaning that the members of the clusters are as similar to each
other as they can be). However, we'll see that blindly using that as a
criterion for cluster stopping will lead to heartache!
Here are I2 scores for solutions of 1 to 20 clusters. Note that the
scores decrease monotonically from 20 down to 1. At 20 the score is 186
and at 1 it is 99.3. So does this mean that the solution with 20 clusters
is the best? No, it does not. It shows, in fact, that blind use of
intra-cluster similarity as a stopping criterion will cause trouble: the
best possible intra-cluster similarity is achieved when there is one
context per cluster (k = N), since then the members of each cluster are
as similar as they can be (each context is identical to itself).
zori3.i2.output:1-way clustering: [I2=9.93e+01] [321 of 321]
zori3.i2.output:2-way clustering: [I2=1.29e+02] [321 of 321]
zori3.i2.output:3-way clustering: [I2=1.39e+02] [321 of 321]
zori3.i2.output:4-way clustering: [I2=1.45e+02] [321 of 321]
zori3.i2.output:5-way clustering: [I2=1.49e+02] [321 of 321]
zori3.i2.output:6-way clustering: [I2=1.53e+02] [321 of 321]
zori3.i2.output:7-way clustering: [I2=1.56e+02] [321 of 321]
zori3.i2.output:8-way clustering: [I2=1.60e+02] [321 of 321]
zori3.i2.output:9-way clustering: [I2=1.63e+02] [321 of 321]
zori3.i2.output:10-way clustering: [I2=1.66e+02] [321 of 321]
zori3.i2.output:11-way clustering: [I2=1.68e+02] [321 of 321]
zori3.i2.output:12-way clustering: [I2=1.71e+02] [321 of 321]
zori3.i2.output:13-way clustering: [I2=1.73e+02] [321 of 321]
zori3.i2.output:14-way clustering: [I2=1.75e+02] [321 of 321]
zori3.i2.output:15-way clustering: [I2=1.77e+02] [321 of 321]
zori3.i2.output:16-way clustering: [I2=1.79e+02] [321 of 321]
zori3.i2.output:17-way clustering: [I2=1.81e+02] [321 of 321]
zori3.i2.output:18-way clustering: [I2=1.83e+02] [321 of 321]
zori3.i2.output:19-way clustering: [I2=1.85e+02] [321 of 321]
zori3.i2.output:20-way clustering: [I2=1.86e+02] [321 of 321]
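To make the dip easier to see, here is a small Python sketch (the scores
are transcribed from the vcluster output above) that prints how much I2
drops as we go from k clusters down to k-1:

```python
# I2 scores transcribed from the vcluster output above, for k = 1 .. 20
i2 = [99.3, 129.0, 139.0, 145.0, 149.0, 153.0, 156.0, 160.0, 163.0, 166.0,
      168.0, 171.0, 173.0, 175.0, 177.0, 179.0, 181.0, 183.0, 185.0, 186.0]

# how much I2 drops when going from k clusters down to k-1 clusters
for k in range(len(i2), 1, -1):
    drop = i2[k - 1] - i2[k - 2]
    print(f"{k:2d} -> {k - 1:2d}: I2 drops by {drop:.1f}")
```

The drops stay small (1 to 4 points) all the way down from 20, then grow
to 6 going from 4 to 3, 10 going from 3 to 2, and nearly 30 going from
2 to 1 -- which is exactly the sharper dip described below.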
So what do we do? Well, observe the trend of the I2 values as you go
from 20 to 1. From 20 to 4 or 3 there is a very gradual decline. The
clusters are getting bigger, and the members less similar to each other,
but not dramatically so. Then, suddenly, around 3 or so, there is a
sharper dip in the similarity scores. Is that a knee point? Perhaps.
Now, remember that I2 only considers intra-cluster similarity.
Inter-cluster (between cluster) similarity is also useful, and here we
show E1 as an example. Remember that with E1 we want to achieve as much
separation as possible between the centroid of each cluster and the
centroid of all the contexts. In the case of cosine, we want to minimize
the cosine score, and thereby increase the angle (the separation) between
the centroids of the clusters and the centroid of the collection.
So here, lower scores are better.
zori3.e1.output:1-way clustering: [E1=3.19e+04] [321 of 321]
zori3.e1.output:2-way clustering: [E1=2.43e+04] [321 of 321]
zori3.e1.output:3-way clustering: [E1=2.17e+04] [321 of 321]
zori3.e1.output:4-way clustering: [E1=2.07e+04] [321 of 321]
zori3.e1.output:5-way clustering: [E1=1.99e+04] [321 of 321]
zori3.e1.output:6-way clustering: [E1=1.93e+04] [321 of 321]
zori3.e1.output:7-way clustering: [E1=1.88e+04] [321 of 321]
zori3.e1.output:8-way clustering: [E1=1.84e+04] [321 of 321]
zori3.e1.output:9-way clustering: [E1=1.81e+04] [321 of 321]
zori3.e1.output:10-way clustering: [E1=1.78e+04] [321 of 321]
zori3.e1.output:11-way clustering: [E1=1.75e+04] [321 of 321]
zori3.e1.output:12-way clustering: [E1=1.72e+04] [321 of 321]
zori3.e1.output:13-way clustering: [E1=1.70e+04] [321 of 321]
zori3.e1.output:14-way clustering: [E1=1.68e+04] [321 of 321]
zori3.e1.output:15-way clustering: [E1=1.66e+04] [321 of 321]
zori3.e1.output:16-way clustering: [E1=1.64e+04] [321 of 321]
zori3.e1.output:17-way clustering: [E1=1.62e+04] [321 of 321]
zori3.e1.output:18-way clustering: [E1=1.60e+04] [321 of 321]
zori3.e1.output:19-way clustering: [E1=1.59e+04] [321 of 321]
zori3.e1.output:20-way clustering: [E1=1.58e+04] [321 of 321]
And look, the lowest score is found at 20! 15,800 at 20, and 31,900 at 1.
Again, this makes sense intuitively. When there is just one cluster,
its centroid will be essentially the same as the centroid of the
collection, meaning that the cosine value is high and the separation is
low. When there are 20 clusters, their centroids will generally be more
different from the collection centroid than when there is a smaller
number of clusters. So, again, do we want to choose the solution with 20
clusters? No, we generally do not. This is why we want to look at the
trends in the I2 and E1 values as we go through successive clustering
solutions. For E1, notice again that the scores change only gradually
from 20 down to around 4 or 3, and then there is a more dramatic
increase. Is this a knee point? Perhaps...
Now, remember that H2 measures the ratio of intra-cluster to
inter-cluster similarity (I2/E1). We want to maximize this score, that
is, to have the most similar intra-cluster measurement (I2) over the
most separated inter-cluster measurement (E1).
zori3.h2.output:1-way clustering: [H2=3.12e-03] [321 of 321]
zori3.h2.output:2-way clustering: [H2=5.30e-03] [321 of 321]
zori3.h2.output:3-way clustering: [H2=6.42e-03] [321 of 321]
zori3.h2.output:4-way clustering: [H2=6.99e-03] [321 of 321]
zori3.h2.output:5-way clustering: [H2=7.46e-03] [321 of 321]
zori3.h2.output:6-way clustering: [H2=7.88e-03] [321 of 321]
zori3.h2.output:7-way clustering: [H2=8.29e-03] [321 of 321]
zori3.h2.output:8-way clustering: [H2=8.64e-03] [321 of 321]
zori3.h2.output:9-way clustering: [H2=8.97e-03] [321 of 321]
zori3.h2.output:10-way clustering: [H2=9.29e-03] [321 of 321]
zori3.h2.output:11-way clustering: [H2=9.59e-03] [321 of 321]
zori3.h2.output:12-way clustering: [H2=9.89e-03] [321 of 321]
zori3.h2.output:13-way clustering: [H2=1.01e-02] [321 of 321]
zori3.h2.output:14-way clustering: [H2=1.04e-02] [321 of 321]
zori3.h2.output:15-way clustering: [H2=1.07e-02] [321 of 321]
zori3.h2.output:16-way clustering: [H2=1.09e-02] [321 of 321]
zori3.h2.output:17-way clustering: [H2=1.12e-02] [321 of 321]
zori3.h2.output:18-way clustering: [H2=1.14e-02] [321 of 321]
zori3.h2.output:19-way clustering: [H2=1.16e-02] [321 of 321]
zori3.h2.output:20-way clustering: [H2=1.18e-02] [321 of 321]
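Since H2 is defined as I2/E1, the three outputs can be sanity-checked
against each other. A quick Python check using the k = 1 and k = 20
values transcribed from the runs above (the small mismatches are just
rounding, since vcluster reports three significant figures):

```python
# Check that H2 is (approximately) I2 / E1, using values transcribed
# from the vcluster output above for k = 1 and k = 20.
i2 = {1: 99.3, 20: 186.0}
e1 = {1: 3.19e4, 20: 1.58e4}
h2 = {1: 3.12e-3, 20: 1.18e-2}

for k in (1, 20):
    ratio = i2[k] / e1[k]
    print(f"k={k}: I2/E1 = {ratio:.3e}, reported H2 = {h2[k]:.2e}")
```

The ratios come out to about 3.11e-03 and 1.18e-02, matching the
reported H2 values to within rounding.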
Now, again note that the largest value is found at 20 clusters, .0118,
and the smallest at 1 cluster, .00312. Again, note that the change from
largest to smallest is gradual, except that around 4 or 3 clusters the
decrease accelerates a bit.
This is meant as food for thought, and as at least a small example of
what can be obtained from Cluto that might be useful in stopping
clustering automatically.
Oh, how did I get the scores above? I ran the little script below.
With 321 contexts to cluster, it took a few minutes to complete the 63
clustering runs (21 values of k for each of the three criterion
functions). The data being clustered was picked somewhat randomly
from the submissions of the First Transylvanian bake-off.
===========================================
#!/bin/csh

# cluster the contexts in zori3.vectors into k = 1 .. 21 clusters,
# once for each of the i2, e1, and h2 criterion functions

set prefix = zori3
set maxk = 21
set seed = 1000

# clear out any output from a previous run
rm -f $prefix.*.output

foreach crfun (i2 e1 h2)
    set k = 1
    while ($k <= $maxk)
        vcluster $prefix.vectors $k -crfun $crfun -seed $seed >> $prefix.$crfun.output
        @ k = $k + 1
    end
end

# pull the criterion function scores out into a single file
grep "way clustering" $prefix.*.output > $prefix.criterion
===============================
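If you'd rather not read the criterion file by eye, here is a small
sketch of parsing the lines that the grep at the end of the script
produces (the function and sample line are mine, not part of Cluto):

```python
import re

# Parse lines in the format grep writes to $prefix.criterion, e.g.
#   zori3.i2.output:2-way clustering: [I2=1.29e+02] [321 of 321]
LINE_RE = re.compile(r'(\d+)-way clustering: \[(\w+)=([0-9.e+-]+)\]')

def parse_criterion_line(line):
    """Return (k, criterion name, score) from one grep output line,
    or None if the line does not match the expected format."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    k, crfun, score = m.groups()
    return int(k), crfun, float(score)

sample = "zori3.i2.output:2-way clustering: [I2=1.29e+02] [321 of 321]"
print(parse_criterion_line(sample))  # -> (2, 'I2', 129.0)
```

From there it is easy to collect the (k, score) pairs per criterion
function and plot them, or feed them to a knee-detection heuristic.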
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users