The CH method of cluster stopping was mentioned briefly on this list, and
now it is time to go into a few more details, in particular how we might
implement something like this in cluto/SenseClusters.

The CH method is originally described here:

Calinski and Harabasz (1974)
A Dendrite Method for Cluster Analysis
Communications in Statistics, 3(1), 1974, 1-27

The idea behind the CH measure is to compute the between-cluster sum of
squared errors (distances), which reflects how spread out the k cluster
centroids are around the overall centroid, and compare that to the
within-cluster sum of squared errors for the k clusters (taking their
individual squared error terms and summing them, a pooled value so to
speak). In effect, this is a measure of inter-cluster (dis)similarity
over intra-cluster (dis)similarity.

 B(k) - between cluster sum of squares
 W(k) - within cluster sum of squares

         B(k)/(k-1)
 CH(k) = ----------
         W(k)/(N-k)

 CH(k) = B(k)   (N-k)
         ---- * ----
         W(k)   (k-1)

[Note that this (N-k)/(k-1) term is rather similar to the Hartigan term
of (N-k-1).]
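The formula above is simple enough to sketch directly. Here is a minimal
Python version, assuming B(k) and W(k) have already been computed as sums
of squares (the function name and arguments are just illustrative):

```python
def ch(b_k, w_k, n, k):
    """CH(k) = (B(k) / (k - 1)) / (W(k) / (N - k)).

    b_k : between-cluster sum of squares for the k-cluster solution
    w_k : pooled within-cluster sum of squares
    n   : total number of items clustered
    """
    if k <= 1:
        return None  # the (k - 1) divisor makes k = 1 undefined
    return (b_k / (k - 1)) / (w_k / (n - k))
```

Note that k = 1 has to be special-cased, a point that comes up again with
the cluto adaptation below.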

Now, if B(k) and W(k) are both measures of difference/dissimilarity, then
a larger CH value indicates a better clustering, since the between
cluster difference should be high, and the within cluster difference
should be low. This is how the original paper presents the method.

However, if they are measures of similarity, then a lower value is better
since the inter-cluster similarity should be low and the intra-cluster
similarity high. So, if we are using similarity values, as k gets larger
CH(k) will tend to get smaller, because inter-cluster similarity gets
smaller and intra-cluster similarity gets higher as k gets larger. This is
the reverse of what happens in the original formulation, but it is
equivalent in meaning.

As has been the case for our previous methods of cluster stopping, it is
not the actual score that really matters but the trend in those values,
and in particular the rate of change from one k to the next. A "faster"
change suggests that we might have reached a knee in the graph of the
criterion function versus the number of clusters, and that we would want
to stop at that value.
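One crude way to operationalize that "fastest change" idea (a sketch for
discussion, not SenseClusters' actual stopping code) is to take the k
sitting just before the single largest drop in the criterion:

```python
def knee(scores):
    """Given {k: criterion value} for a decreasing criterion, return
    the k just before the largest drop -- a crude knee estimate."""
    ks = sorted(scores)
    # pair each adjacent (k, k+1) with the size of the drop between them
    drops = [(scores[a] - scores[b], a) for a, b in zip(ks, ks[1:])]
    return max(drops)[1]
```

With the CH values shown later in this message, the biggest drop is from
k=2 to k=3, so this sketch would stop at k=2, matching the 2 "true"
clusters in that data.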

Since Cluto provides us with similarity measures, we adapt CH to Cluto
in the following way.

CH(k) = inter-cluster similarity   number of contexts - number of clusters
        ------------------------ * ---------------------------------------
        intra-cluster similarity          number of clusters - 1

where we can define the above as:

 CH(k) = E1/I1 * (N-k)/(k-1)

Now, let's recall that I1 is the ball of string, where we are looking
to maximize the pairwise similarities among the contexts in a cluster, and
E1 wants to minimize the squared error between the centroids of the
clusters and the centroid of the collection. So in the above
reformulation of CH, we want to minimize the numerator and maximize the
denominator. Note that the E1/I1 term is simply 1/H1, so we can compute
this quite simply:

 CH(k) = 1/H1 * (N-k)/(k-1)

Now, since H1 is itself a measure of intra over inter cluster similarity,
I also find it interesting to simply look at:

nCH(k) = H1 * (N-k)/(k-1)

and use that as a stopping rule as well.
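Under those definitions both scores are one-liners over cluto's reported
H1 value; a sketch (h1 is the value cluto prints, n the number of
contexts):

```python
def ch_k(h1, n, k):
    """CH(k) = (1/H1) * (N - k)/(k - 1); undefined at k = 1."""
    if k <= 1:
        return None
    return (1.0 / h1) * (n - k) / (k - 1)

def nch_k(h1, n, k):
    """nCH(k) = H1 * (N - k)/(k - 1); undefined at k = 1."""
    if k <= 1:
        return None
    return h1 * (n - k) / (k - 1)
```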

Here's an example. You see that both of the values defined above start
larger and decrease monotonically as k increases. This is the same data
as used in previous examples, that is, the Mexico-Brazil data, which has
2 "true" clusters.

N is 321, and k varies from 1 to 10.                        R = (N-k)/(k-1)
=========================================================================

zori3.e1.output:1-way clustering: [E1=3.19e+04] [321 of 321]    320/0 = undef
zori3.i1.output:1-way clustering: [I1=3.07e+01] [321 of 321]

1-way clustering: [H1=9.64e-04] [321 of 321] 1/H1 = 1037.34

CH(1) = 1/H1 * R = undef
nCH(1) = H1 * R = undef
==========================================================================

zori3.e1.output:2-way clustering: [E1=2.43e+04] [321 of 321]    319/1 = 319
zori3.i1.output:2-way clustering: [I1=7.08e+01] [321 of 321]

2-way clustering: [H1=2.91e-03] [321 of 321] 1/H1 = 343.64

CH(2) = 1/H1 * R = 109621.16
nCH(2) = H1 * R = .93
==========================================================================

zori3.e1.output:3-way clustering: [E1=2.17e+04] [321 of 321]    318/2 = 159
zori3.i1.output:3-way clustering: [I1=7.67e+01] [321 of 321]

3-way clustering: [H1=3.53e-03] [321 of 321]  1/H1 = 282.29

CH(3) = 1/H1 * R = 44884.11
nCH(3) = H1 * R = .56
==========================================================================

zori3.e1.output:4-way clustering: [E1=2.07e+04] [321 of 321]    317/3 = 105.7
zori3.i1.output:4-way clustering: [I1=8.18e+01] [321 of 321]

4-way clustering: [H1=3.87e-03] [321 of 321]  1/H1 = 258.40

CH(4) = 1/H1 * R = 27312.88
nCH(4) = H1 * R = .41
==========================================================================

zori3.e1.output:5-way clustering: [E1=1.99e+04] [321 of 321]    316/4 = 79
zori3.i1.output:5-way clustering: [I1=8.56e+01] [321 of 321]

5-way clustering: [H1=4.17e-03] [321 of 321]  1/H1 = 239.81

CH(5) = 1/H1 * R = 18944.99
nCH(5) = H1 * R = .33
==========================================================================

zori3.e1.output:6-way clustering: [E1=1.93e+04] [321 of 321]    315/5 = 63
zori3.i1.output:6-way clustering: [I1=8.90e+01] [321 of 321]

6-way clustering: [H1=4.47e-03] [321 of 321]  1/H1 = 223.71

CH(6) = 1/H1 * R = 14093.73
nCH(6) = H1 * R = .28
==========================================================================

zori3.e1.output:7-way clustering: [E1=1.88e+04] [321 of 321]    314/6 = 52.3
zori3.i1.output:7-way clustering: [I1=9.17e+01] [321 of 321]

7-way clustering: [H1=4.73e-03] [321 of 321]  1/H1 = 211.42

CH(7) = 1/H1 * R = 11057.26
nCH(7) = H1 * R = .25
==========================================================================

zori3.e1.output:8-way clustering: [E1=1.84e+04] [321 of 321]    313/7 = 44.7
zori3.i1.output:8-way clustering: [I1=9.42e+01] [321 of 321]

8-way clustering: [H1=4.99e-03] [321 of 321]  1/H1 = 200.40

CH(8) = 1/H1 * R = 8957.88
nCH(8) = H1 * R = .22
==========================================================================

zori3.e1.output:9-way clustering: [E1=1.80e+04] [321 of 321]    312/8 = 39
zori3.i1.output:9-way clustering: [I1=9.66e+01] [321 of 321]

9-way clustering: [H1=5.23e-03] [321 of 321]  1/H1 = 191.20

CH(9) = 1/H1 * R = 7456.9
nCH(9) = H1 * R = .20
==========================================================================

zori3.e1.output:10-way clustering: [E1=1.78e+04] [321 of 321]   311/9 = 34.56
zori3.i1.output:10-way clustering: [I1=9.90e+01] [321 of 321]

10-way clustering: [H1=5.47e-03] [321 of 321] 1/H1 = 182.82

CH(10) = 1/H1 * R = 6318.26
nCH(10) = H1 * R = .18
==========================================================================

Now, to summarize the two scores...

Notice that in both cases, there is a rather dramatic drop from 2 to 3,
and then again (slightly less so) from 3 to 4. And of course, notice that
k=1 is undefined.

 CH(1) = 1/H1 * R = undef
 CH(2) = 1/H1 * R = 109621.16
 CH(3) = 1/H1 * R =  44884.11
 CH(4) = 1/H1 * R =  27312.88
 CH(5) = 1/H1 * R =  18944.99
 CH(6) = 1/H1 * R =  14093.73
 CH(7) = 1/H1 * R =  11057.26
 CH(8) = 1/H1 * R =   8957.88
 CH(9) = 1/H1 * R =   7456.9
CH(10) = 1/H1 * R =   6318.26

nCH(1) = H1 * R = undef
nCH(2) = H1 * R = .93
nCH(3) = H1 * R = .56
nCH(4) = H1 * R = .41
nCH(5) = H1 * R = .33
nCH(6) = H1 * R = .28
nCH(7) = H1 * R = .25
nCH(8) = H1 * R = .22
nCH(9) = H1 * R = .20
nCH(10) = H1 * R = .18

Both the CH and nCH measures seem to have a fairly nice "knee" in them,
with the caveat of course that you won't ever see a knee at k=2 since
that is where the values start. If you simply look at the H1 values or
the 1/H1 values, in fact you also see a nice knee:

1-way clustering: [H1=9.64e-04] [321 of 321]  1/H1 = 1037.34
2-way clustering: [H1=2.91e-03] [321 of 321]  1/H1 = 343.64
3-way clustering: [H1=3.53e-03] [321 of 321]  1/H1 = 282.29
4-way clustering: [H1=3.87e-03] [321 of 321]  1/H1 = 258.40
5-way clustering: [H1=4.17e-03] [321 of 321]  1/H1 = 239.81
6-way clustering: [H1=4.47e-03] [321 of 321]  1/H1 = 223.71
7-way clustering: [H1=4.73e-03] [321 of 321]  1/H1 = 211.42
8-way clustering: [H1=4.99e-03] [321 of 321]  1/H1 = 200.40
9-way clustering: [H1=5.23e-03] [321 of 321]  1/H1 = 191.20
10-way clustering: [H1=5.47e-03] [321 of 321] 1/H1 = 182.82
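For what it's worth, H1 values like those above can be scraped straight
out of cluto's report lines; a quick sketch (the line format is copied
from the output shown in this message):

```python
import re

# matches lines like "2-way clustering: [H1=2.91e-03] [321 of 321]"
H1_LINE = re.compile(r'(\d+)-way clustering: \[H1=([0-9.eE+-]+)\]')

def parse_h1(report):
    """Return {k: H1} for every k-way clustering line in the report."""
    return {int(k): float(h1) for k, h1 in H1_LINE.findall(report)}
```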

So, it's a bit unclear to me what effect the (N-k)/(k-1) term
is having, other than making it hard to deal with k=1. :)

In any case, those are some thoughts on how a CH-like measure could be
included in SenseClusters via information provided by cluto.

Please let me know if you have any comments!

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
