Here is an additional run of the Mexico-Brazil data, where we look at the I1, E1, and H1 values for 1 to 10 clusters. Remember that I1 is the ball of string: we find all the pairwise similarities among the contexts in a cluster, and hope to maximize those. Again, we'll note that the pairwise similarity is highest at 10 clusters, which makes sense if you think about it: if you have 100 objects and you divide them into 100 clusters, then each cluster is maximally similar, since each consists of a single object and its score is based on self-similarity alone. As you create larger clusters (fewer in number), it is inevitable that the objects in each cluster become somewhat less similar. And when everything is in 1 cluster, that is the case where pairwise similarities are minimized (or pairwise differences are maximized).
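The intuition above can be sketched in a few lines of NumPy. This is my own illustration, not CLUTO's code: the function name `i1_score` and the choice of cosine similarity are assumptions, but the structure (per-cluster sum of pairwise similarities, scaled by cluster size) follows the usual I1 definition:

```python
import numpy as np

def i1_score(X, labels):
    """Sketch of the I1 criterion: for each cluster, sum all pairwise
    cosine similarities among its members (self-pairs included), scale
    by 1/n_r, then add the per-cluster totals together."""
    # Normalize rows to unit length so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    total = 0.0
    for r in np.unique(labels):
        members = Xn[labels == r]
        total += (members @ members.T).sum() / len(members)
    return total

X = np.random.RandomState(0).randn(30, 5)
n = len(X)
# With every object in its own cluster, each cluster contributes its
# self-similarity of 1, so I1 hits its ceiling of n:
singletons = i1_score(X, np.arange(n))        # ~30.0
one_cluster = i1_score(X, np.zeros(n, dtype=int))  # much smaller
```

This reproduces the point made above: the score is maximal when every object sits alone in its own cluster, and drops as clusters are merged.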
So, for I1 we see a very dramatic dip when moving from 2 clusters to 1, from 70.8 to 30.7. When going from 3 clusters to 2, the dip was from 76.7 to 70.8. So, does this constitute a knee? Perhaps...

zori3.i1.output:1-way clustering: [I1=3.07e+01] [321 of 321]
zori3.i1.output:2-way clustering: [I1=7.08e+01] [321 of 321]
zori3.i1.output:3-way clustering: [I1=7.67e+01] [321 of 321]
zori3.i1.output:4-way clustering: [I1=8.18e+01] [321 of 321]
zori3.i1.output:5-way clustering: [I1=8.56e+01] [321 of 321]
zori3.i1.output:6-way clustering: [I1=8.90e+01] [321 of 321]
zori3.i1.output:7-way clustering: [I1=9.17e+01] [321 of 321]
zori3.i1.output:8-way clustering: [I1=9.42e+01] [321 of 321]
zori3.i1.output:9-way clustering: [I1=9.66e+01] [321 of 321]
zori3.i1.output:10-way clustering: [I1=9.90e+01] [321 of 321]

The E1 values are the same as in the previous run. I am thinking of E1 as the widely opened fan (a fan in the sense of what you hold in your hand to cool yourself). Now, here is where I need a word. Do you know those folding fans (Asian style, I believe) that are made of paper, with pieces of bamboo or something that hold the paper together and make the fan flexible? Well, what are those pieces called? They are key to my analogy! I will call them pieces, for lack of a better term, for now... The centroid of the collection is the middle "piece" of the fan, and the object is to stretch the fan out as much as possible, so that the other pieces are as far away from the centroid of the collection/fan as possible. Now, this is not a great analogy, because a fan is mostly two-dimensional, so imagine a three-dimensional fan. :) Anyway, I need to work on this visualization, I think. So...
zori3.e1.output:1-way clustering: [E1=3.19e+04] [321 of 321]
zori3.e1.output:2-way clustering: [E1=2.43e+04] [321 of 321]
zori3.e1.output:3-way clustering: [E1=2.17e+04] [321 of 321]
zori3.e1.output:4-way clustering: [E1=2.07e+04] [321 of 321]
zori3.e1.output:5-way clustering: [E1=1.99e+04] [321 of 321]
zori3.e1.output:6-way clustering: [E1=1.93e+04] [321 of 321]
zori3.e1.output:7-way clustering: [E1=1.88e+04] [321 of 321]
zori3.e1.output:8-way clustering: [E1=1.84e+04] [321 of 321]
zori3.e1.output:9-way clustering: [E1=1.80e+04] [321 of 321]
zori3.e1.output:10-way clustering: [E1=1.78e+04] [321 of 321]

Now, H1 is simply I1/E1. We are trying to find the clustering solution that has as tight a ball of string as possible over the most spread-out fan we can find. Note again that the maximum score is at 10 clusters, which follows the intuition previously described, and again points out that it's not the absolute score that matters, but rather the trend.

zori3.h1.output:1-way clustering: [H1=9.64e-04] [321 of 321]
zori3.h1.output:2-way clustering: [H1=2.91e-03] [321 of 321]
zori3.h1.output:3-way clustering: [H1=3.53e-03] [321 of 321]
zori3.h1.output:4-way clustering: [H1=3.87e-03] [321 of 321]
zori3.h1.output:5-way clustering: [H1=4.17e-03] [321 of 321]
zori3.h1.output:6-way clustering: [H1=4.47e-03] [321 of 321]
zori3.h1.output:7-way clustering: [H1=4.73e-03] [321 of 321]
zori3.h1.output:8-way clustering: [H1=4.99e-03] [321 of 321]
zori3.h1.output:9-way clustering: [H1=5.23e-03] [321 of 321]
zori3.h1.output:10-way clustering: [H1=5.47e-03] [321 of 321]

So when going from 2 clusters to 1 we go from .00291 to .000964, which is a sharp drop. From 3 to 2, by contrast, we go from .00353 to .00291. Is this a knee? Perhaps... So, it's not clear to me if I1 or I2 should be preferred.
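One way to make the knee judgment less eyeball-driven is to compare the successive gains in H1 as the number of clusters grows, using the H1 values reported above. This is just a sketch of one simple heuristic (pick the k with the largest single gain); it is not the only way to define a knee:

```python
# H1 values from the zori3.h1.output run above, keyed by number of clusters.
h1 = {1: 9.64e-04, 2: 2.91e-03, 3: 3.53e-03, 4: 3.87e-03, 5: 4.17e-03,
      6: 4.47e-03, 7: 4.73e-03, 8: 4.99e-03, 9: 5.23e-03, 10: 5.47e-03}

# gain[k] = how much H1 improves when moving from k-1 clusters to k.
gain = {k: h1[k] - h1[k - 1] for k in range(2, 11)}
knee = max(gain, key=gain.get)
# knee == 2: the gain from 1 to 2 clusters (~0.00195) is roughly three
# times the gain from 2 to 3 (~0.00062); every later gain is smaller still.
```

By this rule the biggest single improvement comes from splitting the data into 2 clusters, which matches the reading above that the drop from 2 clusters to 1 is the sharp one.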
I1 shows a very dramatic change in scores here, which might make sense since it is a "larger" score; that is, it is based on all pairwise similarities rather than on similarities to the centroid (which is what I2 uses). And of course, there is also the issue of the agglomerative and graph-based criterion functions! Note that I1, I2, E1, E2, H1, and H2 can be used with partitional or agglomerative methods, but not graph-based ones. The classically well-known methods of single link, complete link, and average link can only be used with agglomerative clustering. So we'll investigate those combinations as well...

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
