The Internal Criterion functions in Cluto (used by SenseClusters) focus on
measuring "intra-cluster" similarity. That is, they look at each cluster
and find out how closely related the contexts in each cluster are, without
respect to anything happening in the other clusters. So the measures I1
and I2 give you an overall measure of how "tight" each cluster is in a
particular clustering solution without regard to how well separated the
clusters might be. For this reason they are called "local" criterion
functions.

Keep in mind that Cluto seeks to optimize the criterion function, such
that at each step in the clustering process, it chooses the clustering of
contexts that maximizes the given criterion function. The I1 and I2
criterion functions can be used with any clustering method supported in
Cluto except for the graph method, which has its own special criterion
function.

I1 seeks to maximize the sum of the average pairwise similarities between
the contexts assigned to each cluster. It weights the score for each
cluster by the size of the cluster, so as to avoid giving greater weight
to larger clusters (where "larger" is expressed in terms of the number of
contexts in that cluster).
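
Written as a formula, this description of I1 can be sketched roughly as
follows. The notation is an assumption introduced here for illustration:
k is the number of clusters, S_r is the set of contexts in cluster r,
n_r is the number of contexts in S_r, and sim is whichever similarity
measure is in use.

```latex
I_1 = \sum_{r=1}^{k} \frac{1}{n_r} \sum_{u, v \in S_r} \mathrm{sim}(u, v)
```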

The pairwise similarities may be computed using the cosine or Pearson's
correlation coefficient in the case of real valued vectors, and if you are
using binary vectors you may also use the match, jaccard, dice, or overlap
coefficient. The criterion function works exactly the same way regardless
of the similarity measurement you are using.
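
For binary vectors, three of the coefficients named above can be
sketched with their usual textbook definitions. This is only an
illustration: the function names are hypothetical, SenseClusters and
Cluto may define or scale these measures differently, and the match
coefficient is left out because its definition varies between sources.

```python
def to_set(bits):
    """Indices of the 1-valued positions in a binary vector."""
    return {i for i, b in enumerate(bits) if b}

def jaccard(x, y):
    """Shared features divided by features present in either context."""
    return len(x & y) / len(x | y)

def dice(x, y):
    """Twice the shared features divided by the total feature count."""
    return 2 * len(x & y) / (len(x) + len(y))

def overlap(x, y):
    """Shared features divided by the size of the smaller context."""
    return len(x & y) / min(len(x), len(y))

# Two toy contexts over four binary features (made-up data).
a = to_set([1, 1, 0, 1])
b = to_set([1, 0, 0, 1])
scores = (jaccard(a, b), dice(a, b), overlap(a, b))
```

Whichever of these is chosen, the criterion function only ever sees the
resulting similarity values.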

So, for I1 you take each cluster and measure the pairwise similarities
between all the contexts in that cluster, using whatever similarity
measure you choose that is appropriate for your data. (Note that the
SenseClusters web interface provides guidance on this point by only
allowing you to choose appropriate similarity types for the data you
have.) Cluto computes ALL pairwise similarities in a cluster, so if
you picture the contexts as points and edges as representing where a
similarity value is computed, then the resulting graph is completely
connected. The scores for each pairwise similarity are summed, and then
divided by the number of contexts in that cluster. This value is computed
for each cluster, and then they are summed together to get the I1
criterion function value. Cluto will find the clustering solution that
maximizes the I1 value, meaning it will find the solution at each step
that results in the greatest pairwise similarity between the contexts in
each cluster, with no bias given in the overall computation to larger
clusters (due to the averaging by the size). I *think* this tends to
result in finding clusters of similar size (although I am not sure of
this last point).
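
The I1 computation described above can be sketched in a few lines of
Python. This is a minimal illustration, not Cluto's implementation: the
names cosine and i1 are hypothetical, cosine is just one possible
similarity measure, and the sketch assumes that "all pairwise
similarities" means every ordered pair of contexts in a cluster,
including each context paired with itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def i1(clusters, sim=cosine):
    """Sum over clusters of (sum of pairwise similarities) / (cluster size)."""
    total = 0.0
    for contexts in clusters:
        # Assumption: ordered pairs, self-pairs included.
        pair_sum = sum(sim(u, v) for u in contexts for v in contexts)
        total += pair_sum / len(contexts)
    return total

# Two ways of splitting the same three toy contexts into two clusters.
tight = [[(1, 0), (1, 0)], [(0, 1)]]   # identical contexts grouped together
mixed = [[(1, 0), (0, 1)], [(1, 0)]]   # unrelated contexts grouped together

score_tight = i1(tight)   # 3.0
score_mixed = i1(mixed)   # 2.0
```

The solution that groups the identical contexts together scores higher,
which is the solution a method maximizing I1 would prefer.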

I2 is similar, except that rather than measuring pairwise similarities
between each context in each cluster, it finds the centroid of a
cluster and then computes the pairwise similarity between each context
in that cluster and the centroid. (This is very similar to what the
classical K-means algorithm does.) Note that I2 does not scale the value
for each cluster by the size, so larger clusters may have a greater
weight in the outcome and therefore we are more likely to find clusters
of different sizes.
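
As a rough formula, this description of I2 might be written as shown
here. The notation is an assumption made for illustration: k is the
number of clusters, S_r is the set of contexts in cluster r, n_r is its
size, C_r is the centroid of cluster r, and sim is the similarity
measure in use.

```latex
I_2 = \sum_{r=1}^{k} \sum_{v \in S_r} \mathrm{sim}(v, C_r),
\qquad
C_r = \frac{1}{n_r} \sum_{u \in S_r} u
```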

So, for I2 the centroid of a cluster is found (by taking the average
of all the contexts in a cluster) and then the pairwise similarity
between each context and that centroid is measured (by whatever measure
you are using for similarity). A value is found for each cluster, and
then these are summed to find the overall criterion function. The value
found for each cluster is, interestingly enough, the square root of the
sum of the pairwise similarities between all of the contexts in that
cluster. This can be seen as a bridge of sorts between understanding I1
and I2, both of which are measuring similarity within clusters. I1
measures all pairwise similarities, while I2 measures similarities between
contexts and the centroid (which is the average of all the contexts in
the cluster).
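
The link between the two views of I2 can be checked numerically. The
sketch below is only an illustration under stated assumptions: it uses
cosine similarity, scales each context to unit length before averaging,
counts every ordered pair of contexts (self-pairs included), and uses
hypothetical function names. It is not Cluto's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def i2_centroid(clusters):
    """Sum of the cosine between each context and its cluster centroid."""
    total = 0.0
    for contexts in clusters:
        vecs = [unit(v) for v in contexts]
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        total += sum(cosine(v, centroid) for v in vecs)
    return total

def i2_pairwise(clusters):
    """Sum over clusters of sqrt(sum of all pairwise cosines)."""
    total = 0.0
    for contexts in clusters:
        pair_sum = sum(cosine(u, v) for u in contexts for v in contexts)
        total += math.sqrt(pair_sum)
    return total

# Toy clustering with made-up contexts.
clusters = [[(1, 0), (0, 1)], [(1, 1), (1, 0), (2, 1)]]
via_centroid = i2_centroid(clusters)
via_pairs = i2_pairwise(clusters)
```

Under these assumptions the two functions agree up to floating point
rounding, because both reduce to the length of the summed unit vectors
in each cluster.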

Thus, both I1 and I2 ignore the issue of how well separated the clusters
are. They simply try to find the "tightest" clusters possible by
maximizing pairwise similarity between contexts (I1) or contexts and the
centroid (I2).

Cluto tends to recommend the use of I2, and we have followed that
recommendation, although I think subsequent discussion will show that
there are other interesting choices.

Note that the formulation of I2 might explain why we sometimes get
clusters where most of the contexts are in one cluster. My hypothesis
would be that if you want more balanced clusters, perhaps I1 is more
appropriate.

Please do ask any questions that might arise in reading this discussion
of I1 and I2. Discussions of the other measures will follow soon.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse


_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
