The Internal Criterion functions in Cluto (used by SenseClusters) focus on measuring "intra-cluster" similarity. That is, they look at each cluster and find out how closely related the contexts in that cluster are, without respect to anything happening in the other clusters. So the measures I1 and I2 give you an overall measure of how "tight" each cluster is in a particular clustering solution, without regard to how well separated the clusters might be. For this reason they are called "local" criterion functions.
Keep in mind that Cluto seeks to optimize the criterion function, such that at each step in the clustering process it chooses the clustering of contexts that maximizes the given criterion function. The I1 and I2 criterion functions can be used with any clustering method supported in Cluto except for the graph method, which has its own special criterion function.
I1 seeks to maximize the sum of the average pairwise similarities between the contexts assigned to each cluster. It divides the score for each cluster by the size of that cluster, so as to avoid giving greater weight to larger clusters (where "larger" is expressed in terms of the number of contexts in that cluster). The pairwise similarities may be computed using the cosine or Pearson's correlation coefficient in the case of real valued vectors, and if you are using binary vectors you may also use the match, jaccard, dice, or overlap coefficient. The criterion function works exactly the same way regardless of the similarity measure you are using.
So, for I1 you take each cluster and measure the pairwise similarities between all the contexts in that cluster, using whatever similarity measure you choose that is appropriate for your data. (Note that the SenseClusters web interface provides guidance on this point by only allowing you to choose similarity types that are appropriate for the data you have.) Cluto computes ALL pairwise similarities in a cluster, so if you picture the contexts as points, with an edge wherever a similarity value is computed, then the resulting graph is completely connected. The pairwise similarity scores are summed, and the sum is then divided by the number of contexts in that cluster. This value is computed for each cluster, and the per-cluster values are then summed to get the I1 criterion function value.
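The I1 computation described above can be sketched in a few lines of Python. This is only an illustration and not Cluto's own code: the function name `i1_criterion` and the toy contexts are made up for the example, it assumes cosine similarity, and it assumes that "all pairwise similarities" counts each pair in both orders as well as each context paired with itself.

```python
import numpy as np

def i1_criterion(vectors, labels):
    """Sum over clusters of (sum of pairwise cosines) / (cluster size)."""
    X = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Scale each context vector to unit length so a dot product is a cosine.
    # (Assumes no context is the all-zero vector.)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        # Assumption: every pair is counted in both orders, plus self-pairs.
        sims = members @ members.T
        total += sims.sum() / len(members)
    return total

# Four toy contexts in a two-dimensional feature space (hypothetical data).
contexts = [[1, 0], [1, 1], [0, 1], [0, 2]]
grouped_by_direction = i1_criterion(contexts, [0, 0, 1, 1])
mixed = i1_criterion(contexts, [0, 1, 0, 1])
print(grouped_by_direction > mixed)
```

On this toy data the solution that keeps the two contexts pointing in the same direction together gets the higher I1 value, which is the solution a procedure maximizing I1 would prefer.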
Cluto will find the clustering solution that maximizes the I1 value, meaning that at each step it will find the solution that results in the greatest pairwise similarity between the contexts in each cluster, with no bias given in the overall computation to larger clusters (due to the averaging by the size). I *think* this tends to result in finding clusters of similar size (although I am not sure of this last point).
I2 is similar, except that rather than measuring pairwise similarities between all the contexts in each cluster, it finds the centroid of a cluster and then computes the similarity between each context in that cluster and the centroid. (This is very similar to what the classical K-means algorithm does.) Note that I2 does not scale the value for each cluster by the size, so larger clusters may have a greater weight in the outcome, and therefore we are more likely to find clusters of different sizes. So, for I2 the centroid of a cluster is found (by taking the average of all the contexts in that cluster) and then the similarity between each context and that centroid is measured (by whatever measure you are using for similarity). A value is found for each cluster, and then these are summed to find the overall criterion function. The value found for each cluster is, interestingly enough, the square root of the sum of the pairwise similarities between all of the contexts in that cluster (this holds when the similarity is the cosine between unit-length vectors).
This can be seen as a bridge of sorts between understanding I1 and I2, both of which are measuring similarity within clusters. I1 measures all pairwise similarities, while I2 measures similarities between contexts and the centroid (which is the average of all the contexts in the cluster). Thus, both I1 and I2 ignore the issue of how well separated the clusters are. They simply try to find the "tightest" clusters possible by maximizing the pairwise similarity between contexts (I1) or between contexts and the centroid (I2).
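Here is a small Python sketch of I2 computed two ways: once by comparing each context to the centroid of its cluster, and once as the square root of the sum of all pairwise similarities within each cluster. It is an illustration only, not Cluto's implementation. The function names and toy contexts are hypothetical, and it assumes cosine similarity on unit-length vectors, with pairs counted in both orders plus self-pairs.

```python
import numpy as np

def _unit_rows(vectors):
    # Scale each context vector to unit length (assumes no all-zero vectors).
    X = np.asarray(vectors, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def i2_from_centroids(vectors, labels):
    """Sum over clusters of the cosines between each context and the centroid."""
    X, labels = _unit_rows(vectors), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)  # average of the contexts in the cluster
        # Assumes the centroid is not the zero vector.
        centroid = centroid / np.linalg.norm(centroid)
        total += (members @ centroid).sum()  # no division by cluster size
    return total

def i2_from_pairwise(vectors, labels):
    """Sum over clusters of sqrt(sum of all pairwise cosines in the cluster)."""
    X, labels = _unit_rows(vectors), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += np.sqrt((members @ members.T).sum())
    return total

# Same hypothetical toy contexts as one might use to try out I1.
contexts = [[1, 0], [1, 1], [0, 1], [0, 2]]
labels = [0, 0, 1, 1]
print(np.isclose(i2_from_centroids(contexts, labels),
                 i2_from_pairwise(contexts, labels)))
```

The two functions agree under these assumptions, which is the bridge between I1 and I2: both are built from the same within-cluster pairwise similarities, with I1 dividing the sum by the cluster size and I2 taking its square root.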
Cluto tends to recommend the use of I2, and we have followed that recommendation, although I think subsequent discussion will show that there are other interesting choices. Note that the formulation of I2 might explain why we sometimes get solutions where most of the contexts are in one cluster. My hypothesis would be that if you want more balanced clusters, perhaps I1 is more appropriate. Please do ask any questions that might arise in reading this discussion of I1 and I2. Discussions of the other measures will follow soon.
Enjoy,
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
