Here is an additional run of the Mexico-Brazil data, where we look at the I1, E1, and H1 values for 1 to 10 clusters. Remember that I1 is the ball of string: we find all the pairwise similarities among the contexts in a cluster, and hope to maximize those. Again, we'll note that the pairwise similarity is highest at 10 clusters, which makes sense if you think about it: if you have 100 objects and you divide them into 100 clusters, then each cluster is maximally similar, since each consists of a single object and its score is based on self-similarity alone. As you create larger clusters (fewer in number), it is inevitable that the objects in each cluster become somewhat less similar. And when everything is in 1 cluster, that is the case where pairwise similarities are minimized (or pairwise differences are maximized).
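The intuition above can be sketched in a few lines of NumPy. This is my own illustration, not CLUTO's code: the function name `i1_score` and the choice of cosine similarity are assumptions, but the structure (per-cluster sum of pairwise similarities, scaled by cluster size) follows the usual I1 definition:

```python
import numpy as np

def i1_score(X, labels):
    """Sketch of the I1 criterion: for each cluster, sum all pairwise
    cosine similarities among its members (self-pairs included), scale
    by 1/n_r, then add the per-cluster totals together."""
    # Normalize rows to unit length so dot products are cosine similarities.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    total = 0.0
    for r in np.unique(labels):
        members = Xn[labels == r]
        total += (members @ members.T).sum() / len(members)
    return total

X = np.random.RandomState(0).randn(30, 5)
n = len(X)
# With every object in its own cluster, each cluster contributes its
# self-similarity of 1, so I1 hits its ceiling of n:
singletons = i1_score(X, np.arange(n))        # ~30.0
one_cluster = i1_score(X, np.zeros(n, dtype=int))  # much smaller
```

This reproduces the point made above: the score is maximal when every object sits alone in its own cluster, and drops as clusters are merged.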
So, for I1 we see a very dramatic dip when moving from 2 clusters to 1, from 70.8 to 30.7. When going from 3 clusters to 2, the dip was from 76.7 to 70.8. So, does this constitute a knee? Perhaps...

zori3.i1.output:1-way clustering: [I1=3.07e+01] [321 of 321]
zori3.i1.output:2-way clustering: [I1=7.08e+01] [321 of 321]
zori3.i1.output:3-way clustering: [I1=7.67e+01] [321 of 321]
zori3.i1.output:4-way clustering: [I1=8.18e+01] [321 of 321]
zori3.i1.output:5-way clustering: [I1=8.56e+01] [321 of 321]
zori3.i1.output:6-way clustering: [I1=8.90e+01] [321 of 321]
zori3.i1.output:7-way clustering: [I1=9.17e+01] [321 of 321]
zori3.i1.output:8-way clustering: [I1=9.42e+01] [321 of 321]
zori3.i1.output:9-way clustering: [I1=9.66e+01] [321 of 321]
zori3.i1.output:10-way clustering: [I1=9.90e+01] [321 of 321]

The E1 values are the same as in the previous run. I am thinking of E1 as the widely opened fan (a fan in the sense of what you hold in your hand to cool yourself). Now, here is where I need a word. Do you know those folding fans (Asian style, I believe) that are made of paper, with pieces of bamboo or something that hold the paper together and make the fan flexible? Well, what are those pieces called? They are key to my analogy! I will call them pieces, for lack of a better term, for now... The centroid of the collection is the middle "piece" of the fan, and the object is to stretch the fan out as much as possible, so that the other pieces are as far away from the centroid of the collection/fan as possible. Now, this is not a great analogy, because a fan is mostly two-dimensional, so imagine a three-dimensional fan. :) Anyway, I need to work on this visualization, I think. So...
zori3.e1.output:1-way clustering: [E1=3.19e+04] [321 of 321]
zori3.e1.output:2-way clustering: [E1=2.43e+04] [321 of 321]
zori3.e1.output:3-way clustering: [E1=2.17e+04] [321 of 321]
zori3.e1.output:4-way clustering: [E1=2.07e+04] [321 of 321]
zori3.e1.output:5-way clustering: [E1=1.99e+04] [321 of 321]
zori3.e1.output:6-way clustering: [E1=1.93e+04] [321 of 321]
zori3.e1.output:7-way clustering: [E1=1.88e+04] [321 of 321]
zori3.e1.output:8-way clustering: [E1=1.84e+04] [321 of 321]
zori3.e1.output:9-way clustering: [E1=1.80e+04] [321 of 321]
zori3.e1.output:10-way clustering: [E1=1.78e+04] [321 of 321]

Now, H1 is simply I1/E1. We are trying to find the clustering solution that has as tight a ball of string as possible over the most spread-out fan we can find. Note again that the maximum score is at 10 clusters, which follows the intuition previously described, and again points out that it's not the absolute score that matters, but rather the trend.

zori3.h1.output:1-way clustering: [H1=9.64e-04] [321 of 321]
zori3.h1.output:2-way clustering: [H1=2.91e-03] [321 of 321]
zori3.h1.output:3-way clustering: [H1=3.53e-03] [321 of 321]
zori3.h1.output:4-way clustering: [H1=3.87e-03] [321 of 321]
zori3.h1.output:5-way clustering: [H1=4.17e-03] [321 of 321]
zori3.h1.output:6-way clustering: [H1=4.47e-03] [321 of 321]
zori3.h1.output:7-way clustering: [H1=4.73e-03] [321 of 321]
zori3.h1.output:8-way clustering: [H1=4.99e-03] [321 of 321]
zori3.h1.output:9-way clustering: [H1=5.23e-03] [321 of 321]
zori3.h1.output:10-way clustering: [H1=5.47e-03] [321 of 321]

So when going from 2 clusters to 1 we go from .00291 to .000964, which is a sharp drop. From 3 to 2, by contrast, we go from .00353 to .00291. Is this a knee? Perhaps... So, it's not clear to me if I1 or I2 should be preferred.
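One way to make the knee judgment less eyeball-driven is to compare the successive gains in H1 as the number of clusters grows, using the H1 values reported above. This is just a sketch of one simple heuristic (pick the k with the largest single gain); it is not the only way to define a knee:

```python
# H1 values from the zori3.h1.output run above, keyed by number of clusters.
h1 = {1: 9.64e-04, 2: 2.91e-03, 3: 3.53e-03, 4: 3.87e-03, 5: 4.17e-03,
      6: 4.47e-03, 7: 4.73e-03, 8: 4.99e-03, 9: 5.23e-03, 10: 5.47e-03}

# gain[k] = how much H1 improves when moving from k-1 clusters to k.
gain = {k: h1[k] - h1[k - 1] for k in range(2, 11)}
knee = max(gain, key=gain.get)
# knee == 2: the gain from 1 to 2 clusters (~0.00195) is roughly three
# times the gain from 2 to 3 (~0.00062); every later gain is smaller still.
```

By this rule the biggest single improvement comes from splitting the data into 2 clusters, which matches the reading above that the drop from 2 clusters to 1 is the sharp one.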
I1 shows a very dramatic change in scores here, which might make sense since it is a "larger" score; that is, it is based on all pairwise similarities rather than on similarities to the centroid (which is what I2 uses). And of course, there is also the issue of the agglomerative and graph-based criterion functions! Note that I1, I2, E1, E2, H1, and H2 can be used with partitional or agglomerative methods, but not graph-based ones. The classically well-known methods of single link, complete link, and average link can only be used with agglomerative clustering. So we'll investigate those combinations as well...

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
