As I mentioned in an earlier posting, I attended the EuroLAN 2005
summer school in Cluj-Napoca, Romania, where I presented a 3 hour
tutorial on SenseClusters and conducted a 3 hour practical session.
The practical session was great fun and consisted of the "First
Transylvanian Bake Off", a friendly competition between about 20
groups of students using SenseClusters on a particular set of data.
I will provide a more detailed summary of that in the coming days,
since it was quite exciting and interesting. BTW, Cluj-Napoca is the
capital of Transylvania, hence the name of the event...

I learned a number of things during the tutorial, perhaps the most
important being that we need to give users a bit more information
about the various criterion functions that are included in
SenseClusters. These are not documented in SenseClusters itself,
which gives only a reference to the Cluto manual. The Cluto manual
in turn refers to another paper for more detailed information.
Sort of a second order relation there I guess. :)

In any case, I spent some time at EuroLAN looking at the various criterion
functions, and decided that it was time to document those in
SenseClusters. I will start by sending some summarizing information
to this list to work out any bugs or glitches in the discussion.

In clustering there are two crucial scores that are considered. The
first is the similarity measure, which is used to score the pairwise
similarity or difference between any two contexts. The choices include
the cosine, the Jaccard coefficient, etc. I don't think this is where
the problem lies, in that generally speaking when using real valued
feature vectors you must use the cosine, and that is often the case
for our data. When using binary data it is possible to use Jaccard,
etc., but these are fairly standard and not too difficult to understand.
However, we will provide a bit more description and information regarding
the similarity measurements that you can choose.
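
To make the two measures named above concrete, here is a minimal
sketch in Python. It is not SenseClusters or Cluto code; the function
names cosine and jaccard and the toy vectors are made up purely for
illustration, and contexts are assumed to be plain lists of feature
values.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two real valued feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty contexts
    return dot / (norm_u * norm_v)

def jaccard(u, v):
    """Jaccard coefficient between two binary feature vectors:
    features present in both, divided by features present in either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

# Two hypothetical contexts over the same four features.
real_a, real_b = [0.5, 0.0, 2.0, 1.0], [1.0, 0.0, 4.0, 2.0]
bin_a, bin_b = [1, 1, 0, 1], [1, 0, 1, 1]
print(cosine(real_a, real_b))
print(jaccard(bin_a, bin_b))
```

Note that the cosine ignores vector length (real_b is just real_a
doubled), while Jaccard only looks at which features are present.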

The big point of confusion though is the criterion functions. These are
what are used to measure the actual quality of the clustering. Some do
so on a local level (how 'tight' is each cluster without regard to its
separation from any other cluster), while others are more global
measures that try to consider both the tightness of clusters and their
overall separation from each other. These issues will be discussed in
more detail as we go along... In any case, the criterion functions are
known as I1, I2, I3, H1, H2, G1, G2, and I have observed that they appear
rather mysterious to the user. We have recommended I2 as a default, which
is reasonable but perhaps not the only or even best choice. So what I hope
to do in the coming days is to summarize what each of these criterion
functions offers, and how or when you might like to use them. This
information will eventually find its way into SenseClusters documentation,
so your comments are of course welcome.
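
As a rough illustration of the local versus global distinction, here
is a small Python sketch. To be clear, these are NOT the definitions
of I1, I2, H1, etc. from Cluto; the functions tightness and separation
are hypothetical stand-ins, chosen only to show what "tight" and
"well separated" could mean when contexts are vectors compared with
the cosine.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def tightness(cluster):
    """Local view: average cosine of each member to its own cluster
    centroid. Higher means a tighter cluster."""
    c = centroid(cluster)
    return sum(cosine(x, c) for x in cluster) / len(cluster)

def separation(clusters):
    """Global view: average cosine of each cluster centroid to the
    centroid of the whole collection. Lower means the clusters
    point in more distinct directions."""
    overall = centroid([x for cl in clusters for x in cl])
    return sum(cosine(centroid(cl), overall) for cl in clusters) / len(clusters)

# Four toy contexts, grouped sensibly and then badly.
good = [[[1, 0], [1, 0.1]], [[0, 1], [0.1, 1]]]
bad = [[[1, 0], [0, 1]], [[1, 0.1], [0.1, 1]]]
for name, clusters in (("good", good), ("bad", bad)):
    print(name, [tightness(cl) for cl in clusters], separation(clusters))
```

A purely local criterion would only look at the tightness values,
while a more global one would in some way combine them with a
separation score.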

There were some other interesting comments that I will share as well,
but the above seemed to be the most important point and the issue that
generated the most curiosity, so I'll pursue that first.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

