Dear All,

We recently started a project in which we look for clusters of
semantically related words in a literary corpus using unsupervised
clustering techniques based on co-occurrence of tokens within windows of a
certain size.
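
To make the setup concrete, here is a minimal Python sketch of what we mean
by co-occurrence within a window. It is only an illustration, not our actual
pipeline (we rely on NSP for the counts). The function name
window_cooccurrences, the toy sentence, and the choice to count unordered
pairs and to let a window of 3 pair each token with the next 2 tokens are
all assumptions made for the example.

```python
from collections import Counter


def window_cooccurrences(tokens, window=3):
    """Count unordered token pairs that fall within `window` tokens.

    Assumption for this sketch: a window of size 3 pairs each token
    with the 2 tokens that follow it.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                # Sort the pair so that word order is ignored.
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy input, invented for illustration only.
tokens = "the sea was calm and the sea was dark".split()
pair_counts = window_cooccurrences(tokens, window=3)

for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

In a real run the raw counts would then be turned into association scores
(log-likelihood, mutual information, and so on) before building vectors.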

In order to avoid reinventing the wheel, we thought we could use the
SenseClusters package. In particular, we were inspired by the message
Amruta sent a while ago about using SenseClusters for similar purposes.

However, after we created our co-occurrence vectors, we got stuck: we do
not know what to do next, and we do not know where to find the relevant
documentation.

Following Amruta's mail:
 
 1. The N-gram Statistics Package
 (http://www.d.umn.edu/~tpederse/nsp.html)
 creates the list of word pairs that co-occur in some window from
 each other and their association scores. Run programs count.pl,
 combig.pl and statistics.pl in order ! The output of statistics
 will be the list of word pairs that co-occur in some window and their
 association scores as computed by tests like log-likelihood, mutual
 information, chi-squared test etc.

=> We did this.
 
 2. Give the output of step 1 to wordvec.pl in SenseClusters Package
 (http://senseclusters.sourceforge.net/). This program will create
 a word-by-word association matrix that shows the co-occurrence
 vector of each word.
 
=> We did this, but we started being confused.

=> We ran wordvec.pl like this:

=> $ wordvec.pl --wordorder nocare --feats feats.txt \
        --dims dims.txt wordpairs.txt > firsttry.vec

The feature and dimension files are created by wordvec.pl, and they are
identical, which makes sense, since for now we are simply looking at the
co-occurrence of all words with all words.
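
As an illustration of why we would expect the two files to coincide, here
is a small Python sketch. It makes no claim about how wordvec.pl works
internally; it simply assumes that ignoring word order amounts to storing
every pair score symmetrically. The function name symmetric_vectors and the
scored pairs are made up for the example.

```python
from collections import defaultdict


def symmetric_vectors(scored_pairs):
    """Build a word-by-word matrix (dict of dicts), ignoring word order.

    Each (w1, w2, score) triple is stored in both directions, so every
    word that labels a row also labels a column. If a pair occurs more
    than once, the last score seen overwrites the earlier ones.
    """
    vectors = defaultdict(dict)
    for w1, w2, score in scored_pairs:
        vectors[w1][w2] = score
        vectors[w2][w1] = score
    return vectors


# Hypothetical association scores, not taken from our corpus.
scored_pairs = [
    ("sea", "calm", 4.2),
    ("sea", "dark", 3.1),
    ("calm", "night", 2.5),
]

vectors = symmetric_vectors(scored_pairs)
features = sorted(vectors)  # row labels
dimensions = sorted({w for row in vectors.values() for w in row})  # column labels

print(features == dimensions)
```

Under this assumption the row labels and the column labels are the same
set of words, which is consistent with the identical files we observe.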

 3. Cluster these word vectors with (give the output of step 2 to)
 vcluster program in Cluto http://www-users.cs.umn.edu/~karypis/cluto/
 to get clusters of words !
 
=> Here is where we get stuck. Given the output of the previous
=> step, where can we find documentation on a simple way to obtain
=> clusters from Cluto (or from any other package)?

Any advice greatly appreciated!

Best regards,

Marco Baroni & Sara Piccioni

SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni   
http://sslmit.unibo.it/~spiccioni


_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
