We are currently working on adding support for Latent Semantic Analysis 
(LSA) to SenseClusters. Here are a few notes on what we are up to. 

The basic distinction between LSA and what SenseClusters currently 
provides is with respect to how words/features can be represented. Right 
now SenseClusters is able to represent words based on the words they 
co-occur with in a corpus. So when we are doing word clustering 
(--wordclust option in discriminate.pl) we are essentially taking a word 
by word matrix that shows the co-occurrence behavior of words, and 
clustering the rows of that matrix. When we are doing second order context 
clustering (--context o2), we represent the context to be clustered by  
averaging together the word vectors associated with the words in the 
context. In both cases the key representation is a vector for each word 
that shows the other words with which it has occurred. 
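
As a rough illustration of the representation described above (this is not 
the actual SenseClusters code, which is written in Perl), here is a small 
Python sketch. It assumes that two words "co-occur" whenever they appear in 
the same context, which is a simplification; the function names and the toy 
data are made up for the example.

```python
def cooccurrence_vectors(contexts):
    """Build a word-by-word co-occurrence matrix.

    Assumption for this sketch: two words co-occur when they appear
    at different positions in the same context.
    Returns (vocab, matrix), where matrix[word] is that word's row.
    """
    vocab = sorted({w for ctx in contexts for w in ctx})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = {w: [0] * len(vocab) for w in vocab}
    for ctx in contexts:
        for i, w in enumerate(ctx):
            for j, other in enumerate(ctx):
                if i != j:
                    matrix[w][index[other]] += 1
    return vocab, matrix


def second_order_vector(context, matrix, dim):
    """Represent a context by averaging the vectors of its known words."""
    rows = [matrix[w] for w in context if w in matrix]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy corpus: each inner list is one context.
contexts = [["river", "bank", "water"], ["bank", "money", "loan"]]
vocab, matrix = cooccurrence_vectors(contexts)
context_vec = second_order_vector(["river", "water"], matrix, len(vocab))
```

Word clustering would then cluster the rows of `matrix`, while second order 
context clustering would cluster vectors like `context_vec`.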

Our LSA support will allow for the representation of features with respect 
to the contexts in which they occur. So when we are doing word clustering 
in LSA mode (--wordclust --lsa) we will cluster features (which may be 
unigrams/words, bigrams, or co-occurrences) based on the contexts in which 
they occur. The same applies to second order context clustering 
(--context o2 --lsa), where a context to be clustered will be represented 
by replacing each feature in that context with its corresponding vector 
of contexts and averaging those vectors together. 

In effect, word clustering and second order context clustering with LSA are 
based on a feature by context co-occurrence matrix, while standard 
SenseClusters is based on a word by word co-occurrence matrix. LSA mode is 
therefore more general, in that it will support "features" rather than 
just words/unigrams in these modes.
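
To make the feature by context matrix concrete, here is a minimal Python 
sketch (again, not the SenseClusters implementation). It assumes features 
are unigrams or contiguous bigrams written as tuples of tokens, and that 
each cell counts how often a feature occurs in a context; co-occurrence 
features with intervening words are left out. All names and data are 
hypothetical.

```python
def count_feature(feature, context):
    """Count occurrences of a feature (a tuple of tokens) in a context."""
    n = len(feature)
    return sum(
        1
        for i in range(len(context) - n + 1)
        if tuple(context[i:i + n]) == feature
    )


def feature_by_context(contexts, features):
    """Rows are features, columns are contexts, cells are counts."""
    return {f: [count_feature(f, ctx) for ctx in contexts] for f in features}


def lsa_context_vector(context, features, matrix):
    """Average the context vectors of the features found in a context."""
    rows = [matrix[f] for f in features if count_feature(f, context) > 0]
    if not rows:
        return [0.0] * len(next(iter(matrix.values())))
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy data: three contexts, two unigram features and one bigram.
contexts = [
    ["the", "river", "bank"],
    ["the", "bank", "loan"],
    ["river", "bank", "erosion"],
]
features = [("bank",), ("river", "bank"), ("loan",)]
matrix = feature_by_context(contexts, features)
context_vec = lsa_context_vector(contexts[1], features, matrix)
```

Here each row of `matrix` describes a feature by the contexts it occurs in, 
so clustering the rows groups features, and `context_vec` is the LSA-style 
second order representation of the second context.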

Note that both standard SenseClusters and LSA mode support the use of 
Singular Value Decomposition (SVD). Thus, the range of functionality that 
will be supported by SenseClusters with the addition of LSA will include 
the following:

* find clusters of words based on determining which words they tend to 
co-occur with (with and without svd) *

--wordclust
--wordclust --svd

* find clusters of features based on determining which contexts they tend  
to occur in (with and without svd) *

--wordclust --lsa
--wordclust --svd --lsa

[Please note that while we will keep this option named as --wordclust,
when used with the --lsa option this in fact will allow you to cluster
features, which can include unigrams/words, bigrams, or co-occurrences.]

* represent a context to be clustered by averaging together vectors that  
represent words and the words they co-occur with (with and without svd) *

--context o2 
--context o2 --svd

* represent a context to be clustered by averaging together vectors that  
represent features and the contexts in which they occur (with and without 
svd) *

--context o2 --lsa
--context o2 --svd --lsa

And of course, SenseClusters still has order1 context clustering, which 
represents contexts in terms of their features. 

These changes promise to add major new functionality to SenseClusters, so 
our plan is to roll them out in a series of releases. The first will update 
--wordclust to support LSA, and the second will update order 2 context 
clustering to support LSA. We expect the first in June and the second at 
some point soon thereafter. 

Please let us know if you have any questions or comments about these 
coming changes!

Thanks,
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse


