We are currently working on adding support for Latent Semantic Analysis (LSA) to SenseClusters. Here are a few notes on what we are up to.
The basic distinction between LSA and what SenseClusters currently provides lies in how words/features are represented.

Right now SenseClusters represents words based on the words they co-occur with in a corpus. So when we are doing word clustering (--wordclust option in discriminate.pl) we are essentially taking a word by word matrix that shows the co-occurrence behavior of words, and clustering the rows of that matrix. When we are doing second order context clustering (--context o2), we represent the context to be clustered by averaging together the word vectors associated with the words in the context. In both cases the key representation is a vector for each word that shows the other words with which it has occurred.

Our LSA support will allow features to be represented with respect to the contexts in which they occur. So when we are doing word clustering in LSA mode (--wordclust --lsa) we will cluster features (which may be unigrams/words, bigrams, or co-occurrences) based on the contexts in which they occur. The same applies to second order context clustering (--context o2 --lsa), where a context to be clustered will be represented by replacing each feature in that context with its corresponding vector of contexts and averaging those vectors.

In effect, word clustering and second order context clustering with LSA are based on a feature by context co-occurrence matrix, while standard SenseClusters is based on a word by word co-occurrence matrix. LSA is more general in that it will support "features" rather than just words/unigrams in these modes. Note that both standard SenseClusters and LSA mode support the use of Singular Value Decomposition (SVD).
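To make the difference between the two representations concrete, here is a minimal Python sketch. It is not SenseClusters code: the toy corpus, the function names, and the restriction to unigram features are all assumptions made for illustration, and SVD is left out. It builds a word by word matrix and a feature by context matrix from the same three contexts, then forms a second order representation of a context by averaging the rows of its features.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one context.
contexts = [
    "river bank water",
    "bank loan money",
    "river water fish",
]
tokenized = [c.split() for c in contexts]
vocab = sorted({w for toks in tokenized for w in toks})


def word_by_word(tokenized, vocab):
    """One row per word: how often each other word shares a context with it."""
    rows = {w: Counter() for w in vocab}
    for toks in tokenized:
        present = set(toks)
        for w in present:
            for v in present:
                if v != w:
                    rows[w][v] += 1
    return {w: [rows[w][v] for v in vocab] for w in vocab}


def feature_by_context(tokenized, features):
    """One row per feature: how often it occurs in each context.

    Only unigram features are used here; bigrams or co-occurrences
    would need their own matching logic.
    """
    return {f: [toks.count(f) for toks in tokenized] for f in features}


def second_order(context_tokens, rows):
    """Represent a context as the average of the rows of its known features."""
    vecs = [rows[t] for t in context_tokens if t in rows]
    if not vecs:
        return []
    return [sum(col) / len(vecs) for col in zip(*vecs)]


wbw = word_by_word(tokenized, vocab)          # word by word rows
fbc = feature_by_context(tokenized, vocab)    # feature by context rows

print(vocab)
print(wbw["bank"])   # co-occurrence counts against each vocabulary word
print(fbc["bank"])   # occurrence counts in each of the three contexts
print(second_order(["river", "bank"], wbw))
print(second_order(["river", "bank"], fbc))
```

In the word by word case the rows are as long as the vocabulary; in the feature by context case they are as long as the number of contexts. Clustering the rows corresponds roughly to --wordclust, and clustering the averaged vectors corresponds roughly to --context o2, without and with --lsa respectively.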
Thus, the range of functionality that SenseClusters will support with the addition of LSA includes the following:

* find clusters of words based on determining which words they tend to co-occur with (with and without svd)
  * --wordclust
    --wordclust --svd

* find clusters of features based on determining which contexts they tend to occur in (with and without svd)
  * --wordclust --lsa
    --wordclust --svd --lsa

  [Please note that while we will keep this option named --wordclust, when used with the --lsa option it will in fact allow you to cluster features, which can include unigrams/words, bigrams, or co-occurrences.]

* represent a context to be clustered by averaging together vectors that represent words and the words they co-occur with (with and without svd)
  * --context o2
    --context o2 --svd

* represent a context to be clustered by averaging together vectors that represent features and the contexts in which they occur (with and without svd)
  * --context o2 --lsa
    --context o2 --svd --lsa

And of course, SenseClusters still has order1 context clustering, which represents contexts in terms of their features.

These additions promise to bring major new functionality to SenseClusters, so our plan is to roll them out in a series of releases. The first will update --wordclust to support LSA, and the second will update order 2 context clustering to support LSA. We expect the first in June and the second at some point soon thereafter.

Please let us know if you have any questions or comments about these coming changes!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
