I realize most of you can't attend, but all the same I wanted to send a
copy of Amruta's Master's thesis defense announcement. This marks a
transition of sorts, as Amruta will be joining the Intelligent Systems
Program at the University of Pittsburgh this coming Fall. However, we have
plans to continue SenseClusters development from both Duluth and
Pittsburgh, so while Amruta is finishing her degree here in Duluth, that
doesn't mean that SenseClusters is finished too. We expect it will have a
long and interesting life. :)
Ted
---------
Unsupervised Word Sense Discrimination by Clustering Similar Contexts
Amruta Purandare
Master's Thesis Defense
Thursday, July 8, 2004
1:00 pm // Heller Hall 306
Department of Computer Science
University of Minnesota, Duluth
Word sense discrimination is the problem of identifying different contexts
that refer to the same meaning of an ambiguous word. For example, given
multiple contexts that include the word 'sharp', we would hope to
discriminate between those that refer to an intellectual sharpness versus
those that refer to a cutting sharpness. Our methodology is based on the
strong contextual hypothesis of Miller and Charles (1991), which states
that "two words are semantically related to the extent that their
contextual representations are similar." This thesis presents
corpus-based unsupervised solutions that automatically group together
contextually similar instances of a word as observed in raw text. We do
not utilize any manually created or maintained knowledge-rich resources
such as dictionaries, thesauri or annotated corpora. As a result, our
approach is well suited to the fluid and dynamic nature of word meanings.
It is also portable to different domains and languages, and scales easily
to larger samples of text.
The overall objective of this thesis is to study the effect of various
feature types, context representations and clustering methods on the
accuracy of sense discrimination. We also apply dimensionality reduction
techniques to capture conceptual similarities among the contexts, rather
than relying only on the surface forms of words in the text. We present a
systematic comparison of various discrimination techniques proposed by
Pedersen and Bruce (1997) and Schutze (1998). We find that the first order
method of Pedersen and Bruce performs well with larger amounts of text, but
that the second order method of Schutze is more effective with smaller data
sets. We also discovered that a divisive approach is more suitable for
clustering smaller sets of contexts, while the agglomerative method performs
better on larger ones. We conducted experiments to study the effect
of using various sources of training data, and found that local contexts of a
word provide better discrimination features than running text such as complete
newspaper articles. We compared the performance of our knowledge-lean method
against that of a knowledge-intensive approach, and found that although
the latter was successful in conjunction with smaller datasets, it didn't show
significant improvements with larger data. This suggests that the features
learned from a large sample of text certainly have the potential to
outperform those learned from a knowledge-rich resource like a dictionary.