There is a very nice idea that Kevin Knight and Daniel Marcu discuss when talking about unsupervised learning. In their domain, this has to do with Machine Translation and word alignment. They present an exercise where you manually align words in two "languages" that are (apparently) made-up, nonsensical languages. You can see the full exercise here:

http://www.isi.edu/natural-language/mt/aimag97.ps

In general, this exercise forces you to see language the way the computer sees it when doing unsupervised learning, that is, without any additional background or real-world knowledge. This is in fact a very useful way to think about what SenseClusters is trying to do, because it likewise does not use any real-world or domain knowledge; it relies strictly on the text.

So, suppose we see contexts like:

    The big yellow dog is nuts.
    My cat went crazy.
    The computer fell off the shelf.

As humans, we can (possibly) cluster these based on the fact that we know cats and dogs are animals, and that a computer is inanimate. We also know that nuts and crazy are synonyms. But this of course gives us a false impression as to the ease of the problem. If you convert each word type into a random string, then your data really looks like this:

    Zyx clrg xlll ark abd daf.
    afaf weoi ckjl jkl.
    Zyx clllll jkfdjaffd zyx jlkdf.

Now we can see more clearly what SenseClusters is dealing with (and it's a mess :). Based on what we see above, the only similarity between the contexts is zyx, which is our new way of saying "the". But problems abound. For example, daf and jkl (nuts and crazy) are unrecognizable to us as synonyms.

So, I think it might be very useful to convert your data from time to time into a form like this, where you can't rely on your world knowledge to make distinctions. Then, try to cluster the data. You'll see what a tough job SenseClusters is sometimes faced with. :)

You know what, I like this idea so much I think we'll write a little program to do this for Senseval-2 formatted data. We'll keep you posted on that.
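In the meantime, the conversion described above (replacing each word type with a random string) can be sketched in a few lines of Python. This is only an illustration, not the program mentioned above and not part of SenseClusters: the function name scramble_contexts and its parameters are made up for this example, it works on plain strings rather than Senseval-2 formatted data, and it assumes that words are runs of letters and apostrophes and that "The" and "the" count as the same type.

```python
import random
import re
import string


def scramble_contexts(contexts, seed=0, min_len=3, max_len=8):
    """Replace every word type with a random lowercase string.

    The same word type always maps to the same random string, so
    co-occurrence patterns survive while world knowledge does not.
    Returns the scrambled contexts and the word -> string mapping.
    """
    rng = random.Random(seed)   # seeded so the result is repeatable
    mapping = {}                # word type -> random replacement
    used = set()                # replacements already handed out

    def random_token():
        # Draw random strings until we get one not used for another type.
        while True:
            length = rng.randint(min_len, max_len)
            token = "".join(rng.choice(string.ascii_lowercase)
                            for _ in range(length))
            if token not in used:
                used.add(token)
                return token

    def replace(match):
        word = match.group(0)
        key = word.lower()      # assumption: case does not distinguish types
        if key not in mapping:
            mapping[key] = random_token()
        token = mapping[key]
        # Keep an initial capital so sentence starts still look like "Zyx".
        return token.capitalize() if word[0].isupper() else token

    scrambled = [re.sub(r"[A-Za-z']+", replace, c) for c in contexts]
    return scrambled, mapping


contexts = [
    "The big yellow dog is nuts.",
    "My cat went crazy.",
    "The computer fell off the shelf.",
]
scrambled, mapping = scramble_contexts(contexts)
for line in scrambled:
    print(line)
```

Because the mapping is kept per word type, the replacement for "the" still shows up in the first and third contexts, while nothing ties the replacements for "nuts" and "crazy" together. That is exactly the situation described above.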
Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
