One thing to keep in mind about SenseClusters is that you can achieve tremendous variation in your experiments simply by changing around the tokenization a little.
Now, this is our standard tokenization file, and you get this automatically:

  /<head[^<]*>\s*\w+\s*<\/head>/
  /<sat.*>\s*\w+\s*<\/sat>/
  /\w+'\w+/
  /\w+/

These are Perl regular expressions, and they are basically saying that strings of the form

  <head>word</head>
  <sat>word</sat>
  word
  word's

will be identified as tokens. This corresponds to our standard view of words as space separated tokens. So if you have a context like

  His boat's are by the <head>water</head>

then you get the following tokens:

  His boat's are by the <head>water</head>

This is all well and good...but what about the following tokenization scheme?

  /<head[^<]*>\s*\w+\s*<\/head>/
  /\w\w\w/

What is this doing? Well, it considers tokens to be <head>word</head> tags and also 3 character sequences! We call this a poor man's stemmer. :) So in the above example, we'd end up with the following tokens:

  His boa are the <head>water</head>

Now, what is interesting about this is that you would then go on to identify features based on these three character sequences (for example, if you used bigrams they would be two three character sequences that occurred in order).

Here I did an experiment with the Mexico-Brazil data where I used 4 character sequences as tokens, and then made up bigrams of these four character tokens:

  http://marimba.d.umn.edu/SC-htdocs/gram4-mexico-brazil1123242792/

What is fairly interesting about this is that I got a rather nice result of F-measure 71% using these features:

  http://marimba.d.umn.edu/SC-htdocs/gram4-mexico-brazil1123242792/gram4-mexico-brazil.report

If I do "normal" tokenization, then the result drops to 61%!

  http://marimba.d.umn.edu/SC-htdocs/user1123243053/

To be clear, the only difference between the experiment above and this one is how the tokenization was performed!

  http://marimba.d.umn.edu/SC-htdocs/user1123243053/user.report

Now, you can also go in the other direction, and have tokens be represented by more than one word.
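If you want to play with this idea outside of SenseClusters, here is a minimal sketch of the behavior in Python (SenseClusters itself is Perl; the `tokenize` helper below is just an illustration I made up, scanning left to right with the earlier patterns taking priority):

```python
import re

context = "His boat's are by the <head>water</head>"

# The standard token definitions, in priority order.
standard = [r"<head[^<]*>\s*\w+\s*</head>", r"<sat.*>\s*\w+\s*</sat>",
            r"\w+'\w+", r"\w+"]

# The "poor man's stemmer": <head> tags plus 3 character sequences.
stemmer = [r"<head[^<]*>\s*\w+\s*</head>", r"\w\w\w"]

def tokenize(text, patterns):
    # Join the patterns into one alternation so that earlier patterns
    # win at each position, then collect non-overlapping matches.
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))
    return combined.findall(text)

print(tokenize(context, standard))
# -> ['His', "boat's", 'are', 'by', 'the', '<head>water</head>']
print(tokenize(context, stemmer))
# -> ['His', 'boa', 'are', 'the', '<head>water</head>']
```

Note how the stemmer scheme drops "by" entirely (it is only two characters) and clips "boat's" down to "boa", which is exactly why it acts like a crude stemmer.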
For example, suppose you use the following tokenization scheme:

  /<head[^<]*>\s*\w+\s*<\/head>/
  /\w+\s+\w+/

So here your tokens are two word sequences, so if you find bigrams, for example, they are made up of two two word sequences that occur in a particular order. Now, it seems like for this kind of feature you need a lot more data, so I ran on a larger set of the Mexico-Brazil data, and I used a lower frequency cutoff. The results aren't good at all (52%), but the idea here is simply to give an example of how different the feature space looks. Here is the directory of output files:

  http://marimba.d.umn.edu/SC-htdocs/user1123243701/

and here you can see the "bigrams of bigrams" features...

  http://marimba.d.umn.edu/SC-htdocs/user1123243701/user.bigrams

Now we have a 2nd order representation where we have a bigram by bigram matrix, and the context is represented by an averaged set of bigram vectors that represent which bigrams each bigram in the context co-occurs with.

Is this a good idea? Well, that I don't know. But it's certainly a different representation than you might expect, and in certain sorts of text it might be a nice choice.

In any case, I find this to be very interesting, and strongly encourage you to contemplate the use of alternative tokenization schemes. I'm especially encouraged by the results with the 4 character tokens, which are quite good really.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
