Hi Anagha,

Thanks for your THRC.xml file and the token.regex file. The problem you reported had to do with features discovered by SenseClusters not being found in the text you were seeking to cluster, largely due to differences in tokenization. Among your features you might have a bigram like
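To make the mismatch concrete, here is a small Python sketch (not part of SenseClusters, purely illustrative; the regex below only stands in for whatever your token.regex actually defines) of why a bigram feature can fail to match when the test text is tokenized differently:

```python
import re

text = 'I saw "Beautiful Mind" yesterday.'

# Naive whitespace tokenization, as if no tokenization were applied:
raw_tokens = text.split()

# Tokenization in the spirit of a token file that separates punctuation
# (your actual token.regex may differ -- this is only an illustration):
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

def has_bigram(tokens, w1, w2):
    """True if w1 and w2 occur as adjacent tokens."""
    return any(a == w1 and b == w2 for a, b in zip(tokens, tokens[1:]))

print(has_bigram(raw_tokens, "Beautiful", "Mind"))    # False: '"Beautiful' != 'Beautiful'
print(has_bigram(regex_tokens, "Beautiful", "Mind"))  # True: quotes become separate tokens
```

The bigram is present in both versions of the text, but in the untokenized version the quote character is glued to the word, so an exact token match fails.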
Beautiful Mind, but your test data might have adjacent punctuation, such as "Beautiful Mind" (with the quotes), and as a result the feature would not be matched.

So, before I get into a very long explanation, here is the solution, I think: you need to run preprocess.pl on your input text, to get it tokenized properly, prior to using discriminate.pl or the web interface.

The feature identification programs that come from NSP (count and statistic) do tokenization with your given --token file. order1vec and order2vec (which do feature matching) do not; order1vec takes the input file as given and does the feature matching. So, in effect, you have NSP providing you with features that have been identified using your token.regex tokenization scheme, while order1vec is matching features in text that has been tokenized differently (or not at all)!

So, if you run

  preprocess.pl --token token.regex THRC.xml

you will get two files, named

  T_R.xml
  T_R.count

They will be tokenized identically; the .xml file will be in sval2 format, and the .count file will be in plain text format. You could run NSP on the .count file to get features, and then match those features in T_R.xml, and I think all will be well. Or, if you are using discriminate.pl or the web interface, you should use T_R.xml as your input and essentially discard T_R.count. A count-style file will be created for you again, simply by stripping away the xml tags.

I went back into discriminate.pl and verified that preprocess.pl is not called. I also verified that order1vec.pl does not accept a --token file as an option, and therefore must not be doing tokenization. So I am nearly certain of my explanation and solution, but you should double check, of course.
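As a rough sketch of what "stripping away the xml tags" amounts to (the real SenseClusters code is Perl and handles the sval2 format in more detail; the line of data below is made up for illustration), in Python:

```python
import re

# A made-up fragment in the spirit of sval2-style markup (illustrative only):
sval2_line = '<context> I saw <head>line</head> again </context>'

# Removing anything that looks like an xml tag leaves just the plain tokens,
# which is roughly how a count-style file is recovered from the .xml file.
plain = re.sub(r"<[^>]+>", "", sval2_line)
plain = " ".join(plain.split())  # normalize the leftover whitespace
print(plain)  # -> 'I saw line again'
```

Because the tags are simply removed and the tokens are untouched, the .xml file and the derived count-style text remain tokenized identically, which is exactly what feature matching needs.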
As a matter of procedure, then, I would suggest the following: when using discriminate.pl or the web interface, think of the --token file as representing how your test and training data have **already been** tokenized, and if they have not been tokenized in that fashion, then you should use preprocess.pl beforehand to get your data into the right format. In general, I would strongly recommend running preprocess.pl on any data that is going to be processed by SenseClusters, making sure to provide whatever tokenization file you use with preprocess.pl to SenseClusters as well.

There is of course a question here: why doesn't discriminate.pl or the web interface call preprocess.pl? The reason is that there are so many options with preprocess.pl that it just didn't seem feasible, and I think we wanted users to take responsibility for their tokenization choices, as they have a profound impact on later processing. We should probably make this clearer in the documentation, hence this rather lengthy note. :)

I hope this all makes sense, and that it actually fixes your problem. Let me know in either case!

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
