Dear Ted Pedersen,

I want to share my experiment and perhaps open a discussion. I used the begin.v.xml data, as Ted suggested, and the result of the experiment seems to be fine. I applied the default settings that the web interface offers. The result is as follows:
ENGLISH (# of instances is 255)
Precision = 42.46 (76/179)
Recall    = 29.80 (76/251+4)
F-Measure = 35.02

When I applied the sense clustering to a Turkish corpus that I crafted myself, I got the following result, which is very similar to the English one:

TURKISH (# of instances is 142)
Precision = 37.32 (53/142)
Recall    = 34.42 (53/142+12)
F-Measure = 35.81

Turkish is a difficult language, and a model built for English does not transfer easily to a Turkish corpus: Turkish has free word order, has vowel harmony, and is agglutinative. For example, the previous experiment covers the word "ara" (find OR relation OR break OR interval, etc.). This word "ara" can occur in many surface forms even for a single sense. In this experiment, the observed forms include:

- aradi
- aramadım
- arıyor
- arada
- arasında
- aralarında

and so on. So I think the information we need to resolve the sense is often attached to the stem as a suffix. Even so, the English and Turkish results are very similar to each other.

During the experiment, some questions arose in my mind:

- Can the sense clustering algorithm be applied to an agglutinative language in the same way it is applied to English?
- How should we read these precision, recall, and F-measure values? How do we use a baseline algorithm to evaluate the results?
- Does SenseClusters take the surface form of the word into consideration? As I mentioned, in Turkish a word can occur in different forms while keeping the same sense.

Best wishes,
savas

2008/12/6 Ted Pedersen <[email protected]>:
> Hi Savas,
>
> The data set you used isn't really intended to be used as input
> (directly) to SenseClusters.
> The English lexical sample data is a
> large collection that consists of many different words and their
> correct sense - in general SenseClusters would be expecting to process
> each of those sets of words individually. If you'd like to break that
> data down into a form where SenseClusters can better deal with it, you
> can use the preprocess.pl program to take care of that for you. More
> details on that here...
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/Toolkit/preprocess/sval2/preprocess.pl
>
> An even simpler alternative are the scripts in the /samples directory
> that will break that data apart into the individual samples per word
> and then run discriminate.pl on each of those words. You can find
> those described in more detail here:
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/README.samples.pod
>
> If you are just getting started with SenseClusters and would like to
> experiment with data for a single word (that is ready to run), you
> might want to try the begin.v.xml data, found here:
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/Data/begin.v-test.xml
>
> Or, you might want to try out some of the name discrimination data found here:
>
> http://www.d.umn.edu/~tpederse/namedata.html
>
> These have all been separated such that each file pertains to a
> different name.
>
> I hope this helps! If you have further questions it might be best to
> send to the senseclusters-users list - that way all developers see
> them, and you are likely to get the fastest possible response!
>
> Cordially,
> Ted
>
>>> Savas Yildirim wrote:
>>>>
>>>> Hi,
>>>> I am using the SenseClusters web interface, and I used Ted Pedersen's
>>>> data from his web page. In the end, I got a user.report file showing
>>>> the following result:
>>>>
>>>> Precision = 3.51 (302/8611)
>>>> Recall    = 3.51 (302/8611+0)
>>>> F-Measure = 3.51
>>>>
>>>> along with some tables, matches, etc.
>>>>
>>>> These precision, recall, and F-measure values seem to be very bad. Am
>>>> I using the program in a wrong way?
>>>>
>>>> This is the command I used:
>>>>
>>>> discriminate.pl "eng-lex-sample.training.xml" --format f16.06 --token
>>>> "token.regex" --feature bi --remove 5 --context o2 --clusters 10
>>>> --space vector --clmethod rb --crfun i2 --sim cos --label_remove 5
>>>> --label_stat ll --label_rank 10 --eval --prefix "user"
>>>>
>>>> How do I evaluate the result files (e.g. user.report)?
>>>>
>>>
>>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse

--
Savas Yildirim
Istanbul Bilgi University & Universität Tübingen

Postal address in Tübingen:
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstraße 19, Room 1.07
D-72074 Tübingen

Postal address in Istanbul:
Sisli 34440 Dolapdere
Kurtulusdere cad. No:47
Istanbul / Turkey
Phone: (0090) (212) 311 50 00

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
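[P.S. on reading the metrics above: precision is correct assignments over attempted instances, recall is correct assignments over all instances (the "+4" and "+12" in the denominators are unattempted instances), and F-measure is their harmonic mean. A minimal Python sketch, plugging in the numbers reported in this thread:]

```python
# Precision = correct / instances the system attempted to label.
# Recall    = correct / all instances (the "+4" / "+12" terms add the
#             instances the system left unlabelled to the denominator).
# F-measure = harmonic mean of precision and recall.

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# English run: 76 correct, 179 attempted, 251+4 = 255 total instances.
p_en = 76 / 179          # ~0.4246 -> 42.46%
r_en = 76 / (251 + 4)    # ~0.2980 -> 29.80%
print(f"English F = {100 * f_measure(p_en, r_en):.2f}")   # English F = 35.02

# Turkish run: 53 correct, 142 attempted, 142+12 = 154 total instances.
p_tr = 53 / 142          # ~0.3732 -> 37.32%
r_tr = 53 / (142 + 12)   # ~0.3442 -> 34.42%
print(f"Turkish F = {100 * f_measure(p_tr, r_tr):.2f}")   # Turkish F = 35.81
```

[For a baseline, a common sanity check is the majority-sense baseline: assign every instance to the most frequent sense and compute the same three numbers; a clustering result is only interesting if it beats that.]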
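[P.P.S. on the surface-form question raised above: SenseClusters works on tokens as they appear in the text, so one common workaround for an agglutinative language (a preprocessing step, not a SenseClusters feature) is to stem or lemmatize the corpus first, so that the forms of "ara" collapse to one token. A toy sketch with a hand-made stem list -- `STEMS` and `normalize` are hypothetical names for this illustration; a real pipeline would need a proper Turkish morphological analyzer, since vowel harmony changes the stem itself:]

```python
# Toy normalizer: map surface forms to a known stem by longest prefix
# match. Illustration only -- naive prefix matching cannot handle forms
# where the stem vowel changes (see "arıyor" below).

STEMS = ["ara"]  # hand-made stem list for this example

def normalize(token, stems=STEMS):
    """Return the longest known stem that prefixes the token, else the token."""
    matches = [s for s in stems if token.startswith(s)]
    return max(matches, key=len) if matches else token

forms = ["aradi", "aramadım", "arıyor", "arada", "arasında", "aralarında"]
print([normalize(f) for f in forms])
# ['ara', 'ara', 'arıyor', 'ara', 'ara', 'ara']
# Note that "arıyor" survives unchanged: its surface form does not start
# with "ara", which is exactly why real morphological analysis is needed.
```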
