Hi Savas,

Apologies for the rather slow response; the usual end-of-semester chaos has engulfed me. :)
See below for comments...

On Fri, Dec 12, 2008 at 11:21 AM, Savas Yildirim <[email protected]> wrote:
> Dear Ted Peterson,
>
> I want to share my experiment and maybe open a discussion.
> I used data named begin.v.xml as Dear Ted said. I think the result of
> the experiment seems to be ok. I applied default settings that web
> interface offers. The result is as follows
>
> ENGLISH
> (# of instances is 255)
> Precision = 42.46(76/179)
> Recall = 29.80(76/251+4)
> F-Measure = 35.02
>
> When I applied the sense cluster to a Turkish corpus that I crafted, I
> got following result which is very similar to previous one of English
>
> TURKISH
> (# of instances is 142)
> Precision = 37.32(53/142)
> Recall = 34.42(53/142+12)
> F-Measure = 35.81
>
> Turkish language is so hard and difficult language that the model for
> English is not easily applied to Turkish corpus, since it has
> free-word-order structure, has vowel harmony, and is agglutinative
> language etc. For example previous experiment covers the word "ara"
> (find OR relation OR break OR interval etc...). This word "ara" can
> occur in lots of surface structure even for only one sense.
> For this experiment, among the forms of this word are
> - aradi
> - aramadım
> - arıyor
> - arada
> - arasında
> - aralarında
> and so on....
>
> So I think that the information we need to solve sense is occasionally
> affixed to the main word structure as suffix.

I haven't seen too many results with Turkish, so this is quite interesting. For a word like this, all the different forms of "ara" could be included as target words, and then clustered to see if they fall into the same sense or not.

> However, compared to results for both English and Turkish, these
> results are very similar to each other....
>
> During the experiment, some questions arose in my mind,
> - Can the sense cluster algorithm be applied to an agglutinative
> language in a normal way english language is applied ?
Keeping in mind that my knowledge of agglutinative languages is somewhat limited, I believe the answer is yes. SenseClusters is strictly based on the corpus data and does not use any sort of underlying linguistic model, so it will work on any language that you are able to tokenize in a way that is reasonable to you. I suspect that with Turkish there might be a need to tokenize the data a bit, so that the units you are interested in are represented as space-separated strings. I'm not sure if that exactly makes sense, but the idea is that if words are tokenized so that the units of meaning that interest you come out as tokens, then SenseClusters can proceed exactly as it is. Remember too that you can define the tokenization scheme via the --token option (which allows you to specify your units of tokenization as a regular expression).

> - How can we read these recall, precision, f-measure. How we use a
> baseline algorithm to evaluate the results ?

Interesting question - a common baseline that I use is to cluster all the instances of a target word into a single cluster, and see how well that fares according to these measures. In many cases this can actually result in a fairly good f-measure, and can be a challenging baseline to beat. But of course we should be able to beat it, since otherwise we are better off not trying to solve the word sense discrimination problem. This does require data for which you have some idea of the correct clustering; otherwise the f-measure evaluation can't be performed.

> - Does the sense cluster take the surface form of the word into
> consideration. As I mentioned, in Turkish, a word can occur different
> forms along with same sense ?

Yes, SenseClusters relies entirely on the surface form. The different forms of a word can be identified as being of the same sense, and in fact synonyms can be identified in the same way.
So, you can have different surface forms of your target word and cluster those to see if they are being used in the same sense. For example, in English I could have target words of the form "line" and "queue", and I could cluster those to see how many of the uses of "line" and "queue" were of the same sense, and which were of different senses.

I hope this makes some sense. I'm not very familiar with Turkish so I'm not sure if these answers and analogies will be helpful, so please do keep asking questions if you have any doubts!

Cordially,
Ted

>
> Best wishes
> savas
>
>
> 2008/12/6 Ted Pedersen <[email protected]>:
>> Hi Savas,
>>
>> The data set you used isn't really intended to be used as input
>> (directly) to SenseClusters. The English lexical sample data is a
>> large collection that consists of many different words and their
>> correct sense - in general SenseClusters would be expecting to process
>> each of those sets of words individually. If you'd like to break that
>> data down into a form where SenseClusters can better deal with it, you
>> can use the preprocess.pl program to take care of that for you. More
>> details on that here...
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/Toolkit/preprocess/sval2/preprocess.pl
>>
>> An even simpler alternative are the scripts in the /samples directory
>> that will break that data apart into the individual samples per word
>> and then run discriminate.pl on each of those words.
>> You can find those described in more detail here:
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/README.samples.pod
>>
>> If you are just getting started with SenseClusters and would like to
>> experiment with data for a single word (that is ready to run), you
>> might want to try the begin.v.xml data, found here:
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/Data/begin.v-test.xml
>>
>> Or, you might want to try out some of the name discrimination data found
>> here:
>>
>> http://www.d.umn.edu/~tpederse/namedata.html
>>
>> These have all been separated such that each file pertains to a different
>> name.
>>
>> I hope this helps! If you have further questions it might be best to
>> send to the senseclusters-users list - that way all developers see
>> them, and you are likely to get the fastest possible response!
>>
>> Cordially,
>> Ted
>>
>>>> Savas Yildirim wrote:
>>>>>
>>>>> Hi,
>>>>> I am using SenseCluster Web Interface, I used Ted Petersen's data in
>>>>> his web page. At last, I got a user.report file showing following
>>>>> result
>>>>>
>>>>> Precision = 3.51(302/8611)
>>>>> Recall = 3.51(302/8611+0)
>>>>> F-Measure = 3.51
>>>>>
>>>>> And including some tables, matches etc...
>>>>>
>>>>> These precision, recall, and f-measure metrics seem to be very bad, Do
>>>>> I use the program in a wrong way ?
>>>>>
>>>>>
>>>>> This is my command used :
>>>>> discriminate.pl "eng-lex-sample.training.xml" --format f16.06 --token
>>>>> "token.regex" --feature bi --remove 5 --context o2 --clusters 10
>>>>> --space vector --clmethod rb --crfun i2 --sim cos --label_remove 5
>>>>> --label_stat ll --label_rank 10 --eval --prefix "user"
>>>>>
>>>>> How do I evaluate the result files,(e.g.
>>>>> user.report)
>>>>>
>>>>
>>>
>>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>
>
>
> --
> Savas Yildirim
> Istanbul Bilgi University & Universitat Tubingen
>
> Postal Address in Tuebingen:
> Seminar für Sprachwissenschaft
> Universität Tübingen
> Wilhelmstraße 19
> Room 1.07
> D-72074 Tübingen
>
> Postal Address in Istanbul:
> Sisli 34440 Dolapdere Kurtulusdere cad. No:47
> Istanbul / Turkey
> Phone:
> (0090) (212) 311 50 00
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
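
A note on reading the precision, recall, and F-measure lines quoted in this thread: the reported F-measure values are consistent with the harmonic mean of precision and recall. The following is a minimal Python sketch, not part of SenseClusters, that recomputes the English and Turkish figures from the counts in the reports. It assumes the recall denominator is the sum shown in parentheses (for example 251+4), which is inferred from the numbers rather than taken from the documentation. The helper names scores and single_cluster_baseline and the sense labels "s1" and "s2" are hypothetical. The baseline function illustrates the single-cluster baseline described above, under the assumption that the one cluster is credited with its most frequent gold sense.

```python
from collections import Counter


def scores(correct, clustered, total):
    """Return (precision, recall, f_measure) as percentages.

    correct   -- instances placed in a cluster that matches their gold sense
    clustered -- instances the system assigned to some cluster
    total     -- all instances considered for recall (assumed: the sum
                 shown in the report, e.g. 251 + 4)
    """
    precision = 100.0 * correct / clustered
    recall = 100.0 * correct / total
    # F-measure as the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def single_cluster_baseline(gold_senses):
    """Score for putting every instance into one cluster.

    Assumption: the single cluster is mapped to the most frequent gold
    sense, so only instances of that sense count as correct.
    """
    correct = Counter(gold_senses).most_common(1)[0][1]
    n = len(gold_senses)
    return scores(correct, n, n)


# Counts copied from the reports quoted in this thread.
english = scores(76, 179, 251 + 4)
turkish = scores(53, 142, 142 + 12)
print("English: P=%.2f R=%.2f F=%.2f" % english)
print("Turkish: P=%.2f R=%.2f F=%.2f" % turkish)

# Hypothetical gold labels, for illustration only: 6 of sense s1, 4 of s2.
baseline = single_cluster_baseline(["s1"] * 6 + ["s2"] * 4)
print("Baseline: P=%.2f R=%.2f F=%.2f" % baseline)
```

Run on the counts above, this reproduces the reported F-measures of 35.02 (English) and 35.81 (Turkish). Comparing a system's F-measure against the single-cluster score on the same gold data gives a rough sense of whether the clustering is doing better than no discrimination at all.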
