Hi Savas,

Apologies for the rather slow response; the usual end-of-semester
chaos has engulfed me. :)

See below for comments...

On Fri, Dec 12, 2008 at 11:21 AM, Savas Yildirim <[email protected]> wrote:
> Dear Ted Pedersen,
>
> I want to share my experiment and maybe open a discussion.
> I used the data named begin.v.xml, as Ted suggested. I think the result
> of the experiment seems to be OK. I applied the default settings that
> the web interface offers. The result is as follows:
>
> ENGLISH
> (# of instances is 255)
> Precision = 42.46(76/179)
> Recall = 29.80(76/251+4)
> F-Measure = 35.02
>
> When I applied SenseClusters to a Turkish corpus that I crafted, I got
> the following result, which is very similar to the previous one for English:
>
> TURKISH
> (# of instances is 142)
> Precision = 37.32(53/142)
> Recall = 34.42(53/142+12)
> F-Measure = 35.81
>
> Turkish is such a difficult language that a model built for
> English is not easily applied to a Turkish corpus, since it has
> free word order, has vowel harmony, and is an agglutinative
> language, etc. For example, the previous experiment covers the word "ara"
> (find OR relation OR break OR interval etc.). This word "ara" can
> occur in many surface forms, even for a single sense.
> For this experiment, the forms of this word include
> - aradi
> - aramadım
> - arıyor
> - arada
> - arasında
> - aralarında
> and so on....
>
> So I think that the information we need to resolve the sense is sometimes
> attached to the main word as a suffix.

I haven't seen many results for Turkish, so this is quite
interesting. For a word like this, all the different forms of "ara"
could be included as target words, and then clustered to see whether
they fall into the same sense or not.

> However, when I compare the results for English and Turkish, they
> are very similar to each other....
>
> During the experiment, some questions arose in my mind:
> -  Can the SenseClusters algorithm be applied to an agglutinative
> language in the same way it is applied to English?

Keeping in mind that my knowledge of agglutinative languages is
somewhat limited, I believe the answer is yes. SenseClusters is
strictly based on the corpus data and does not use any sort of
underlying linguistic model, so it will work on any language that you
are able to tokenize in a way that is reasonable to you. I suspect
that with Turkish there might be a need to preprocess the data a bit,
so that the units you are interested in are represented as
space-separated strings. The idea is that once the text is tokenized
into the units of meaning that are interesting to you, SenseClusters
can proceed exactly as it is. Remember too that you can define the
tokenization scheme via the --token option, which allows you to
specify your units of tokenization as regular expressions.
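
To make the tokenization idea concrete, here is a small Python sketch.
It is only an illustration of regex-driven tokenization, not the code
SenseClusters runs: the two patterns, the tokenize() function, and the
example sentence are all made up for this message, and I am assuming
the target word is marked with <head> tags.

```python
import re

# Hypothetical token definitions, tried in order at each position.
TOKEN_PATTERNS = [
    re.compile(r"<head>\w+</head>"),  # a marked-up target word stays one token
    re.compile(r"\w+"),               # any other run of letters or digits
]

def tokenize(text):
    """Split text into tokens using the first pattern that matches at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append(match.group())
                pos = match.end()
                break
        else:
            pos += 1  # nothing matches here (space, punctuation): skip one character
    return tokens

print(tokenize("dün seni <head>aradım</head>, ama evde yoktun."))
# ['dün', 'seni', '<head>aradım</head>', 'ama', 'evde', 'yoktun']
```

If you wanted suffixes to count as separate units, you would split
them off in a preprocessing step so that they show up as their own
space-separated strings before a scheme like this is applied.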

> -  How should we read these recall, precision, and F-measure values?
> How do we use a baseline algorithm to evaluate the results?

Interesting question - a common baseline that I use is to put all
the instances of a target word into a single cluster, and see how well
that fares according to these measures. In many cases this can
actually result in a fairly good F-measure, and can be a challenging
baseline to beat. But of course we should be able to beat it, since
otherwise we are better off not trying to solve the word sense
discrimination problem at all. This requires data for which you have
some idea of the correct clustering; otherwise the F-measure
evaluation can't be performed.
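
As a rough sketch of the arithmetic (in Python, with function names
invented for this message), the F-measure is the harmonic mean of
precision and recall, and the single-cluster baseline simply credits
the one cluster with the most frequent gold sense. I am assuming here
that the two fractions in the report lines are the precision and
recall denominators; this is not SenseClusters' own evaluation code.

```python
from collections import Counter

def scores(correct, attempted, total):
    """Return (precision, recall, F-measure) as percentages.

    correct   -- instances that ended up with their true sense
    attempted -- denominator of the reported precision
    total     -- denominator of the reported recall
    """
    precision = 100.0 * correct / attempted
    recall = 100.0 * correct / total
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_measure

def single_cluster_baseline(gold_senses):
    """Score the baseline that puts every instance into one cluster.

    The single cluster is credited with the most frequent gold sense.
    """
    total = len(gold_senses)
    majority = Counter(gold_senses).most_common(1)[0][1]
    return scores(majority, total, total)

# Figures quoted earlier in this thread
print("English: %.2f %.2f %.2f" % scores(76, 179, 251 + 4))   # 42.46 29.80 35.02
print("Turkish: %.2f %.2f %.2f" % scores(53, 142, 142 + 12))  # 37.32 34.42 35.81
```

Running single_cluster_baseline() over the gold sense tags of your
own data gives the number to beat.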

> -  Does SenseClusters take the surface form of the word into
> consideration? As I mentioned, in Turkish a word can occur in different
> forms with the same sense.

Yes, SenseClusters relies entirely on the surface form. The different
forms of a word can be identified as being of the same sense, and in
fact synonyms can be identified in the same way. So, you can have
different surface forms of your target word and cluster those to see
if they are being used in the same sense.

For example, in English I could have target words of the form "line"
and "queue", and I could cluster those to see how many of the uses of
"line" and "queue" were of the same sense, and which were of different
senses.
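
One simple way to inspect such a result is to cross-tabulate cluster
ids against the surface form of the target word. The Python sketch
below uses made-up assignments purely for illustration (they are not
output from any real run), and cross_tabulate() is a name invented
here.

```python
from collections import Counter, defaultdict

# Made-up assignments for illustration only: (target form, cluster id)
assignments = [
    ("line", 0), ("line", 0), ("line", 1), ("line", 2),
    ("queue", 0), ("queue", 0), ("queue", 0),
]

def cross_tabulate(pairs):
    """Count, for each cluster, how many instances of each surface form it holds."""
    table = defaultdict(Counter)
    for form, cluster in pairs:
        table[cluster][form] += 1
    return table

for cluster, counts in sorted(cross_tabulate(assignments).items()):
    print(cluster, dict(counts))
```

A cluster that holds instances of both forms suggests a shared
sense, while a cluster dominated by one form suggests a sense
specific to that form.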

I hope this makes some sense. I'm not very familiar with Turkish so
I'm not sure if these answers and analogies will be helpful, so please
do keep asking questions if you have any doubts!

Cordially,
Ted

>
> Best wishes
> savas
>
>
> 2008/12/6 Ted Pedersen <[email protected]>:
>> Hi Savas,
>>
>> The data set you used isn't really intended to be used as input
>> (directly) to SenseClusters. The English lexical sample data is a
>> large collection that consists of many different words and their
>> correct sense - in general SenseClusters would be expecting to process
>> each of those sets of words individually. If you'd like to break that
>> data down into a form where SenseClusters can better deal with it, you
>> can use the preprocess.pl program to take care of that for you. More
>> details on that here...
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/Toolkit/preprocess/sval2/preprocess.pl
>>
>> An even simpler alternative are the scripts in the /samples directory
>> that will break that data apart into the individual samples per word
>> and then run discriminate.pl on each of those words. You can find
>> those described in more detail here:
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/README.samples.pod
>>
>> If you are just getting started with SenseClusters and would like to
>> experiment with data for a single word (that is ready to run), you
>> might want to try the begin.v.xml data, found here:
>>
>> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/Data/begin.v-test.xml
>>
>> Or, you might want to try out some of the name discrimination data found 
>> here :
>>
>> http://www.d.umn.edu/~tpederse/namedata.html
>>
>> These have all been separated such that each file pertains to a different 
>> name.
>>
>> I hope this helps! If you have further questions it might be best to
>> send to the senseclusters-users list - that way all developers see
>> them, and you are likely to get the fastest possible response!
>>
>> Cordially,
>> Ted
>>
>>>> Savas Yildirim wrote:
>>>>>
>>>>> Hi,
>>>>> I am using the SenseClusters Web Interface with Ted Pedersen's data
>>>>> from his web page. In the end, I got a user.report file showing the
>>>>> following result:
>>>>>
>>>>> Precision = 3.51(302/8611)
>>>>> Recall = 3.51(302/8611+0)
>>>>> F-Measure = 3.51
>>>>>
>>>>> And including some tables, matches etc...
>>>>>
>>>>> These precision, recall, and F-measure values seem to be very bad. Am
>>>>> I using the program in the wrong way?
>>>>>
>>>>>
>>>>> This is my command used :
>>>>>  discriminate.pl "eng-lex-sample.training.xml" --format f16.06 --token
>>>>> "token.regex" --feature bi --remove 5 --context o2 --clusters 10
>>>>> --space vector --clmethod rb --crfun i2 --sim cos --label_remove 5
>>>>> --label_stat ll --label_rank 10  --eval --prefix "user"
>>>>>
>>>>> How do I evaluate the result files (e.g. user.report)?
>>>>>
>>>>
>>>
>>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>
>
>
> --
> Savas Yildirim
> Istanbul Bilgi University & Universitat Tubingen
>
> Postal Address in Tuebingen:
> Seminar für Sprachwissenschaft
> Universität Tübingen
> Wilhelmstraße 19
> Room 1.07
> D-72074 Tübingen
>
> Postal Address in Istanbul:
> Sisli 34440 Dolapdere Kurtulusdere cad. No:47
> Istanbul / Turkey
> Phone:
> (0090) (212) 311 50 00
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse