Dear Ted Pedersen,

I want to share my experiment and perhaps open a discussion.
I used the data named begin.v.xml, as Ted suggested, and the results of
the experiment seem to be OK. I applied the default settings that the
web interface offers. The results are as follows:

ENGLISH
(# of instances is 255)
Precision = 42.46 (76/179)
Recall = 29.80 (76/251+4)
F-Measure = 35.02

When I applied SenseClusters to a Turkish corpus that I crafted, I got
the following results, which are very similar to the English ones:

TURKISH
(# of instances is 142)
Precision = 37.32 (53/142)
Recall = 34.42 (53/142+12)
F-Measure = 35.81
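In case it helps to read these numbers, here is a small Python sketch that reproduces both score blocks. The formulas are my own assumption, based on the parenthesized counts in the report: precision = correct/attempted, recall = correct/total instances, and F-measure as their harmonic mean.

```python
def scores(correct, attempted, total):
    """Precision, recall and F-measure as percentages."""
    p = 100.0 * correct / attempted
    r = 100.0 * correct / total
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return round(p, 2), round(r, 2), round(f, 2)

print(scores(76, 179, 251 + 4))   # English: (42.46, 29.8, 35.02)
print(scores(53, 142, 142 + 12))  # Turkish: (37.32, 34.42, 35.81)
```

Both reported F-measures fall out of the reported precision/recall pairs, so the two experiments really do land at almost the same operating point despite the different languages.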

Turkish is a hard language to model: the approach used for English does
not transfer easily to a Turkish corpus, since Turkish has a
free-word-order structure, has vowel harmony, and is an agglutinative
language. For example, the previous experiment covers the word "ara"
(find OR relation OR break OR interval etc...). This word "ara" can
occur in lots of surface forms, even for only one sense.
In this experiment, the forms of this word include:
- aradi
- aramadım
- arıyor
- arada
- arasında
- aralarında
and so on...

So I think that the information we need to resolve the sense is often
attached to the word stem as a suffix.
Even so, when the results for English and Turkish are compared, they
are very similar to each other...
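To make the suffix point concrete, here is a deliberately naive Python sketch (NOT a real Turkish morphological analyzer; the suffix list is my own illustrative guess) that collapses the surface forms above to a shared stem by stripping the longest matching suffix:

```python
# Illustrative suffixes only; real Turkish morphology also involves
# vowel harmony and stem changes that this simple approach ignores.
SUFFIXES = sorted(
    ["dı", "di", "da", "de", "madım", "ıyor", "iyor",
     "sında", "sinde", "larında", "lerinde"],
    key=len, reverse=True)

def naive_stem(word):
    for suffix in SUFFIXES:  # try the longest suffixes first
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

forms = ["aradi", "aramadım", "arıyor", "arada", "arasında", "aralarında"]
print({f: naive_stem(f) for f in forms})
```

Even this toy version shows the difficulty: "arıyor" strips to "ar" rather than "ara", because the stem vowel changes before -(ı)yor. That is exactly the kind of morphological information that gets lost when each surface form is treated as a separate token.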

During the experiment, some questions arose in my mind:
-  Can the sense clustering algorithm be applied to an agglutinative
language in the same way it is applied to English?
-  How should we read these recall, precision, and f-measure values?
How do we use a baseline algorithm to evaluate the results?
-  Does SenseClusters take the surface form of the word into
consideration? As I mentioned, in Turkish a word can occur in different
forms with the same sense.
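On the baseline question, one simple reference point I know of (a generic WSD convention, not something SenseClusters itself reports as far as I can tell) is the majority-sense baseline: assign every instance the most frequent gold sense and score that. A clustering result is mainly interesting when it beats this number. A quick Python sketch with made-up labels:

```python
from collections import Counter

def majority_baseline(gold_senses):
    """Score obtained by always guessing the most frequent sense."""
    correct = Counter(gold_senses).most_common(1)[0][1]
    # every instance is attempted, so precision == recall == F-measure
    return round(100.0 * correct / len(gold_senses), 2)

# Hypothetical gold labels for five instances of "ara":
print(majority_baseline(["find", "find", "break", "interval", "find"]))  # 60.0
```

If the majority sense of a word covers, say, 60% of its instances, then an F-measure around 35 would actually be below that baseline, which is worth checking for both data sets.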

Best wishes
savas


2008/12/6 Ted Pedersen <[email protected]>:
> Hi Savas,
>
> The data set you used isn't really intended to be used as input
> (directly) to SenseClusters. The English lexical sample data is a
> large collection that consists of many different words and their
> correct sense - in general SenseClusters would be expecting to process
> each of those sets of words individually. If you'd like to break that
> data down into a form where SenseClusters can better deal with it, you
> can use the preprocess.pl program to take care of that for you. More
> details on that here...
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/Toolkit/preprocess/sval2/preprocess.pl
>
> An even simpler alternative are the scripts in the /samples directory
> that will break that data apart into the individual samples per word
> and then run discriminate.pl on each of those words. You can find
> those described in more detail here:
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/README.samples.pod
>
> If you are just getting started with SenseClusters and would like to
> experiment with data for a single word (that is ready to run), you
> might want to try the begin.v.xml data, found here:
>
> http://search.cpan.org/src/TPEDERSE/Text-SenseClusters-1.01/samples/Data/begin.v-test.xml
>
> Or, you might want to try out some of the name discrimination data
> found here:
>
> http://www.d.umn.edu/~tpederse/namedata.html
>
> These have all been separated such that each file pertains to a different 
> name.
>
> I hope this helps! If you have further questions it might be best to
> send to the senseclusters-users list - that way all developers see
> them, and you are likely to get the fastest possible response!
>
> Cordially,
> Ted
>
>>> Savas Yildirim wrote:
>>>>
>>>> Hi,
>>>> I am using SenseCluster Web Interface, I used Ted Petersen's data in
>>>> his web page. At last, I got a user.report file showing following
>>>> result
>>>>
>>>> Precision = 3.51(302/8611)
>>>> Recall = 3.51(302/8611+0)
>>>> F-Measure = 3.51
>>>>
>>>> And including some tables, matches etc...
>>>>
>>>> These precision, recall, and f-measure metrics seem to be very bad, Do
>>>> I use the program in a wrong way ?
>>>>
>>>>
>>>> This is my command used :
>>>>  discriminate.pl "eng-lex-sample.training.xml" --format f16.06 --token
>>>> "token.regex" --feature bi --remove 5 --context o2 --clusters 10
>>>> --space vector --clmethod rb --crfun i2 --sim cos --label_remove 5
>>>> --label_stat ll --label_rank 10  --eval --prefix "user"
>>>>
>>>> How do I evaluate the result files,(e.g. user.report)
>>>>
>>>
>>
>>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>



-- 
Savas Yildirim
Istanbul Bilgi University & Universitat Tubingen

Postal Address in Tuebingen:
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstraße 19
Room 1.07
D-72074 Tübingen

Postal Address in Istanbul:
Sisli 34440 Dolapdere Kurtulusdere cad. No:47
Istanbul / Turkey
Phone:
(0090) (212) 311 50 00
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
