On Sat, 05 Jun 2004 21:15:23 +0200
 Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Vladimir Yuryev wrote:

Hi, Andrzej!

How did you test the Polish texts, and with which stemmer?
Thanks,
Vladimir.


No reason to be too modest, Leo. I tested your stemmer on English, Swedish and Polish texts (including F-measure vs. training set size plots), and it works exceptionally well indeed. Highly recommended!

Well, I have several Polish-language corpora, which together amount to roughly 90,000 words (nouns and verbs), each having at least 4 inflected forms. This set is randomized (i.e. the lines of words + forms are in random order). I've split it into two parts - one of a fixed size, as a test set, and one of variable size, as a training set. Then I compile stemmer tables using a variable number of training examples and different settings (trie, multi-trie, different optimizations, etc.). For each output table I then test the precision/recall of correct base forms (lemmatization), and of the ability to create unique stems (stemming). Finally, I select the "best" table, i.e. the one that gives reasonably good results relative to table size. To put it in plain terms: for tables of roughly 300kB in size (created from a training set of 3000 unique words + their forms), in the best cases I get ~90% correct stems and ~70% correct lemmas. Which is a _very_ good result!
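
The procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the stemmer or the table compiler discussed in this thread: the toy suffix-rule learner stands in for table compilation, plain accuracy stands in for the precision/recall measurements, the English word list stands in for the Polish corpora, and all function names (split_corpus, train_suffix_rules, apply_rules, evaluate) are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Toy data (illustrative only): each entry is (base form, inflected forms).
TOY_ENTRIES = [
    ("walk", ["walk", "walks", "walked", "walking"]),
    ("talk", ["talk", "talks", "talked", "talking"]),
    ("jump", ["jump", "jumps", "jumped", "jumping"]),
    ("play", ["play", "plays", "played", "playing"]),
    ("look", ["look", "looks", "looked", "looking"]),
    ("call", ["call", "calls", "called", "calling"]),
]

def split_corpus(entries, test_size, train_size, seed=0):
    """Shuffle, hold out a fixed-size test set, take train_size of the rest."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    train = shuffled[test_size:test_size + train_size]
    return train, test

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def train_suffix_rules(train):
    """Stand-in for table compilation: learn 'form suffix -> lemma suffix'."""
    counts = defaultdict(Counter)
    for lemma, forms in train:
        for form in forms:
            k = common_prefix_len(form, lemma)
            counts[form[k:]][lemma[k:]] += 1
    # Keep the most frequent replacement for each form suffix.
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}

def apply_rules(rules, word):
    """Rewrite the longest suffix that has a rule; otherwise return the word."""
    for i in range(len(word) + 1):
        suffix = word[i:]
        if suffix in rules:
            return word[:i] + rules[suffix]
    return word

def evaluate(rules, test):
    """Return (share of forms mapped to the correct base form,
    share of words whose forms all map to one common output)."""
    total = correct = unique = 0
    for lemma, forms in test:
        outputs = [apply_rules(rules, form) for form in forms]
        total += len(forms)
        correct += sum(out == lemma for out in outputs)
        unique += len(set(outputs)) == 1
    return correct / total, unique / len(test)

if __name__ == "__main__":
    # Fixed test set, growing training set: one table per training size.
    for train_size in (1, 2, 4):
        train, test = split_corpus(TOY_ENTRIES, test_size=2,
                                   train_size=train_size, seed=1)
        rules = train_suffix_rules(train)
        lemma_acc, stem_acc = evaluate(rules, test)
        print(train_size, len(rules), lemma_acc, stem_acc)
```

Because the seed is fixed, the test set stays the same across training sizes, which is what makes the scores comparable; len(rules) plays the role of table size when picking the "best" trade-off.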


--
Best regards,
Andrzej Bialecki

Thanks for the detailed description of how you tested the Polish texts. It is very important to me.
Vladimir.




