Re: Stemmer Implementation Strategy - feedback?

Grant Ingersoll Mon, 07 Aug 2006 04:34:39 -0700

Hey Marios,

It sounds like you have a reasonable plan and you have thought through the ideas. And the answer to many of your questions below is "it depends".

Do you have enough memory to hold the whole lexicon in memory? Is this lexicon going to grow significantly over time? I have, in the past (for other lexicon based resources), done similar tricks to what Lucene does with the term dictionary if things get too large. Namely, store the terms in lexicographic order and load every Xth term in memory (X is 128 or 64) as an index into the file. This causes a little more searching than if everything is in memory, but you don't have a choice if the lexicon is really large.
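
To make the sparse-index trick concrete, here is a minimal Java sketch. It is only an illustration, not Lucene's actual term dictionary code: the class name SparseLexiconIndex and its methods are made up, and a sorted in-memory list stands in for the on-disk lexicon file so the example stays self-contained. Only every interval-th term is kept as the "index"; a lookup binary-searches that sample and then scans at most one block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a sparse index over a lexicographically sorted lexicon. */
class SparseLexiconIndex {
    // Stand-in for the on-disk file of sorted terms (assumption for this sketch).
    private final List<String> sortedTerms;
    // Every interval-th term; this is the only part meant to live in memory.
    private final List<String> sampled = new ArrayList<>();
    private final int interval;

    SparseLexiconIndex(List<String> terms, int interval) {
        this.sortedTerms = new ArrayList<>(terms);
        Collections.sort(this.sortedTerms);
        this.interval = interval;
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            sampled.add(sortedTerms.get(i));
        }
    }

    boolean contains(String term) {
        int pos = Collections.binarySearch(sampled, term);
        if (pos >= 0) {
            return true; // the term happens to be one of the sampled entries
        }
        // binarySearch returns -(insertionPoint) - 1, so this is the index
        // of the last sampled term that sorts before the requested term.
        int block = -pos - 2;
        if (block < 0) {
            return false; // sorts before the first term in the lexicon
        }
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.size());
        // Scan the rest of the block; with a real file this would be a seek
        // to the block offset followed by a short sequential read.
        for (int i = start + 1; i < end; i++) {
            int cmp = sortedTerms.get(i).compareTo(term);
            if (cmp == 0) {
                return true;
            }
            if (cmp > 0) {
                return false; // passed the spot where the term would be
            }
        }
        return false;
    }
}
```

With an interval of 128, a lookup costs one binary search over roughly 1/128th of the terms plus a scan of at most 127 entries, which is the "little more searching" mentioned above.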

Is your lexicon approach going to be complete? I don't know Greek, so I don't know if you have a fixed set of roots. Also, I don't know if Greek has a notion of light stemming versus more aggressive stemming. In Arabic, we found we had better results by doing light stemming (as have other researchers.)

It appears to me, based on your research, you don't have much other choice. As for performance, your approach is much faster than your alternative :-) (i.e. doing it by hand.) Writing the stemmer seems pretty easy, so I would go for it and then test it to see if it meets your needs and, then, if you can, share it with others here.


-Grant

On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:

Hi all,
The contrib section of Lucene contains a Greek Analyzer, which however only does some letter normalization (capitals to lowercase, accent removal) and basic stop word removal.
I am interested in creating a Stemmer for the Greek Language to use with Lucene (i.e. implement it as an analyzer). The Greek Language is quite different from English (and most Latin-related languages) in that it is highly inflectional(?) - meaning that there is a large number of suffixes, many of which are not produced in a very straightforward way.

A quick internet search did not return much information - a couple of non-publicly available papers and a Master's Thesis with a javascript implementation which, however, seems to be somewhat lacking in precision (i.e. produces erroneous stems). A disappointing picture, admittedly, when for the English language it is so easy to find a public domain high quality stemmer like Porter's...
Anyway, to cut a long story short, I had the following idea in order to counter the problem of the multiple suffixes and the high inflectional nature of the language: implement the stemmer using a combination of a lexicon of stems (of the most common words) and a list of all possible suffixes. The algorithm for finding the stem of a word would be something like:

- for each suffix in the list of suffixes
   - remove the suffix from the word (if possible), producing a candidate stem
   - search the lexicon of stems for the candidate stem
       - if the search is successful return the candidate stem
       - if the search is unsuccessful go to next suffix
- if the suffixes are exhausted and no match is found, the word cannot be stemmed (return the original word)
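
The steps above could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not a finished Greek stemmer: the class name LexiconStemmer is made up, the lexicon is assumed to fit in an in-memory HashSet, and suffixes are tried in whatever order the caller supplies (the description above does not say which suffix wins when several match).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the lexicon-plus-suffix-list stemming loop. */
class LexiconStemmer {
    private final Set<String> stems;      // lexicon of known stems
    private final List<String> suffixes;  // tried in the order given

    LexiconStemmer(Set<String> stems, List<String> suffixes) {
        this.stems = new HashSet<>(stems);
        this.suffixes = new ArrayList<>(suffixes);
    }

    String stem(String word) {
        for (String suffix : suffixes) {
            // "if possible": the word must end with the suffix and leave
            // a non-empty candidate behind once the suffix is removed.
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (stems.contains(candidate)) {
                    return candidate; // successful lexicon search
                }
                // unsuccessful search: fall through to the next suffix
            }
        }
        // Suffixes exhausted without a match: the word cannot be stemmed.
        return word;
    }
}
```

For example, with the toy (non-Greek) lexicon {"walk", "talk"} and suffixes ["ing", "ed", "s"], stem("walking") returns "walk", while stem("running") returns "running" unchanged because "runn" is not in the lexicon. Each call does at most one hash lookup per suffix.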

[By the way the algorithm is inspired by a paper which describes the
implementation of a lemmatizer in a similar way - citeseer link:
http://citeseer.ist.psu.edu/694579.html]

The question is:
Is such a strategy that depends on a lexicon of predefined stems for implementing a stemmer considered a major drawback? In theory it can be (compared to an algorithm that works purely with rules, like Porter's) a drawback, but in practice, with a lexicon of a few thousand stems, the stemmer could achieve pretty good recall (and good precision too).
Other issues to comment on are the lexicon size (which will have to be embedded in or accompany the stemmer), memory issues in running the stemmer (keep the lexicon in memory?), and performance issues (multiple lookups in the lexicon could make it much slower than a rule based stemmer?). In general, any feedback would be appreciated.

Thanks in advance,

Marios Skounakis
---- Msg sent via eXis webmail

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



