Hey Marios,

It sounds like you have a reasonable plan and have thought the ideas through. The answer to many of your questions below is "it depends".

Do you have enough memory to hold the whole lexicon? Is the lexicon going to grow significantly over time? In the past (for other lexicon-based resources), when things got too large I have used tricks similar to what Lucene does with its term dictionary: store the terms in lexicographic order and load every Xth term (X being 128 or 64) into memory as an index into the file. This means a little more searching than if everything were in memory, but you don't have much choice if the lexicon is really large.
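
Roughly, that trick looks like the sketch below (an illustration only, not Lucene's actual term-dictionary code; the class name and the one-term-per-line file layout are just assumptions for the example):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a sparse in-memory index over a lexicographically sorted lexicon
 * file (one term per line): every 128th term is kept in memory with its file
 * offset, and a lookup binary-searches the samples, then scans at most 128
 * entries on disk.
 */
public class SparseLexicon {

    private static final int INTERVAL = 128;
    private final List<String> sampledTerms = new ArrayList<String>();
    private final List<Long> sampledOffsets = new ArrayList<Long>();
    private final RandomAccessFile file;

    public SparseLexicon(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long offset = 0;
        String line;
        int count = 0;
        // note: readLine() here is byte-oriented; real Greek text would need
        // proper character-encoding handling
        while ((line = file.readLine()) != null) {
            if (count % INTERVAL == 0) {          // keep every 128th term in memory
                sampledTerms.add(line);
                sampledOffsets.add(offset);
            }
            offset = file.getFilePointer();
            count++;
        }
    }

    public boolean contains(String term) throws IOException {
        if (sampledTerms.isEmpty()) {
            return false;
        }
        // binary search the in-memory samples for the last sample <= term
        int lo = 0, hi = sampledTerms.size() - 1, block = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampledTerms.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // scan at most INTERVAL entries on disk starting at that sample's offset
        file.seek(sampledOffsets.get(block));
        for (int i = 0; i < INTERVAL; i++) {
            String line = file.readLine();
            if (line == null || line.compareTo(term) > 0) {
                return false;
            }
            if (line.equals(term)) {
                return true;
            }
        }
        return false;
    }
}

With an interval of 128 you keep only about 1/128th of the terms in memory, at the cost of scanning at most 128 entries per lookup.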

Is your lexicon approach going to be complete? I don't know Greek, so I don't know if you have a fixed set of roots. I also don't know if Greek has a notion of light stemming versus more aggressive stemming. In Arabic, we found we got better results with light stemming (as have other researchers).

It appears to me, based on your research, that you don't have much other choice. As for performance, your approach is much faster than your alternative :-) (i.e. doing it by hand). Writing the stemmer seems pretty easy, so I would go for it, test it to see if it meets your needs, and then, if you can, share it with others here.

-Grant

On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:

Hi all,

The contrib section of Lucene contains a Greek Analyzer, which, however, only does some letter normalization (lowercasing, accent removal) and basic stop word removal.

I am interested in creating a stemmer for the Greek language to use with Lucene (i.e. implement it as an analyzer). Greek is quite different from English (and most Latin-derived languages) in that it is highly inflectional: there is a large number of suffixes, many of which are not produced in a very straightforward way.
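
On the Lucene side, the stemmer would presumably end up as a TokenFilter that an Analyzer chains after its tokenizer. A minimal sketch against the 2.x TokenStream API might look like the following, where the nested Stemmer interface is just a placeholder for whatever stemming class comes out of this:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Sketch only: wraps a (yet to be written) Greek stemmer as a Lucene TokenFilter. */
public class GreekStemFilter extends TokenFilter {

    /** Placeholder for the stemmer described below. */
    public interface Stemmer {
        String stem(String word);
    }

    private final Stemmer stemmer;

    public GreekStemFilter(TokenStream in, Stemmer stemmer) {
        super(in);
        this.stemmer = stemmer;
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        // replace the token's text with its stem, keeping the original offsets
        return new Token(stemmer.stem(t.termText()), t.startOffset(), t.endOffset());
    }
}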

A quick internet search did not return much information: a couple of non-publicly available papers and a Master's thesis with a JavaScript implementation which, however, seems to be somewhat lacking in precision (i.e. produces erroneous stems). A disappointing picture, admittedly, when for English it is so easy to find a public-domain, high-quality stemmer like Porter's...

Anyway, to cut a long story short, I had the following idea to deal with the many suffixes and the highly inflectional nature of the language: implement the stemmer using a combination of a lexicon of stems (of the most common words) and a list of all possible suffixes. The algorithm for finding the stem of a word would be something like:

- for each suffix in the list of suffixes:
   - remove the suffix from the word (if possible), producing a candidate stem
   - search the lexicon of stems for the candidate stem
      - if the search is successful, return the candidate stem
      - if the search is unsuccessful, go to the next suffix
- if the suffixes are exhausted and no match is found, the word cannot be stemmed
  (return the original word); a rough Java sketch of this loop follows below

[By the way, the algorithm is inspired by a paper which describes the
implementation of a lemmatizer in a similar way - CiteSeer link:
http://citeseer.ist.psu.edu/694579.html]
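
To make the loop above concrete, here is a minimal Java sketch of the idea (the suffix list and stem lexicon are placeholders to be filled in; a real version would load a few thousand stems and a full suffix inventory):

import java.util.List;
import java.util.Set;

/** Minimal sketch of the lexicon + suffix-list stemming loop described above. */
public class GreekLexiconStemmer {

    // suffixes should be ordered longest-first, so that longer endings are
    // tried before shorter ones that happen to be their tails
    private final List<String> suffixes;
    private final Set<String> stemLexicon;   // e.g. a HashSet, if the lexicon fits in memory

    public GreekLexiconStemmer(List<String> suffixes, Set<String> stemLexicon) {
        this.suffixes = suffixes;
        this.stemLexicon = stemLexicon;
    }

    public String stem(String word) {
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() > suffix.length()) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (stemLexicon.contains(candidate)) {
                    return candidate;      // first suffix whose removal yields a known stem
                }
            }
        }
        return word;                       // suffixes exhausted: return the word unstemmed
    }
}

With the lexicon in a hash set, each candidate check is a single constant-time lookup, so even trying every suffix for a word stays cheap.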

The question is:

Is a strategy that depends on a lexicon of predefined stems considered a major drawback for a stemmer? In theory it can be a drawback (compared to an algorithm that works purely with rules, like Porter's), but in practice, with a lexicon of a few thousand stems, the stemmer could achieve pretty good recall (and good precision too).

Other issues to comment on are the lexicon size (the lexicon will have to be embedded in or accompany the stemmer), memory usage when running the stemmer (keep the lexicon in memory?), and performance (could multiple lookups in the lexicon make it much slower than a rule-based stemmer?). In general, any feedback would be appreciated.

Thanks in advance,

Marios Skounakis


--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



