Hey Marios,
It sounds like you have a reasonable plan and you have thought
through the ideas. And the answer to many of your questions below is
"it depends".
Do you have enough memory to hold the whole lexicon in memory? Is
this lexicon going to grow significantly over time? I have, in the
past (for other lexicon based resources), done similar tricks to what
Lucene does with the term dictionary if things get too large.
Namely, store the terms in lexicographic order and load every X
number of terms in memory (X is 128 or 64) as an index into the
file. This causes a little more searching than if everything is in
memory, but you don't have a choice if the lexicon is really large.
Is your lexicon approach going to be complete? I don't know Greek,
so I don't know if you have a fixed set of roots. Also, I don't know
if Greek has a notion of light stemming versus more aggressive
stemming. In Arabic, we found we had better results by doing light
stemming (as have other researchers.)
It appears to me, based on your research, you don't have much other
choice. As for performance, you approach is much faster than your
alternative :-) (i.e. doing it by hand.) Writing the stemmer seems
pretty easy, so I would go for it and then test it to see if it meets
your needs and, then, if you can, share it with others here.
-Grant
On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote:
Hi all,
The contrib section of Lucene contains a Greek Analyzer, which
however only does
some letter normalization (capitals to lowercase, accent removal)
and basic stop
word removal.
I am interested in creating a Stemmer for the Greek Language to use
with Lucene
(i.e. implement it as an analyzer). The Greek Language is quite
different from
English (and most latin-related languages) in that it is highly
inflectional(?) -
meaning that there is a large number of suffixes, many of which are
not produced
in a very straightforward way.
A quick internet search did not return much information - a couple of
non-publicly available papers and a Master's Thesis with a javascript
implementation which, however, seems to be somewhat lacking in
precision (i.e.
produces erroneous stems). A disappointing picture, admittedly,
when for the
english language it is so easy to find a public domain high quality
stemmer like
Porter's...
Anyway, to cut a long story short, I had the following idea in
order to counter
the problem of the multiple suffixes and the high inflectional
nature of the
language: implement the stemmer using a combination of a lexicon of
stems (of the
most common words) and a list of all possible suffixes. The
algorithm for finding
the stem of a word would be something like:
- for each suffix in the list of suffixes
- remove the suffix from the word (if possible), producing a
candindate stem
- search the lexicon of stems for the candidate stem
- if the search is succesful return the candidate stem
- if the search is unsuccesful go to next suffix
- if the suffixes are exhausted and no match is found, the word
cannot be stemmed
(return the original word)
[By the way the algorithm is inspired by a paper which descripbes the
implementation of a lemmatizer in a similar way - citeseer link:
http://citeseer.ist.psu.edu/694579.html]
The question is:
Is such a strategy that depends on a leixcon of predefined stems
for implementing
a stemmer considered a major drawback? In theory it can be
(compared to an
algorith that works purely with rules, like Porter's) a drawback,
but in
practice, with a lexicon of a few thousand stems, the stemmer could
achieve
pretty good recall (and good precision too).
Other issues to comment on are the lexicon size (which will have to
be embedded
in or accompany the stemmer), memory issues in running the stemmer
(keep the
lexicon in memory?), and performance issues (multiple lookups in
the lexicon
could make it much slower than a rule based stemmer?). In general,
any feedback
would be appreciated.
Thanks in advance,
Marios Skounakis
---- Msg sent via eXis webmail
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]