Improved(?) Swedish snowball stemmer
------------------------------------
Key: LUCENE-1515
URL: https://issues.apache.org/jira/browse/LUCENE-1515
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin
The Snowball stemmer for Swedish does not strip suffixes related to '-an' and
'-ans', so related word forms end up with incompatible stems, for example
"klocka", "klockor", "klockornas", "klockAN", "klockANS". Complete list of new
suffix-stripping rules:
{pre}
'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna'
'ansernas'
'iera'
(delete)
{pre}
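To make the effect of the rule list concrete, here is a minimal Python sketch of longest-match suffix deletion. It is not the Snowball code from the patch: the function name `strip_an_suffix` is hypothetical, and the sketch ignores the region checks a real Snowball stemmer would apply.

```python
# Suffixes copied from the rule list in this issue; everything else is illustrative.
AN_SUFFIXES = [
    "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
    "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna",
    "ansernas",
    "iera",
]


def strip_an_suffix(word: str) -> str:
    """Delete the longest listed suffix, keeping at least one character of stem.

    Assumption: plain longest-match deletion with no exceptions, which is
    simpler than what the patch actually does.
    """
    lowered = word.lower()
    # Try longer suffixes first so "ans" wins over "an" when both could apply.
    for suffix in sorted(AN_SUFFIXES, key=len, reverse=True):
        stem = lowered[: -len(suffix)]
        if lowered.endswith(suffix) and stem:
            return stem
    return lowered


for form in ("klockAN", "klockANS", "finans"):
    print(form, "->", strip_an_suffix(form))
```

With this naive version "klockAN" and "klockANS" both reduce to "klock", but "finans" is also cut down to "fin".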
The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny), and
this is an attempt at solving that problem. The rules and exceptions are based
on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries
suffixed with 'an' and 'ans'. There are a few known problematic stemming rules,
but it seems to work quite a bit better than the current SwedishStemmer. It
would not be a bad idea to check all SAOL entries to verify the integrity of
the rules.
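One simple way to picture the exception handling is a protected-word check that runs before any suffix is removed. The Python sketch below is an assumption about the general approach, not a transcription of the patch: `EXCEPTIONS` holds only the three words quoted above, `stem_with_exceptions` is a hypothetical name, and only the two base suffixes are handled to keep it short.

```python
# Exception words quoted in this issue; a full list would be derived from SAOL.
EXCEPTIONS = {"svans", "finans", "nyans"}

# Only the two base suffixes, longest first, to keep the sketch short.
BASE_SUFFIXES = ("ans", "an")


def stem_with_exceptions(word: str) -> str:
    """Strip '-ans' or '-an' unless the word is a listed exception."""
    lowered = word.lower()
    if lowered in EXCEPTIONS:
        # Assumption: protected words are returned unchanged.
        return lowered
    for suffix in BASE_SUFFIXES:
        # Require a non-empty stem so the bare suffix is never reduced to "".
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return lowered[: -len(suffix)]
    return lowered


for form in ("klockan", "klockans", "finans", "nyans"):
    print(form, "->", stem_with_exceptions(form))
```

A lookup like this is easy to audit against a word list, which fits the suggestion to check the rules against all SAOL entries.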
My Snowball syntax skills are rather limited, so I'm certain the code could be
optimized quite a bit.
*The code is released under BSD and not ASL*. I've posted a bit in the
Snowball forum and privately to Martin Porter himself but never got any
response, so now I'm posting it here instead in the hope of gaining some
momentum.