[lingu-dev] Questions to implement an hunspell stemmer in Java

Frédéric Glorieux Mon, 22 Nov 2010 16:56:32 -0800


Hi all,


Sorry for cross-posting, if anyone has already read this message on the
users list.

I work on European medieval languages (Latin for now, and soon Old
French and Occitan). Searching in texts requires stemming. I have
implemented a first working prototype of a Java stemmer that uses
hunspell dic and aff files, on medieval Latin (the code will of course
be open source). We started from the dict-la extension project
http://extensions.services.openoffice.org/project/dict-la (thanks a
lot). I have two questions that I would be glad to solve your way, to be
sure that the lexical resources developed in our context can also be
used with hunspell. Sadly, I cannot read C well, so I have to ask my
questions in (bad) English.

A first problem with medieval languages is that they have no fixed
orthography. For example, “philosophia”, “fylosofya”, “phylozofia” and
all the possible combinations are valid spellings, because these are the
spellings found in the manuscripts. Latin stems~lemmas in la.dic should,
where possible, keep their classical spelling, with "ph" for words
coming from Greek (philosophia) but "f" for others (faber), and the same
for "y" (gymnasium, icon) and others. If I understood hunspell
correctly, "ph f" or "y i" should therefore not be ICONV rules (unlike
"æ ae, è e, ę e...").

I tried the idea of REP rules for a while, but I was afraid of all the
possible combinations (y i, i y, z s, s z, ph f, f ph). In a
spellchecker it is not a critical problem if the right word is not
suggested, or if it takes time, but for a stemmer too many lookups are
expensive. So I implemented a kind of PHONE rules. The code works, but I
am not really proud of what I have done. First, I have not really
understood the aspell syntax, which feels like something from a
pre-regex era, like Porter's Snowball, so I concluded that I would not
be able to explain it to linguists. For now, to stay compatible with
hunspell, I only use simple substitutions (like REP rules) such as
"ph f", which are sometimes verbose (bb b, cc c, dd d...).

The implementation is also a problem. How should the rules be applied?
I chose the way that is easiest to understand for the rule writer: a
sequence, a program. Real example: 1) ph f, 2) ch k, 3) h _ (strip 'h'
after the 'ph' and 'ch' resolution). What should be done with a PHONE
result? For now, I maintain a map of the dic file with the phone
reduction as key and the stems~lemmas as values. Should I apply the
phone rules to the affixes? I confess that I simply added the needed
affixes (e.g. (ros)-ae = (ros)-e), which was faster than coding it. Any
advice is welcome on the best way to keep linguistic knowledge about
medieval Latin in hunspell syntax.
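
To make the question concrete, here is a minimal Java sketch of the
approach described above: an ordered sequence of plain substitutions,
and a map from the phone reduction to the lemmas of the dic file. It is
an illustration, not the prototype's code. The class name PhoneIndex and
its methods are hypothetical, and adding "y i" and "z s" to the sequence
after the three rules of the real example is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: sequential phone rules plus a key -> lemmas map. */
public class PhoneIndex {
    // Ordered rules: each one is applied to the output of the previous one.
    private final Map<String, String> rules = new LinkedHashMap<>();
    // Phone reduction -> lemmas of the dic file that share this reduction.
    private final Map<String, List<String>> index = new HashMap<>();

    /** Appends a substitution rule; an empty replacement strips the pattern. */
    public PhoneIndex rule(String from, String to) {
        rules.put(from, to);
        return this;
    }

    /** Applies all rules in order, like a small program. */
    public String reduce(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> r : rules.entrySet()) {
            key = key.replace(r.getKey(), r.getValue());
        }
        return key;
    }

    /** Registers a lemma (in its classical spelling) under its reduction. */
    public void add(String lemma) {
        index.computeIfAbsent(reduce(lemma), k -> new ArrayList<>()).add(lemma);
    }

    /** Returns the lemmas whose reduction equals that of the given form. */
    public List<String> lookup(String form) {
        return index.getOrDefault(reduce(form), Collections.emptyList());
    }

    public static void main(String[] args) {
        PhoneIndex idx = new PhoneIndex()
                .rule("ph", "f").rule("ch", "k").rule("h", "") // rules from the example
                .rule("y", "i").rule("z", "s");                // assumed extra rules
        idx.add("philosophia");
        idx.add("faber");
        System.out.println(idx.lookup("fylosofya"));  // prints [philosophia]
        System.out.println(idx.lookup("phylozofia")); // prints [philosophia]
    }
}
```

With this design one lookup per word is enough, whatever the number of
spelling variants, which is the reason for preferring it over trying
every REP combination.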

Second problem: irregular verbs. As in English (write, wrote, written),
Latin (classical or medieval) has many (~3500) irregular verbs (e.g.
concedo, concessi, concessum). In the dic files I was able to
understand (in fact, English and Latin), the solution was to open a
separate dic entry for the irregular verbal radical. That is surely
perfect for a spellchecker, but a big problem for stemming (searching
for "concedo" will not find "concessimus", because this form is stemmed
as "concessi"). The documentation seems to promote another approach,
the optional data fields:
 sing al:sang al:sung
 sang st:sing
 sung st:sing
The English affix files do not seem to follow this syntax yet. Is it too
early to use it? What could break? For very irregular conjugations
(e.g. la:sum, fr:être), the common solution seems to be to open a dic
line for each form. But in Latin, a verb like sum appears in different
compounds with very different meanings. It is not a good idea to reduce
"presentes" to the stem "sum" through a "prae" (or "pre") prefix rule.
A better approach seems to be to keep the complete conjugation of "sum"
in affix rules. But is it still a hunspell limitation that a rule cannot
strip the stem completely? (e.g. "sum", "erat"; "sum/." "SFX . sum erat
sum").
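
For the optional data fields, a stemmer could read the "st:" field as
in the following Java sketch. It only handles the three example lines
quoted above; the class name StemFields and the method parse are
hypothetical, and real dic files have more cases (entry count on the
first line, other field types) that are ignored here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: map each dic entry to its stem using "st:" fields. */
public class StemFields {

    /** Entries without an "st:" field are taken as their own stem. */
    public static Map<String, String> parse(Iterable<String> dicLines) {
        Map<String, String> stems = new HashMap<>();
        for (String line : dicLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields[0].isEmpty()) {
                continue; // blank line
            }
            String word = fields[0];
            int slash = word.indexOf('/');
            if (slash > 0) {
                word = word.substring(0, slash); // drop affix flags, e.g. "sum/."
            }
            String stem = word;
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].startsWith("st:")) {
                    stem = fields[i].substring(3);
                }
            }
            stems.put(word, stem);
        }
        return stems;
    }

    public static void main(String[] args) {
        Map<String, String> stems = parse(Arrays.asList(
                "sing al:sang al:sung",
                "sang st:sing",
                "sung st:sing"));
        System.out.println(stems.get("sang")); // prints sing
        System.out.println(stems.get("sing")); // prints sing
    }
}
```

With such a map, a search for the stem can also retrieve the irregular
forms, without one affix rule per exception.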

Sorry for such a long and dense message; your patience is rewarded with
a small demo:

http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=gratia
gratia finds gratiam (an inflection rule) but also graciam (a phone rule)
http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=dico
dico also finds dictum or dixerunt (st: optional field).
The idea came from this project, http://code.google.com/p/lucene-hunspell/,
but the code is written from scratch and is less Lucene-centric.

Thanks in advance for any advice; I would be glad not to build on sand.

--
Frédéric Glorieux

