Hi all,
Sorry for cross-posting, in case someone has already read this message on the users list.

I work on European medieval languages (Latin for now, and soon Old French and Occitan). Searching in such texts requires stemming. I have implemented a first working prototype of a Java stemmer that uses hunspell dic and aff files, on medieval Latin (the code will of course be open source). We started from the dict-la extension project, http://extensions.services.openoffice.org/project/dict-la (thanks a lot).

I have two questions that I would like to solve your way, to be sure that the lexical resources developed in our context can also be used with hunspell. Sadly, I do not read C well, so I have to ask my questions in (bad) English.

The first problem with medieval languages is that they have no fixed orthography. For example, “philosophia”, “fylosofya”, “phylozofia” and all the other possible combinations are valid spellings, because they are the spellings found in the manuscripts. Latin stems (lemmas) in la.dic should, where possible, keep their classical spelling: "ph" for words coming from Greek (philosophia) but "f" for others (faber), and likewise for "y" (gymnasium, icon) and other letters. If I understand hunspell correctly, "ph f" or "y i" should therefore not be ICONV rules (unlike "æ ae, è e, ę e...").

I tried REP rules for a while, but I was afraid of all the possible combinations (y i, i y, z s, s z, ph f, f ph). In a spellchecker it is not critical if the right word is not suggested, or if the suggestion takes time, but for a stemmer too many lookups are expensive. So I implemented a kind of PHONE rules. The code works, but I am not really proud of what I have done. First, I have not really understood the aspell syntax, which looks like something from a pre-regex era, like Porter's Snowball, so I concluded that I would not be able to explain it to linguists. For now, to stay compatible with hunspell, I only use simple substitutions (like REP rules) such as "ph f", sometimes verbose (bb b, cc c, dd d...). The implementation is also a problem.
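To make the simple-substitution idea concrete, here is a minimal Java sketch. It is an illustration only, not the prototype's code: the class name PhoneKey, the method reduce and the three-rule list are assumptions chosen for the example, and the rules are simply applied in order to the whole word.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class PhoneKey {
    // Ordered substitutions (hypothetical list, not taken from a real la.aff).
    // LinkedHashMap keeps insertion order, so rules run in the order written.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ph", "f");
        RULES.put("y", "i");
        RULES.put("z", "s");
    }

    // Reduces a spelling variant to a normalized key by applying each rule in turn.
    static String reduce(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            key = key.replace(rule.getKey(), rule.getValue());
        }
        return key;
    }

    public static void main(String[] args) {
        // The three spellings quoted above all reduce to the same key.
        for (String w : new String[] {"philosophia", "fylosofya", "phylozofia"}) {
            System.out.println(w + " -> " + reduce(w));
        }
    }
}
```

With these three rules, each of the spellings quoted above reduces to "filosofia", so one lookup per word is enough, instead of one lookup per REP combination.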
How should the rules be applied? I chose the way that is easiest to understand for the rule writer: a sequence, a program. A real example: 1) ph f, 2) ch k, 3) h _ (strip the remaining 'h' once 'ph' and 'ch' have been resolved). What should be done with a PHONE result? For now, I maintain a map of the dic file, with the phone reduction as key and the stems (lemmas) as values. Should I also apply the phone rules to the affixes? I must confess that I simply added the needed affixes by hand (ex: (ros)-ae = (ros)-e), which was faster than coding it. Any advice is welcome on the best way to keep linguistic knowledge about medieval Latin in hunspell syntax.

Second problem: irregular verbs. Like English (write, wrote, written), Latin (classical or medieval) has a lot (~3500) of irregular verbs (ex: concedo, concessi, concessum). In the dic files I was able to understand (in fact, English and Latin), the solution was to open a separate dic entry for the irregular verbal radical. That is surely perfect for a spellchecker, but it is a big problem for stemming: searching for "concedo" will not find "concessimus", because this form is stemmed as "concessi". The documentation seems to promote another approach, the optional data fields:

    sing al:sang al:sung
    sang st:sing
    sung st:sing

The English affix files do not seem to follow this syntax yet. Is it too early to use it? What could break?

For very irregular conjugations (ex: la:sum, fr:être), the common solution seems to be one dic line per form. But in Latin, a verb like sum appears in different compounds with very different meanings. It is not a good idea to reduce "presentes" to the stem "sum" through a "prae" (or "pre") prefix rule. A better approach seems to be to keep the complete conjugation of "sum" in affix rules. But is it still a hunspell limitation that a stem cannot be stripped completely? (ex: "sum", "erat"; "sum/." "SFX . sum erat sum").
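The map described above (phone reduction as key, lemmas as values) can be combined with the st: field roughly as follows. This is a hedged sketch under stated assumptions, not the prototype: StemIndex, phoneKey, addDicLine and lemmas are hypothetical names, the dic lines are invented for the example, and affix stripping is left out entirely, so only forms listed in the dic are found.

```java
import java.util.*;

class StemIndex {
    // Ordered phone rules from the example: ph -> f, ch -> k, then drop any remaining h.
    private static final String[][] RULES = { {"ph", "f"}, {"ch", "k"}, {"h", ""} };

    // Phone key -> lemmas sharing that key.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Applies the rules as a sequence, each one to the output of the previous one.
    static String phoneKey(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        for (String[] rule : RULES) {
            key = key.replace(rule[0], rule[1]);
        }
        return key;
    }

    // Indexes one dic line, e.g. "concessi st:concedo" or "charta/X".
    // The lemma is the st: value when present, otherwise the entry itself.
    void addDicLine(String line) {
        String[] fields = line.trim().split("\\s+");
        String word = fields[0];
        int slash = word.indexOf('/');
        if (slash >= 0) {
            word = word.substring(0, slash); // drop affix flags
        }
        String lemma = word;
        for (int i = 1; i < fields.length; i++) {
            if (fields[i].startsWith("st:")) {
                lemma = fields[i].substring(3);
            }
        }
        index.computeIfAbsent(phoneKey(word), k -> new TreeSet<>()).add(lemma);
    }

    // Returns the lemmas whose dic entry has the same phone key as the given form.
    Set<String> lemmas(String form) {
        return index.getOrDefault(phoneKey(form), Collections.emptySet());
    }

    public static void main(String[] args) {
        StemIndex idx = new StemIndex();
        idx.addDicLine("concedo");
        idx.addDicLine("concessi st:concedo");
        idx.addDicLine("charta");
        System.out.println(idx.lemmas("concessi")); // lemma taken from st:
        System.out.println(idx.lemmas("karta"));    // same phone key as "charta"
    }
}
```

In this sketch a form costs one reduction and one map lookup, whatever the number of rules. Whether the same reduction should also be applied to affixes is left open here, as in the question above.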
Sorry for such a long and dense message; your patience is rewarded with a little demo:

http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=gratia
gratia finds gratiam (an inflection rule) but also graciam (a phone rule)

http://elec.enc.sorbonne.fr/tomcat55/cartulaires/select/?q=dico
dico also finds dictum or dixerunt (st: optional field)

The idea came from this project, http://code.google.com/p/lucene-hunspell/, but the code is written from scratch and is less Lucene-centric.

Thanks in advance for any advice; I would be glad not to build on sand.

-- 
Frédéric Glorieux