This is a follow-up to my previous announcement with some notes on the Lingala hunspell package.

Lingala is a Bantu language and as such has a very complicated verbal morphology. This complexity has made it difficult to develop open source spell checkers for other Bantu languages - the existing packages are simple word lists that don't even attempt to crack the verb system. Only the Swahili package is large enough to provide decent coverage of everyday texts, but it could still be improved.

The hunspell-ln package is the first attempt I know of to handle Bantu verbal morphology completely in an affix file. In addition, Lingala is a tone language and has vowel harmony marked orthographically - both of these features are handled correctly in the affix file as well.

With all of this in mind I'd encourage anyone interested in Bantu languages to have a look at the "developer's pack" for hunspell-ln in the SVN repository here:

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/

It's best to start with the README-dev file, which describes the different files in the developer's pack, and some useful makefile targets for dictionary maintenance and development.

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/README-dev?view=markup

Noun classes are stored in the files, nc*, and these get assigned appropriate affix flags which generate the correct plurals. This much is straightforward.

Verbs in Lingala are formed from a "radical", e.g. "bák", to which various optional semantic adjuncts can be added in more-or-less predictable ways, "bákis", "bákisam", "bákisel", etc. To these are added obligatory prefixes and suffixes indicating personal pronouns and tense.

After a lot of experimentation, the best solution for spell checking seems to be to store the radicals+adjuncts as words in the .dic file, and add the prefixes and suffixes using the affix file. There are many reasons for this choice - among them the fact that this simplifies the necessary affixes to within the scope of what hunspell can handle.
Also, it is very difficult to predict which semantic adjuncts "work" with which radicals, and in which order and in which combinations. This all depends on semantics, and so (in my view) is best left to lexicography rather than automatic generation. And in reality, there aren't that many of these combinations used in everyday Lingala. The 900 or so in the first release account for at least 90% of the verb forms in the web corpus (up to diacritic differences).

In any case, the combined radicals+adjuncts are stored in the (poorly-named) "radicals.txt" in the repository. A perl script adds the correct affix flags (two for each word in radicals.txt - "W" for high tone suffixes and "V" for low tone suffixes) and enters the words in ln_CD.dic.

The affix file uses some special features of hunspell to handle these words. First, the words from radicals.txt get, additionally, a "Z" flag that is marked as "NEEDAFFIX" in the affix file - this is because the tense/pronoun affixes are required; the radicals themselves are usually not valid words. Then, the required prefix/suffix pair is treated as a circumfix X (see the "CIRCUMFIX" declaration in the affix file) - so a typical suffix (here a subset of "V" for the habitual tense) is implemented as follows:

    SFX V Y 63
    ...
    SFX V 0 aka/PX . +HABITUAL1
    SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
    ...
    PFX P Y 8
    PFX P 0 na/X .
    PFX P 0 o/X .
    PFX P 0 a/X .
    PFX P 0 e/X .
    PFX P 0 to/X .
    PFX P 0 bo/X .
    PFX P 0 ba/X .
    PFX P 0 i/X .

In the .dic file, "bikol/ZV" would become "bikolaka/PX" and then "nabikolaka", "obikolaka", etc.
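
To make the expansion concrete, here is a minimal Python sketch (not part of the hunspell-ln package) of how the quoted rules combine. It covers only the subset of the "V" and "P" rules shown above, treats each condition as a pattern anchored at the end of the stem, and always attaches a prefix together with a suffix, which is the effect of the NEEDAFFIX and CIRCUMFIX flags. The names `habitual_rules` and `expand` are hypothetical, and the NFD normalization is an assumption made so that tone marks are always separate combining characters.

```python
import re
import unicodedata

ACUTE = "\u0301"                      # combining acute accent (high tone mark)
OPEN_E, OPEN_O = "\u025b", "\u0254"   # the open vowels written as ɛ and ɔ
PLAIN_VOWELS = "aeiou" + OPEN_E + OPEN_O
# Same role as the negated vowel class in the quoted rules, written for NFD text.
NON_VOWEL = "[^" + PLAIN_VOWELS + ACUTE + "]"

# The eight subject prefixes from the quoted "PFX P" block.
PREFIX_P = ["na", "o", "a", "e", "to", "bo", "ba", "i"]


def habitual_rules():
    """Return (suffix, condition) pairs for the quoted subset of class V."""
    rules = [("aka", ".")]            # default suffix, condition "." matches any stem
    for vowel in (OPEN_E, OPEN_O):
        # One rule each for 1, 2 or 3 non-vowel characters after the open vowel.
        for n_following in (1, 2, 3):
            condition = "[" + vowel + ACUTE + "]" + NON_VOWEL * n_following
            rules.append((vowel + "k" + vowel, condition))
    return rules


def expand(stem):
    """List every prefix+stem+suffix form licensed by the quoted rules."""
    stem = unicodedata.normalize("NFD", stem)
    forms = []
    for suffix, condition in habitual_rules():
        # Suffix conditions are checked against the end of the stem.
        if re.search("(?:" + condition + ")$", stem):
            for prefix in PREFIX_P:
                # A suffix is only valid with a prefix, so the pair is
                # always added together and the bare stem is never emitted.
                forms.append(prefix + stem + suffix)
    return forms


print(expand("bikol")[:2])  # ['nabikolaka', 'obikolaka']
```

Like hunspell, the sketch applies every rule whose condition matches. Since the quoted "aka" rule has the condition ".", the sketch also yields "aka" forms for stems such as "bɔtɔl"; whether the full affix file restricts this is not visible in the excerpt.
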
The complicated-looking cases in the V suffix handle vowel harmony - "bɔtɔl/ZV" would become "bɔtɔlɔkɔ/PX", etc. Everything appears to work nicely.

To this point I've only looked at morphology of three Bantu languages in any deep way - Lingala, Kinyarwanda, and Swahili, but my naive hope is that this approach could serve as a model for developing hunspell packages for other Bantu languages. The top candidates (based on having found a sufficient amount of text on the web with a crawler) would be: Kikongo, Kikuyu, Luganda, Ndebele (nd/nr), Ndonga, Northern Sotho, Nyanja/Chichewa, Rundi, Kinyarwanda, Swati, Sesotho, Swahili, Setswana, Tsonga, Venda, Xhosa, and Zulu.

Comments, questions, suggestions appreciated.

Kevin