[lingu-dev] Notes on hunspell-ln

Kevin Scannell Tue, 07 Apr 2009 21:10:57 -0700

This is a follow-up to my previous announcement with some notes on the
Lingala hunspell package.

Lingala is a Bantu language and as such has a very complicated verbal
morphology. This complexity has made it difficult to develop open
source spell checkers for other Bantu languages - the existing
packages are simple word lists that don't even attempt to crack the
verb system. Only the Swahili package is large enough to provide
decent coverage of everyday texts, but it could still be improved.

The hunspell-ln package is the first attempt I know of to handle Bantu
verbal morphology completely in an affix file. In addition, Lingala
is a tone language and has vowel harmony marked orthographically -
both of these features are handled correctly in the affix file as
well. With all of this in mind I'd encourage anyone interested in
Bantu languages to have a look at the "developer's pack" for
hunspell-ln in the SVN repository here:

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/

It's best to start with the README-dev file, which describes the
different files in the developer's pack, and some useful makefile
targets for dictionary maintenance and development.

http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/README-dev?view=markup

Noun classes are stored in the nc* files, and these get assigned
appropriate affix flags, which generate the correct plurals. This much
is straightforward.

Verbs in Lingala are formed from a "radical", e.g. "bák", to which
various optional semantic adjuncts can be added in more-or-less
predictable ways, "bákis", "bákisam", "bákisel", etc. To these are
added obligatory prefixes and suffixes indicating personal pronouns
and tense. After a lot of experimentation, the best solution for
spell checking seems to be to store the radicals+adjuncts as words in
the .dic file, and add the prefixes and suffixes using the affix file.
There are many reasons for this choice - among them the fact that
this simplifies the necessary affixes to within the scope of what
hunspell can handle. Also, it is very difficult to predict which
semantic adjuncts "work" with which radicals, and in which order and
in which combinations. This all depends on semantics, and so (in my
view) is best left to lexicography rather than automatic generation. And in
reality, there aren't that many of these combinations used in everyday
Lingala. The 900 or so in the first release account for at least 90%
of the verb forms in the web corpus (up to diacritic differences).

In any case, the combined radicals+adjuncts are stored in the
(poorly named) "radicals.txt" in the repository. A Perl script adds
the correct affix flags (two for each word in radicals.txt - "W" for
high tone suffixes and "V" for low tone suffixes) and enters the words
in ln_CD.dic.
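
As a rough illustration of this step (not the actual Perl script, which
is in the repository), a minimal Python sketch might look like the
following. The function name dic_lines is made up for illustration, and
the sketch assumes single-character flags, one radical per line of
input, and the "Z" flag discussed next; the real ln_CD.dic also
contains the nouns and other words, which are ignored here.

```python
def dic_lines(radicals):
    """Build hunspell .dic content for the verb radicals only.

    Hypothetical helper: returns the entry-count line followed by one
    "word/FLAGS" line per distinct, non-empty radical.
    """
    entries = sorted({r.strip() for r in radicals if r.strip()})
    # Z = NEEDAFFIX marker, V = low tone suffixes, W = high tone suffixes
    return [str(len(entries))] + [word + "/ZVW" for word in entries]


if __name__ == "__main__":
    for line in dic_lines(["bák", "bákis", "bákisam", "bákisel"]):
        print(line)
```

The first line of a .dic file is the (approximate) number of entries,
which is why the count is emitted before the words.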

The affix file uses some special features of hunspell to handle these
words. First, the words from radicals.txt additionally get a "Z" flag
that is declared as "NEEDAFFIX" in the affix file. This is because the
tense/pronoun affixes are required - the radicals themselves are
usually not valid words. Then, the required prefix/suffix pair is
treated as a circumfix, flagged "X" (see the "CIRCUMFIX" declaration in
the affix file). A typical suffix (here a subset of "V" for the
habitual tense) is implemented as follows:

SFX V Y 63
...
SFX V 0 aka/PX . +HABITUAL1
SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
SFX V 0 ɛkɛ/PX [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
SFX V 0 ɔkɔ/PX [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́] +HABITUAL1
...

PFX P Y 8
PFX P 0 na/X .
PFX P 0 o/X .
PFX P 0 a/X .
PFX P 0 e/X .
PFX P 0 to/X .
PFX P 0 bo/X .
PFX P 0 ba/X .
PFX P 0 i/X .

Given the .dic entry "bikol/ZV", the V suffix yields "bikolaka/PX",
and the P prefixes then yield "nabikolaka", "obikolaka", etc. The
complicated-looking cases in the V suffix handle vowel harmony -
"bɔtɔl/ZV" would become "bɔtɔlɔkɔ/PX", etc.
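
To make the rule matching concrete, here is a small Python sketch that
mimics how the quoted habitual rules and P prefixes combine. It is only
an approximation of hunspell's behaviour under stated assumptions: the
names habitual_stems and habitual_forms are invented for illustration,
only the rules quoted above are modelled (the elided ones are not), and
each condition is treated as a regular expression anchored at the end
of the radical. Because every matching rule is applied, the catch-all
"." rule also fires for harmonizing radicals in this sketch.

```python
import re

# Negated class from the quoted rules: anything that is not a vowel
# (or the combining acute accent carried by the high-tone vowels).
NOT_VOWEL = "[^aáeéiíoóuúɛɛ́ɔɔ́]"

# (suffix text, condition on the end of the radical), following the
# habitual-tense subset of the V suffix class quoted above.
HABITUAL_RULES = [("aka", ".")]
for vowel_class, suffix in (("[ɛɛ́]", "ɛkɛ"), ("[ɔɔ́]", "ɔkɔ")):
    for n_consonants in (1, 2, 3):
        HABITUAL_RULES.append((suffix, vowel_class + NOT_VOWEL * n_consonants))

# The eight P prefixes quoted above.
PREFIXES = ["na", "o", "a", "e", "to", "bo", "ba", "i"]


def habitual_stems(radical):
    """Radical + suffix for every rule whose condition matches the radical's end."""
    return [radical + suffix
            for suffix, cond in HABITUAL_RULES
            if re.search("(?:" + cond + ")$", radical)]


def habitual_forms(radical):
    """Full words: the circumfix requires a P prefix on every suffixed stem."""
    return [prefix + stem
            for stem in habitual_stems(radical)
            for prefix in PREFIXES]


if __name__ == "__main__":
    print(habitual_forms("bikol"))
    print(habitual_stems("bɔtɔl"))
```

Only the prefixed forms are returned, reflecting the NEEDAFFIX and
CIRCUMFIX flags: neither the bare radical nor the suffixed stem on its
own is meant to be accepted as a word.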

Everything appears to work nicely. So far I've only looked at the
morphology of three Bantu languages in any deep way - Lingala,
Kinyarwanda, and Swahili, but my naive hope is that this approach
could serve as a model for developing hunspell packages for other
Bantu languages. The top candidates (based on having found a
sufficient amount of text on the web with a crawler) would be:
Kikongo, Kikuyu, Luganda, Ndebele (nd/nr), Ndonga, Northern Sotho,
Nyanja/Chichewa, Rundi, Kinyarwanda, Swati, Sesotho, Swahili,
Setswana, Tsonga, Venda, Xhosa, and Zulu.

Comments, questions, suggestions appreciated.

Kevin
