Re: [lingu-dev] Notes on hunspell-ln

Németh László Thu, 09 Apr 2009 10:21:20 -0700

Hi,

I'm very glad to hear your success story about handling complex
morphology. There are tools for simpler morphologies (I will post a
letter about the automatic compression of 900 thousand Breton words next
week), but we cannot list and compress all the possible words of a
language with complex morphology. I plan to develop statistical
versions of the affixcompress tools, but it would be good to support
dictionary development based on formal descriptions, too. Any automated
method has a big advantage (for example, the source of the Hungarian
spelling dictionary uses (undocumented) awk scripts and M4 macros),
but the best tool would be a (restricted) compiler that supports
morphological descriptions based on two-level morphology for Hunspell
dictionaries (see also
http://www.lrec-conf.org/proceedings/lrec2008/pdf/274_paper.pdf), or a
similar generalized morphology description language.


Regards,
László



2009/4/8 Kevin Scannell <ksca...@gmail.com>:
> This is a follow-up to my previous announcement with some notes on the
> Lingala hunspell package.
>
> Lingala is a Bantu language and as such has a very complicated verbal
> morphology.  This complexity has made it difficult to develop open
> source spell checkers for other Bantu languages - the existing
> packages are simple word lists that don't even attempt to crack the
> verb system.  Only the Swahili package is large enough to provide
> decent coverage of everyday texts, but it could still be improved.
>
> The hunspell-ln package is the first attempt I know of to handle Bantu
> verbal morphology completely in an affix file.  In addition, Lingala
> is a tone language and has vowel harmony marked orthographically -
> both of these features are handled correctly in the affix file as
> well.   With all of this in mind I'd encourage anyone interested in
> Bantu languages to have a look at the "developer's pack" for
> hunspell-ln in the SVN repository here:
>
> http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/
>
> It's best to start with the README-dev file, which describes the
> different files in the developer's pack, and some useful makefile
> targets for dictionary maintenance and development.
>
> http://lingala.svn.sourceforge.net/viewvc/lingala/hunspell/README-dev?view=markup
>
> Noun classes are stored in the nc* files, and these get assigned
> appropriate affix flags which generate the correct plurals.  This much
> is straightforward.
>
> Verbs in Lingala are formed from a "radical", e.g. "bák", to which
> various optional semantic adjuncts can be added in more-or-less
> predictable ways, "bákis", "bákisam", "bákisel", etc.   To these are
> added obligatory prefixes and suffixes indicating personal pronouns
> and tense.   After a lot of experimentation, the best solution for
> spell checking seems to be to store the radicals+adjuncts as words in
> the .dic file, and add the prefixes and suffixes using the affix file.
>  There are many reasons for this choice - among them the fact that
> this simplifies the necessary affixes to within the scope of what
> hunspell can handle.   Also, it is very difficult to predict which
> semantic adjuncts "work" with which radicals, and in which order and
> in which combinations.  This all depends on semantics, and so (in my
> view) is best left to lexicography rather than automatic generation.  And in
> reality, there aren't that many of these combinations used in everyday
> Lingala.   The 900 or so in the first release account for at least 90%
> of the verb forms in the web corpus (up to diacritic differences).
>
> In any case, the combined radicals+adjuncts are stored in the
> (poorly-named) "radicals.txt" in the repository.   A perl script adds
> the correct affix flags (two for each word in radicals.txt - "W" for
> high tone suffixes and "V" for low tone suffixes) and enters the words
> in ln_CD.dic.
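
A minimal sketch of the flagging step described above, written in Python rather than the Perl used in the repository. It is not the actual script: the function name `build_dic`, the flag order "ZVW", and the de-duplication and sorting are all assumptions made for illustration. The "Z" flag is the NEEDAFFIX marker explained in the next paragraph of the quoted message. The only Hunspell convention relied on is that a .dic file starts with a line giving the number of entries.

```python
def build_dic(radicals, flags="ZVW"):
    """Return .dic text for the given radical+adjunct stems.

    `flags` is an assumed flag string: V (low tone suffixes),
    W (high tone suffixes) and Z (NEEDAFFIX), in a hypothetical order.
    """
    # Drop blank lines and duplicates, then sort for a stable output.
    stems = sorted({line.strip() for line in radicals if line.strip()})
    # A Hunspell .dic file begins with the entry count.
    lines = [str(len(stems))]
    lines += [f"{stem}/{flags}" for stem in stems]
    return "\n".join(lines) + "\n"
```

For example, `build_dic(open("radicals.txt", encoding="utf-8"))` would produce one flagged entry per distinct non-empty line of the file.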
>
> The affix file uses some special features of hunspell to handle these
> words.  First, the words from radicals.txt get, additionally, a "Z"
> flag that is marked as "NEEDAFFIX" in the affix file - this is because
> the tense/pronoun affixes are required -- the radicals themselves are
> usually not valid words.    Then, the required prefix/suffix pair is
> treated as a circumfix X (see the "CIRCUMFIX" declaration in the affix
> file) - so a typical suffix (here a subset of "V" for the habitual
> tense) is implemented as follows:
>
>
> SFX V Y 63
> ...
> SFX V 0 aka/PX  .   +HABITUAL1
> SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́]   +HABITUAL1
> SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
> SFX V 0 ɛkɛ/PX  [ɛɛ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
> SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]   +HABITUAL1
> SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
> SFX V 0 ɔkɔ/PX  [ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́][^aáeéiíoóuúɛɛ́ɔɔ́]  +HABITUAL1
> ...
>
>
> PFX P Y 8
> PFX P 0 na/X    .
> PFX P 0 o/X .
> PFX P 0 a/X .
> PFX P 0 e/X .
> PFX P 0 to/X    .
> PFX P 0 bo/X    .
> PFX P 0 ba/X    .
> PFX P 0 i/X .
>
> In the .dic file, "bikol/ZV" would become "bikolaka/PX" and then
> "nabikolaka", "obikolaka", etc.   The complicated-looking cases in the
> V suffix handle vowel harmony - "bɔtɔl/ZV" would become "bɔtɔlɔkɔ/PX",
> etc.
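
The expansion described above can be imitated in a few lines of Python. This is only a rough sketch of what the quoted affix excerpt appears to do, not Hunspell's own matching code: the function names are made up for illustration, tone marks are stripped before the harmony test as a simplification, and only the habitual-tense lines shown in the excerpt are covered (the elided "..." rules are ignored).

```python
import re
import unicodedata

# The eight subject prefixes from the "PFX P" block quoted above.
PREFIXES = ["na", "o", "a", "e", "to", "bo", "ba", "i"]


def strip_tone(text):
    """Drop tone marks (combining accents) so only base letters remain."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def habitual_suffixes(stem):
    """Suffixes the quoted "SFX V" lines would attach to this stem."""
    suffixes = ["aka"]  # the "." condition matches every stem
    # Harmony rules: last vowel is ɛ or ɔ, followed by 1 to 3 non-vowels.
    match = re.search(r"([ɛɔ])[^aeiouɛɔ]{1,3}$", strip_tone(stem))
    if match:
        vowel = match.group(1)
        suffixes.append(vowel + "k" + vowel)
    return suffixes


def habitual_forms(stem):
    """Prefix and suffix always occur together (the CIRCUMFIX pairing);
    the bare stem is never returned (the NEEDAFFIX behaviour)."""
    return [p + stem + s for s in habitual_suffixes(stem) for p in PREFIXES]
```

Calling `habitual_forms("bikol")` yields the eight prefixed forms starting with "nabikolaka" and "obikolaka", while `habitual_forms("bɔtɔl")` additionally contains the harmonized forms such as "nabɔtɔlɔkɔ". Read literally, the "." condition in the excerpt also lets "aka" attach to harmonizing stems, so the sketch returns both variants for "bɔtɔl".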
>
> Everything appears to work nicely.   So far I've looked in depth at the
> morphology of only three Bantu languages - Lingala,
> Kinyarwanda, and Swahili, but my naive hope is that this approach
> could serve as a model for developing hunspell packages for other
> Bantu languages.   The top candidates (based on having found a
> sufficient amount of text on the web with a crawler) would be:
> Kikongo, Kikuyu, Luganda, Ndebele (nd/nr), Ndonga, Northern Sotho,
> Nyanja/Chichewa, Rundi, Kinyarwanda, Swati, Sesotho, Swahili,
> Setswana, Tsonga, Venda, Xhosa, and Zulu.
>
> Comments, questions, suggestions appreciated.
>
> Kevin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lingucomponent.openoffice.org
> For additional commands, e-mail: dev-h...@lingucomponent.openoffice.org
>
>
