Hello all,

This is an announcement of a new package called "charlifter" that does statistical diacritic restoration:
https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317046

and two new open source word lists, one for Lingala (joint work with Denis Jacquerye):

https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317051

and one for Hawaiian:

http://borel.slu.edu/ispell/haw_US.zip

The charlifter script is language-independent: all you need to do is provide it with some plain text in the language of interest with all of the diacritical marks in place. From this the script "learns", statistically, where the diacritics belong (there is a toy sketch of this kind of word-level approach at the end of this message). You can also improve performance by feeding it a word list during the training phase.

I've built and packaged pre-trained models for several languages, including Irish, French, Lingala, Samoan, and Hawaiian; see the directories "charlifter-*" here:

http://lingala.svn.sourceforge.net/viewvc/lingala/

Once you've trained a language model, or installed one of the models above, you can feed plain ASCII text to the script and it restores the missing diacritics or extended Unicode characters:

Irish:

$ echo "an chead teanga oifigiuil" | sf.pl -r ga
an chéad teanga oifigiúil

Lingala (note the open vowels "ɔ" are restored correctly):

$ echo "Ngolo, nina, zambi ikamwisi bango." | sf.pl -r ln
Ngɔlɔ, niná, zambí ikamwísí bangó.

Hawaiian:

$ echo "Olelo aku 'o Papa" | sf.pl -r haw
ʻŌlelo aku ʻo Pāpā

etc.

This work ties in closely with my Crúbadán project, which is gathering text corpora for 400+ languages with a web crawler:

http://borel.slu.edu/crubadan/

Lingala is a good example. When written properly, it uses diacritics to indicate tone, and also uses the open vowels "ɔ" and "ɛ", but 95% of what is written on the web is in plain ASCII (no tone marks, and "o" and "e" in place of "ɔ" and "ɛ"). Therefore, to use the web corpus effectively for language modelling purposes, it is important to restore these ASCII texts to the proper encoding as accurately as possible.

The spell-checker word lists for Lingala and Hawaiian came directly from this approach: train charlifter on the small amount (say 5%) of web text with correct diacritics in place, then restore the other 95%, and use the resulting large corpus to generate frequency lists for hand-editing, just as we've done with many other Crúbadán languages.

Please contact me if you're interested in trying to develop a new word list using this approach. I'm particularly interested in African languages.

Kevin
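For anyone curious about the flavour of the idea, here is a minimal word-level sketch of this kind of statistical restoration in Python. It is not the actual charlifter code (which also uses corpus statistics and context, and handles letters like "ɔ"/"ɛ" that plain Unicode folding cannot strip); it simply maps each ASCII-folded word to the accented spelling seen most often in the training text:

# -*- coding: utf-8 -*-
# Toy word-level diacritic restorer (illustration only; NOT the charlifter code).
# It maps each ASCII-folded word to the accented spelling seen most often in
# the training text. A real restorer would also use context, and would need an
# explicit fold table for letters like "ɔ"/"ɛ" that Unicode decomposition
# cannot strip. Casing is ignored for simplicity.
import re
import unicodedata
from collections import Counter, defaultdict

def ascii_fold(word):
    # "chéad" -> "chead": decompose, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def train(text):
    # For each folded form, count the accented spellings that realise it
    counts = defaultdict(Counter)
    for word in re.findall(r"\w+", text):
        counts[ascii_fold(word.lower())][word.lower()] += 1
    # Keep only the most frequent accented form per folded form
    return {folded: c.most_common(1)[0][0] for folded, c in counts.items()}

def restore(model, ascii_text):
    # Replace each word with its most likely accented form, if we know one
    return re.sub(r"\w+",
                  lambda m: model.get(m.group(0).lower(), m.group(0)),
                  ascii_text)

if __name__ == "__main__":
    model = train(u"an chéad teanga oifigiúil eile")
    print(restore(model, u"an chead teanga oifigiuil"))
    # -> an chéad teanga oifigiúil

Trained on that tiny example string, it lifts "an chead teanga oifigiuil" back to "an chéad teanga oifigiúil"; a real model is of course trained on a full corpus and resolves ambiguous words from context rather than by raw frequency alone.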
