Hello,

This is an important feature for word processing in these languages. With a Python version of your Perl script, we could make an OpenOffice.org extension that supports whole-text diacritic restoration (including restoration of OCRed text).
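
To make the idea concrete, here is a minimal Python sketch of the train-then-restore workflow described in the announcement quoted below. It is not a port of charlifter, whose internals are not described in this thread. It assumes a much simpler word-level model that remembers, for each ASCII-folded word, the most frequent accented form seen in training. The names strip_diacritics, train, restore and EXTRA_FOLDS are made up for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Letters with no Unicode decomposition that are often typed as plain ASCII.
# This table is an illustrative assumption, not something taken from charlifter.
EXTRA_FOLDS = str.maketrans({"ɔ": "o", "ɛ": "e", "Ɔ": "O", "Ɛ": "E"})


def strip_diacritics(text):
    """Fold text to its plain form: drop combining marks, map open vowels."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.translate(EXTRA_FOLDS)


def train(corpus):
    """Count, for each folded lowercase word, the accented forms seen in corpus."""
    model = defaultdict(Counter)
    # NFC keeps accented letters as single code points so \w+ matches whole words.
    for word in re.findall(r"\w+", unicodedata.normalize("NFC", corpus)):
        form = word.lower()
        model[strip_diacritics(form)][form] += 1
    return model


def restore(model, text):
    """Replace each known word with its most frequent accented form."""
    def fix(match):
        word = match.group(0)
        candidates = model.get(strip_diacritics(word).lower())
        if not candidates:
            return word  # unseen word: leave it untouched
        best = candidates.most_common(1)[0][0]
        if word[0].isupper():  # keep an initial capital from the input
            best = best[0].upper() + best[1:]
        return best
    return re.sub(r"\w+", fix, text)


if __name__ == "__main__":
    # Training text reuses the example sentences from the announcement.
    model = train("an chéad teanga oifigiúil. Ngɔlɔ, niná, zambí ikamwísí bangó.")
    print(restore(model, "an chead teanga oifigiuil"))
    print(restore(model, "Ngolo, nina, zambi ikamwisi bango."))
```

Words that were not seen in training pass through unchanged. A word-level lookup like this cannot choose between two accented forms by context, and it does not handle the Hawaiian ʻokina, so it is only a starting point for experiments.
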
Regards,
László

2009/4/8 Kevin Scannell <ksca...@gmail.com>:
> Hello all,
>
> [Sorry for cross-posting - sending this to aspell-devel and a12n as well]
>
> This is an announcement of a new package called "charlifter" that
> does statistical diacritic restoration:
>
> https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317046
>
> and two new open source word lists, one for Lingala (joint work with
> Denis Jacquerye):
>
> https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317051
>
> and one for Hawaiian:
>
> http://borel.slu.edu/ispell/haw_US.zip
>
> The charlifter script is language-independent - all you need to do is
> provide it with some plain text in the language of interest with all
> of the diacritical marks in place. From this the script "learns"
> where the diacritics belong, statistically. You can also improve
> performance by feeding it a word list during the training phase.
> I've built and packaged pre-trained models for several languages,
> including Irish, French, Lingala, Samoan, and Hawaiian - see the
> directories "charlifter-*" here:
>
> http://lingala.svn.sourceforge.net/viewvc/lingala/
>
> Once you've trained a language model, or installed one of the models
> above, you can feed plain ASCII text to the script and it restores the
> diacritics or extended Unicode characters that are missing:
>
> Irish:
> $ echo "an chead teanga oifigiuil" | sf.pl -r ga
> an chéad teanga oifigiúil
>
> Lingala (note the open vowels "ɔ" are restored correctly):
> $ echo "Ngolo, nina, zambi ikamwisi bango." | sf.pl -r ln
> Ngɔlɔ, niná, zambí ikamwísí bangó.
>
> Hawaiian:
> $ echo "Olelo aku 'o Papa" | sf.pl -r haw
> ʻŌlelo aku ʻo Pāpā
>
> etc....
>
> This work ties in closely with my Crúbadán project which is gathering
> text corpora for 400+ languages with a web crawler:
>
> http://borel.slu.edu/crubadan/
>
> Lingala is a good example.
> When written properly, it uses diacritics
> to indicate tone, and also uses the open vowels "ɔ" and "ɛ", but 95%
> of what is written on the web is in plain ASCII (no tone marks, "o"
> and "e" in place of "ɔ" and "ɛ"). Therefore, to use the web corpus
> effectively for language modelling purposes, it is important to
> restore these ASCII texts to the proper encoding as best as possible.
>
> The spell checkers for Lingala and Hawaiian came directly from this
> approach - train charlifter on the small amount (say 5%) of web text
> with correct diacritics in place, then restore the other 95% and use
> the resulting large corpus to generate frequency lists for
> hand-editing, just as we've done with many other Crúbadán languages.
>
> Please contact me if you're interested in trying to develop a new word
> list using this approach. I'm particularly interested in African
> languages.
>
> Kevin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lingucomponent.openoffice.org
> For additional commands, e-mail: dev-h...@lingucomponent.openoffice.org