
Diacritic restoration is an important feature for word processing in
these languages. With a Python version of your Perl script, we could
build an OpenOffice.org extension that restores diacritics across a
whole document (and in OCRed text as well).
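
As a rough starting point for such a port, here is a minimal Python
sketch of the general idea: learn from correctly written text which
diacritized form each plain-ASCII word usually takes, then substitute
it back. This is not charlifter's actual algorithm, which I have not
ported; it is only a word-level "most frequent form" baseline. The names
asciify, train and restore, and the EXTRA mapping for the open vowels,
are my own assumptions for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Letters that Unicode decomposition does not reduce to ASCII.
# Assumed mapping for illustration; extend it per language.
EXTRA = str.maketrans({"ɔ": "o", "ɛ": "e", "Ɔ": "O", "Ɛ": "E"})

# A word is a run of letters/digits plus any combining marks attached to them.
WORD = re.compile(r"[\w\u0300-\u036f]+")


def asciify(text):
    """Strip combining marks and map the open vowels to plain letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.translate(EXTRA)


def train(corpus):
    """For each ASCII key, count how often each diacritized form occurs."""
    model = defaultdict(Counter)
    for word in WORD.findall(unicodedata.normalize("NFC", corpus)):
        model[asciify(word)][word] += 1
    return model


def restore(model, text):
    """Replace each known word with its most frequent diacritized form."""
    def fix(match):
        word = match.group(0)
        forms = model.get(word)
        # Unknown words are left untouched.
        return forms.most_common(1)[0][0] if forms else word
    return WORD.sub(fix, text)


model = train("an chéad teanga oifigiúil")
print(restore(model, "an chead teanga oifigiuil"))
# prints: an chéad teanga oifigiúil
```

A lookup like this cannot choose between two valid forms of the same
ASCII word and does nothing for words it has never seen, so it should
be read as a baseline to compare against, not a replacement for the
statistical model.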


2009/4/8 Kevin Scannell <ksca...@gmail.com>:
> Hello all,
> [Sorry for cross-posting - sending this to aspell-devel and a12n as well]
>  This is an announcement of a new package called "charlifter" that
> does statistical diacritic restoration:
> https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317046
> and two new open source word lists, one for Lingala (joint work with
> Denis Jacquerye):
> https://sourceforge.net/project/showfiles.php?group_id=256316&package_id=317051
> and one for Hawaiian:
> http://borel.slu.edu/ispell/haw_US.zip
> The charlifter script is language-independent - all you need to do is
> provide it with some plain text in the language of interest with all
> of the diacritical marks in place.   From this the script "learns"
> where the diacritics belong, statistically.   You can also improve
> performance by feeding it a word list during the training phase.
> I've built and packaged pre-trained models for several languages,
> including Irish, French, Lingala, Samoan, and Hawaiian - see the
> directories "charlifter-*" here:
> http://lingala.svn.sourceforge.net/viewvc/lingala/
> Once you've trained a language model, or installed one of the models
> above, you can feed plain ASCII text to the script and it restores the
> diacritics or extended Unicode characters that are missing:
> Irish:
> $ echo "an chead teanga oifigiuil" | sf.pl -r ga
> an chéad teanga oifigiúil
> Lingala (note the open vowels "ɔ" are restored correctly):
> $ echo "Ngolo, nina, zambi ikamwisi bango." | sf.pl -r ln
> Ngɔlɔ, niná, zambí ikamwísí bangó.
> Hawaiian:
> $ echo "Olelo aku 'o Papa" | sf.pl -r haw
> ʻŌlelo aku ʻo Pāpā
> etc....
> This work ties in closely with my Crúbadán project which is gathering
> text corpora for 400+ languages with a web crawler:
> http://borel.slu.edu/crubadan/
> Lingala is a good example.  When written properly, it uses diacritics
> to indicate tone, and also uses the open vowels "ɔ" and "ɛ", but 95%
> of what is written on the web is in plain ASCII (no tone marks, "o"
> and "e" in place of "ɔ" and "ɛ").    Therefore, to use the web corpus
> effectively for language modelling purposes, it is important to
> restore these ASCII texts to the proper encoding as well as possible.
> The spell checkers for Lingala and Hawaiian came directly from this
> approach - train charlifter on the small amount (say 5%) of web text
> with correct diacritics in place, then restore the other 95% and use
> the resulting large corpus to generate frequency lists for
> hand-editing, just as we've done with many other Crúbadán languages.
> Please contact me if you're interested in trying to develop a new word
> list using this approach.  I'm particularly interested in African
> languages.
> Kevin
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lingucomponent.openoffice.org
> For additional commands, e-mail: dev-h...@lingucomponent.openoffice.org
