Hello Leszek,

Unfortunately, you cannot just use all the data from UD with OpenNLP as-is. There are several issues that require either preprocessing or rejecting some UD training files entirely.

* Punctuation: the sentence detector and tokenizer must know about the sentence-separating and token-separating punctuation, and these symbols are not the same for all languages, even though they may sometimes look the same. OpenNLP does have something for this (-eosChars), but I could not seem to get it to work. So instead, I passed all text through several simple sed transformations. I used sed instead of tr because tr did not seem to understand multibyte characters. We do, of course, the same preprocessing in our Java code when we use the trained models (a sketch follows below). You may also want to normalize all forms of quotation marks to a single form, both when training and when using the models. Look for ideographic commas, periods, exclamation marks and question marks, and also transform the Devanagari and Urdu periods. When using the models, be sure to get rid of abnormal whitespace characters; they are countless: https://www.fileformat.info/info/unicode/category/Zs/list.htm
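To make that concrete, here is a minimal, untested sketch of the kind of normalization we run in Java before feeding text to the models. The class name and the exact character mappings are just examples; which replacements you actually need depends on your languages:

    import java.util.regex.Pattern;

    // Sketch of the same normalization the sed transformations do:
    // map ideographic/full-width punctuation and language-specific
    // sentence terminators to ASCII equivalents, and collapse the
    // many Unicode space variants (category Zs) to a plain space.
    public final class PunctuationNormalizer {

        private static final Pattern ZS_SPACES = Pattern.compile("\\p{Zs}+");

        public static String normalize(String text) {
            String t = text
                    .replace('\u3001', ',')   // ideographic comma
                    .replace('\u3002', '.')   // ideographic full stop
                    .replace('\uFF01', '!')   // full-width exclamation mark
                    .replace('\uFF1F', '?')   // full-width question mark
                    .replace('\u0964', '.')   // Devanagari danda
                    .replace('\u06D4', '.')   // Urdu full stop
                    .replace('\u201C', '"')   // left double quotation mark
                    .replace('\u201D', '"')   // right double quotation mark
                    .replace('\u2018', '\'')  // left single quotation mark
                    .replace('\u2019', '\''); // right single quotation mark
            return ZS_SPACES.matcher(t).replaceAll(" ");
        }
    }

Apply the same normalization when training and when running the models, otherwise the two pipelines see different data.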
* License issues: some training files have licensing issues, and the text and tokens in them are therefore rendered unusable. These training files must not be used. You can find them easily by grepping for multiple occurrences of underscores; most if not all text is replaced by underscores in those files. Keep token separation by whitespace (or not) in mind: non-CJ files have _ _ _ _ _, CJ files have ______.

* Not real-life data: the tokenizer and POS tagger for several languages can easily become confused, so you may want to add additional custom training sentences (see the sketch in the PS below). Tokenizers for most whitespace-separated languages have trouble with abbreviations followed by a full stop mid-sentence, e.g. "And so Mr. Duck went back to Duck City.". Also, the POS tagger can sometimes become confused when it encounters seemingly conflicting samples in the different corpora. It can be very annoying when an obvious PROPer noun is consistently tagged as an ADJective.

These are just some of the issues you can expect when dealing with UD. Not all problems can be solved easily; I haven't found out why the lemmatizer breaks for some languages. Keep in mind that for German and Czech you need to allocate a lot of memory for training.

Cheers, Markus
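PS: in case it helps, below is a rough, untested sketch of how you could set custom end-of-sentence characters and an abbreviation dictionary through the API rather than the -eosChars flag, when training a sentence detector. It assumes a recent OpenNLP (1.9+/2.x); "sentences.txt" (one sentence per line) and the abbreviation entries are placeholders:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.StringList;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            // Abbreviations that must not end a sentence ("Mr. Duck").
            Dictionary abbreviations = new Dictionary();
            abbreviations.put(new StringList("Mr."));

            // End-of-sentence characters, including the ideographic
            // full stop; the programmatic counterpart of -eosChars.
            char[] eosChars = {'.', '!', '?', '\u3002'};

            SentenceDetectorFactory factory =
                    new SentenceDetectorFactory("en", true, abbreviations, eosChars);

            try (ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("sentences.txt")),
                            StandardCharsets.UTF_8))) {
                SentenceModel model = SentenceDetectorME.train(
                        "en", samples, factory, TrainingParameters.defaultParams());
                // serialize with model.serialize(...) as usual
            }
        }
    }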
On Mon, 13 Jun 2022 at 08:38, <[email protected]> wrote:

> Hi
> I wondered how good or bad the quality of OpenNLP models is for various
> types of languages (Latin alphabet, Cyrillic alphabet, abjads, ideographic).
> I wrote a program to download the Universal Dependencies treebank
> https://universaldependencies.org/ and train and evaluate OpenNLP models for
> a language (sentence detector, tokenizer, POS tagger, lemmatizer).
> The program and evaluation results are available at
> https://github.com/abzif/babzel
> This program may be useful for somebody who wants to train generic
> models for a desired language with little effort. Universal Dependencies
> supports a lot of languages, so it is well suited for this purpose.
> The evaluation results show that models trained for alphabetic languages
> (Latin, Cyrillic, abjads) seem to have really good quality.
> Chinese-Japanese-Korean models are not that good. Also, the lemmatizer
> fails with an exception for some languages.
> Maybe the results can be an inspiration for improvements.
> Thanks
> Leszek