Hi Markus
Thanks for your comments.
Problems with punctuation characters and differing end-of-sentence characters
are mitigated by normalizing the text before processing. Normalization means
lowercasing and folding characters to their ASCII equivalents (icu4j is used
for this purpose).
This is not perfect, but it improves the situation.
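
Roughly, the normalization does something like this (a simplified sketch;
the exact transliterator rules used by babzel may differ):

    import com.ibm.icu.text.Transliterator;

    public class TextNormalizer {

        // Compound transliterator: convert to Latin script,
        // fold characters to ASCII equivalents, then lowercase.
        private static final Transliterator FOLD =
                Transliterator.getInstance("Any-Latin; Latin-ASCII; Lower");

        public static String normalize(String text) {
            return FOLD.transliterate(text);
        }

        public static void main(String[] args) {
            // prints "arger uber strasse"
            System.out.println(normalize("Ärger über Straße"));
        }
    }
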
Sentences are validated. Sentences which are empty, contain "_ _ _", whose
token list does not match the sentence text, etc., are considered incorrect
and rejected. Only "valid" sentences are used for training.
Some languages need an extra "transformation" of the data to be usable, for
example German.
I observed that the lemmatizer fails for some languages:
german - compound nouns are inconsistently lemmatized. Sometimes they are
lemmatized to the full word, but sometimes only to their last component, for
example: kundendienstzentrums => zentrum, geheimdienste => dienst.
This causes an enormous number of outcomes, and the lemmatizer fails with an
out-of-memory error. Such "incorrect" sentences are rejected by the validator,
so the lemmatizer model can be trained.
russian - fails with an out-of-memory error
arabic - training is very slow; the model is computed but then fails at
deserialization (an exception like "string too long" or similar)
Regards
Leszek
Od: "Markus Jelsma" <[email protected]>
Do: [email protected];
Wysłane: 13:15 Poniedziałek 2022-06-13
Temat: Re: Experiment: How good is quality of OpenNLP models for various
languages.
> Hello Leszek,
>
> You can, unfortunately, not just use all data from UD with OpenNLP as is.
> There are several issues with it that either need some preprocessing, or
> complete rejection of the UD training files.
> * Punctuation: the sentence detector and tokenizer must know about line
> separator and token separator punctuation; these symbols are not the same
> for all languages, even though they may sometimes look the same. OpenNLP
> does have something for this (-eosChars), but I could not seem to get it
> to work. So instead, I passed all text through several simple sed
> transformations. I used sed instead of tr because tr did not seem to
> understand multibyte characters. We do, of course, the same preprocessing
> in our Java code when we use the trained models. You may also want to
> normalize all forms of quotation marks to a single form, when training and
> when using the models. Look for ideographic commas, periods, exclamation
> marks and question marks. Also transform the Devanagari and Urdu periods.
> When using the models, be sure to get rid of abnormal whitespaces; they
> are countless: https://www.fileformat.info/info/unicode/category/Zs/list.htm
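>
> To illustrate, the kind of mapping meant here could look roughly like
> this in Java (an illustrative sketch, not our actual preprocessing):
>
>     import java.util.Map;
>
>     public class PunctuationNormalizer {
>
>         // Fold language-specific punctuation to a single form. The
>         // mapping below is only a starting point; extend it per language.
>         private static final Map<Character, Character> MAP = Map.of(
>                 '\u3002', '.',  // ideographic full stop
>                 '\u3001', ',',  // ideographic comma
>                 '\uFF01', '!',  // fullwidth exclamation mark
>                 '\uFF1F', '?',  // fullwidth question mark
>                 '\u0964', '.',  // Devanagari danda
>                 '\u06D4', '.',  // Arabic full stop (also used in Urdu)
>                 '\u201C', '"',  // left double quotation mark
>                 '\u201D', '"'); // right double quotation mark
>
>         public static String normalize(String text) {
>             StringBuilder sb = new StringBuilder(text.length());
>             for (int i = 0; i < text.length(); i++) {
>                 char c = text.charAt(i);
>                 // Fold abnormal whitespace (category Zs) to a plain space.
>                 if (Character.getType(c) == Character.SPACE_SEPARATOR) {
>                     sb.append(' ');
>                 } else {
>                     sb.append(MAP.getOrDefault(c, c));
>                 }
>             }
>             return sb.toString();
>         }
>     }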
>
> * License issues: some training files have licensing issues and therefore
> the text and tokens are rendered unusable. These training files must not
> be used. You can find them easily by grepping for multiple occurrences of
> underscores. Most if not all text is replaced by underscores in those
> files. Keep token separation using whitespace (or not) in mind: non-CJ has
> _ _ _ _ _, CJ has ______.
>
> * Not real-life data: the tokenizer and POS tagger for several languages
> can become easily confused, so you may want to add additional custom
> training sentences. Tokenizers for most whitespace-separated languages
> have trouble with abbreviations followed by a full stop mid-sentence,
> e.g.: "And so Mr. Duck went back to Duck City.". Also, the POS tagger can
> sometimes become confused when it encounters seemingly conflicting samples
> in the different corpora. It can be very annoying when an obvious PROPer
> noun is consistently tagged as an ADJective.
>
> These are just some of the issues you can expect when dealing with UD.
> Not all problems can be solved easily. I haven't found out why the
> lemmatizer breaks for some languages. Keep in mind that for German and
> Czech you need a lot of memory allocated.
>
> Cheers,
> Markus
>
> On Mon, 13 Jun 2022 at 08:38, wrote:
>
> > Hi
> > I wondered how good or bad the quality of OpenNLP models is for various
> > types of languages (Latin alphabet, Cyrillic alphabet, abjads,
> > ideographic).
> > I wrote a program to download the Universal Dependencies treebank
> > https://universaldependencies.org/, and to train and evaluate OpenNLP
> > models for a language (sentence-detector, tokenizer, pos-tagger,
> > lemmatizer).
> > The program and evaluation results are available at
> > https://github.com/abzif/babzel
> > This program may be useful for somebody who wants to train generic
> > models for a desired language with small effort. Universal Dependencies
> > supports a lot of languages, so it is good for this purpose.
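> >
> > For a flavour of the training step, here is a minimal sentence-detector
> > example with the plain OpenNLP API (a simplified sketch, not the actual
> > babzel code; the file names are made up):
> >
> >     import java.io.File;
> >     import java.io.FileOutputStream;
> >     import java.nio.charset.StandardCharsets;
> >
> >     import opennlp.tools.sentdetect.SentenceDetectorFactory;
> >     import opennlp.tools.sentdetect.SentenceDetectorME;
> >     import opennlp.tools.sentdetect.SentenceModel;
> >     import opennlp.tools.sentdetect.SentenceSample;
> >     import opennlp.tools.sentdetect.SentenceSampleStream;
> >     import opennlp.tools.util.MarkableFileInputStreamFactory;
> >     import opennlp.tools.util.ObjectStream;
> >     import opennlp.tools.util.PlainTextByLineStream;
> >     import opennlp.tools.util.TrainingParameters;
> >
> >     public class TrainSentenceDetector {
> >         public static void main(String[] args) throws Exception {
> >             // One sentence per line, extracted from the UD treebank.
> >             ObjectStream<String> lines = new PlainTextByLineStream(
> >                     new MarkableFileInputStreamFactory(
> >                             new File("de-sentences.txt")),
> >                     StandardCharsets.UTF_8);
> >             try (ObjectStream<SentenceSample> samples =
> >                     new SentenceSampleStream(lines)) {
> >                 SentenceModel model = SentenceDetectorME.train(
> >                         "de", samples,
> >                         new SentenceDetectorFactory("de", true, null, null),
> >                         TrainingParameters.defaultParams());
> >                 try (FileOutputStream out =
> >                         new FileOutputStream("de-sent.bin")) {
> >                     model.serialize(out);
> >                 }
> >             }
> >         }
> >     }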
> > The evaluation results show that models trained for alphabetic languages
> > (Latin, Cyrillic, abjads) seem to have really good quality.
> > Chinese-Japanese-Korean models are not that good. Also, the lemmatizer
> > fails with an exception for some languages.
> > Maybe the results can be an inspiration for improvements.
> > Thanks
> > Leszek
> >
>