Hello Leszek,

Unfortunately, you cannot just use all the data from UD with OpenNLP as-is. There are several issues that require either preprocessing or rejecting some UD training files entirely.

* Punctuation: the sentence detector and tokenizer must know about the sentence-separating and token-separating punctuation, and these symbols are not the same for all languages, even though they may sometimes look the same. OpenNLP does have something for this (-eosChars), but I could not seem to get it to work. So instead, I passed all text through several simple sed transformations. I used sed instead of tr because tr did not seem to understand multibyte characters. We do, of course, the same preprocessing in our Java code when we use the trained models (a sketch follows below). You may also want to normalize all forms of quotation marks to a single form, both when training and when using the models. Look for ideographic commas, periods, exclamation marks and question marks, and also transform the Devanagari and Urdu periods. When using the models, be sure to get rid of abnormal whitespace characters; they are countless: https://www.fileformat.info/info/unicode/category/Zs/list.htm
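To make that concrete, here is a minimal, untested sketch of the kind of normalization we run in Java before feeding text to the models. The class name and the exact character mappings are just examples; which replacements you actually need depends on your languages:

    import java.util.regex.Pattern;

    // Sketch of the same normalization the sed transformations do:
    // map ideographic/full-width punctuation and language-specific
    // sentence terminators to ASCII equivalents, and collapse the
    // many Unicode space variants (category Zs) to a plain space.
    public final class PunctuationNormalizer {

        private static final Pattern ZS_SPACES = Pattern.compile("\\p{Zs}+");

        public static String normalize(String text) {
            String t = text
                    .replace('\u3001', ',')   // ideographic comma
                    .replace('\u3002', '.')   // ideographic full stop
                    .replace('\uFF01', '!')   // full-width exclamation mark
                    .replace('\uFF1F', '?')   // full-width question mark
                    .replace('\u0964', '.')   // Devanagari danda
                    .replace('\u06D4', '.')   // Urdu full stop
                    .replace('\u201C', '"')   // left double quotation mark
                    .replace('\u201D', '"')   // right double quotation mark
                    .replace('\u2018', '\'')  // left single quotation mark
                    .replace('\u2019', '\''); // right single quotation mark
            return ZS_SPACES.matcher(t).replaceAll(" ");
        }
    }

Apply the same normalization when training and when running the models, otherwise the two pipelines see different data.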
* License issues: some training files have licensing issues, and the text and tokens in them are therefore rendered unusable. These training files must not be used. You can find them easily by grepping for multiple occurrences of underscores; most if not all text is replaced by underscores in those files. Keep token separation by whitespace (or not) in mind: non-CJ files have _ _ _ _ _, CJ files have ______.

* Not real-life data: the tokenizer and POS tagger for several languages can easily become confused, so you may want to add additional custom training sentences (see the sketch in the PS below). Tokenizers for most whitespace-separated languages have trouble with abbreviations followed by a full stop mid-sentence, e.g. "And so Mr. Duck went back to Duck City.". Also, the POS tagger can sometimes become confused when it encounters seemingly conflicting samples in the different corpora. It can be very annoying when an obvious PROPer noun is consistently tagged as an ADJective.

These are just some of the issues you can expect when dealing with UD. Not all problems can be solved easily; I haven't found out why the lemmatizer breaks for some languages. Keep in mind that for German and Czech you need to allocate a lot of memory for training.

Cheers, Markus
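PS: in case it helps, below is a rough, untested sketch of how you could set custom end-of-sentence characters and an abbreviation dictionary through the API rather than the -eosChars flag, when training a sentence detector. It assumes a recent OpenNLP (1.9+/2.x); "sentences.txt" (one sentence per line) and the abbreviation entries are placeholders:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.StringList;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            // Abbreviations that must not end a sentence ("Mr. Duck").
            Dictionary abbreviations = new Dictionary();
            abbreviations.put(new StringList("Mr."));

            // End-of-sentence characters, including the ideographic
            // full stop; the programmatic counterpart of -eosChars.
            char[] eosChars = {'.', '!', '?', '\u3002'};

            SentenceDetectorFactory factory =
                    new SentenceDetectorFactory("en", true, abbreviations, eosChars);

            try (ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("sentences.txt")),
                            StandardCharsets.UTF_8))) {
                SentenceModel model = SentenceDetectorME.train(
                        "en", samples, factory, TrainingParameters.defaultParams());
                // serialize with model.serialize(...) as usual
            }
        }
    }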
On Mon, 13 Jun 2022 at 08:38, <[email protected]> wrote:

> Hi
> I wondered how good or bad the quality of OpenNLP models is for various
> types of languages (Latin alphabet, Cyrillic alphabet, abjads, ideographic).
> I wrote a program to download the Universal Dependencies treebank
> https://universaldependencies.org/ and train and evaluate OpenNLP models for
> a language (sentence detector, tokenizer, POS tagger, lemmatizer).
> The program and evaluation results are available at
> https://github.com/abzif/babzel
> This program may be useful for somebody who wants to train generic
> models for a desired language with little effort. Universal Dependencies
> supports a lot of languages, so it is well suited for this purpose.
> The evaluation results show that models trained for alphabetic languages
> (Latin, Cyrillic, abjads) seem to have really good quality.
> Chinese-Japanese-Korean models are not that good. Also, the lemmatizer
> fails with an exception for some languages.
> Maybe the results can be an inspiration for improvements.
> Thanks
> Leszek