Hello Gabriele,
I took a quick look at the paisa corpus. I am not sure the OpenNLP command
line app can handle it out of the box. Did you use the command:
opennlp SentenceDetectorTrainer -model it-sent.bin -lang it -data paisa.annotated.CoNLL.utf8 -encoding UTF-8
If you did, then it is treating the data as if it were in the OpenNLP native
format, i.e. one sentence per line, with the last non-whitespace character of
each line treated as the end of the sentence.
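For reference, the native sentence-detector training layout looks like the
sketch below (the Italian content is made up, only the layout matters): one
sentence per line, with an empty line marking a document boundary.

```
Il gatto dorme sul divano.
La riunione è fissata per domani mattina.

La pioggia cade e alimenta i fiumi.
```

The Paisà annotated file does not look like that, which is why the trainer
would be learning from the wrong boundaries.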
I tried the conllx format (opennlp SentenceDetectorTrainer.conllx), but that
requires a detokenizer dictionary. I also tried the conllu format, but I was
not sure how to set the number of sentences per sample, and I don't think
paisa is properly conllu either. So I think the reason you are getting poor
performance is that we don't have a paisa format stream.
The raw file (paisa.raw.utf8) is not in the OpenNLP format either. I think
what you need to do is read BOTH files and build a stream of SentenceSamples
from them.
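To make that concrete, here is a rough sketch of the boundary-recovery step.
The class and method names are my own inventions, and I am guessing at the
Paisà layout: the idea is just that once you have the sentence strings from
the annotated file, you can locate each one in the raw text and turn the
matches into character-offset spans.

```java
import java.util.ArrayList;
import java.util.List;

public class PaisaSpans {

    // Locate each sentence (as extracted from the annotated file) inside the
    // raw document text and return its [start, end) character offsets. With
    // OpenNLP on the classpath these offsets would become
    // opennlp.tools.util.Span objects, one SentenceSample per document.
    static List<int[]> sentenceSpans(String doc, List<String> sentences) {
        List<int[]> spans = new ArrayList<>();
        int from = 0;
        for (String sent : sentences) {
            int start = doc.indexOf(sent, from);
            if (start < 0) {
                throw new IllegalArgumentException(
                        "sentence not found in raw text: " + sent);
            }
            spans.add(new int[] { start, start + sent.length() });
            from = start + sent.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "Prima frase. Seconda frase!";
        List<int[]> spans = sentenceSpans(doc,
                List.of("Prima frase.", "Seconda frase!"));
        for (int[] s : spans) {
            System.out.println(s[0] + "," + s[1]);
        }
    }
}
```

From there, if I remember the API correctly, you would wrap each document in
a SentenceSample (document text plus Spans), expose the collection as an
ObjectStream<SentenceSample>, and pass that to SentenceDetectorME.train(...)
directly instead of going through the command line.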
I can look into this further if you need help, but I am not the best person on
the team to do this.
Daniel
> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]>
> wrote:
>
> Hi all,
>
> I'm trying to use openNLP to train some models for Italian, basically to get
> some familiarity with the API. For background, I'm familiar with machine
> learning concepts and understand what an NLP pipeline looks like, but this
> is the first time I have actually had to put together an application with
> all of it.
> So I started with the sentence detector. I was able to train an Italian SD
> with a corpus of sentences from http://www.corpusitaliano.it/en/. However the
> performance of the detector is somewhat below my expectations. It makes
> pretty obvious mistakes, like failing to recognize an end-of-sentence full
> stop (example below*), or failing to spot an abbreviation preceded by
> punctuation (I've filed issue OPENNLP-1163 on
> Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>
> Even though the documentation is very good, I feel it lacks some best
> practices and suggestions. For instance:
>
> * Is my sentence detection training set supposed to consist of coherent
> documents, or will a bunch of random sentences with a blank line every 20-30
> lines work?
> * Do my training examples in the openNLP native format need to be formatted
> in a special way? Will the algorithm ignore things like extra white space or
> tabs between words? Do examples with a lot of punctuation, such as quotes or
> parentheses, somehow affect the outcome?
> * How many training examples (or events) are recommended?
> * Is it better to provide a case-sensitive abbreviation dictionary or a
> case-insensitive one?
> * Is issue 1163 a known problem? I think other languages, like French,
> might be affected as well.
> * Are there examples of complete production-grade data sets in Italian or
> other languages that have been successfully used to train openNLP tools?
>
> I believe I could find answers to most of these questions just by looking
> at the code, but someone who has already been through it could perhaps
> point me in the right direction.
> Basically, I'm asking for best practices and pro tips.
>
> Thank you
>
> * failure to recognize EOS full stop:
> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
> disciplina.
> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
> 1623, grazie a Willhelm
> Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
> si arrivò a creare macchine in grado di effettuare calcoli matematici con
> numeri fino a sei cifre, anche se non in maniera autonoma.
>
>
> Gabriele Vaccari
> Dedalus SpA
>