Hello Gabriele,
I took a quick look at the paisa corpus. I am not sure the OpenNLP command
line app can handle it out of the box. Did you use the command:
opennlp SentenceDetectorTrainer -model it-sent.bin -lang it -data paisa.annotated.CoNLL.utf8 -encoding UTF-8
If you did, then it is treating the data as if it were in the OpenNLP native
format, i.e. one sentence per line, with the last non-whitespace character of
each line treated as the end of the sentence.
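For reference, the native sentence-detector training layout looks like the
sketch below (the Italian content is made up, only the layout matters): one
sentence per line, with an empty line marking a document boundary.

```
Il gatto dorme sul divano.
La riunione è fissata per domani mattina.

La pioggia cade e alimenta i fiumi.
```

The Paisà annotated file does not look like that, which is why the trainer
would be learning from the wrong boundaries.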
I tried the conllx format (opennlp SentenceDetectorTrainer.conllx), but that
requires a detokenizer dictionary. I also tried the conllu format, but I was
not sure how to set the number of sentences per sample, and I don't think
paisa is properly conllu either. So I think the reason you are getting poor
performance is that we don't have a paisa format stream.
The raw file (paisa.raw.utf8) is not in the OpenNLP format either. I think
what you need to do is read BOTH files and build a stream of SentenceSamples
from them.
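To make that concrete, here is a rough sketch of the boundary-recovery step.
The class and method names are my own inventions, and I am guessing at the
Paisà layout: the idea is just that once you have the sentence strings from
the annotated file, you can locate each one in the raw text and turn the
matches into character-offset spans.

```java
import java.util.ArrayList;
import java.util.List;

public class PaisaSpans {

    // Locate each sentence (as extracted from the annotated file) inside the
    // raw document text and return its [start, end) character offsets. With
    // OpenNLP on the classpath these offsets would become
    // opennlp.tools.util.Span objects, one SentenceSample per document.
    static List<int[]> sentenceSpans(String doc, List<String> sentences) {
        List<int[]> spans = new ArrayList<>();
        int from = 0;
        for (String sent : sentences) {
            int start = doc.indexOf(sent, from);
            if (start < 0) {
                throw new IllegalArgumentException(
                        "sentence not found in raw text: " + sent);
            }
            spans.add(new int[] { start, start + sent.length() });
            from = start + sent.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "Prima frase. Seconda frase!";
        List<int[]> spans = sentenceSpans(doc,
                List.of("Prima frase.", "Seconda frase!"));
        for (int[] s : spans) {
            System.out.println(s[0] + "," + s[1]);
        }
    }
}
```

From there, if I remember the API correctly, you would wrap each document in
a SentenceSample (document text plus Spans), expose the collection as an
ObjectStream<SentenceSample>, and pass that to SentenceDetectorME.train(...)
directly instead of going through the command line.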
I can look into this further if you need help, but I am not the best person on
the team to do this.
Daniel
> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]>
> wrote:
>
> Hi all,
>
> I'm trying to use openNLP to train some models for Italian, basically to get
> some familiarity with the API. For background, I'm familiar with machine
> learning concepts and understand what an NLP pipeline looks like, but this
> is the first time I have actually had to put together an application with
> all of it.
> So I started with the sentence detector. I was able to train an Italian SD
> with a corpus of sentences from http://www.corpusitaliano.it/en/. However the
> performance of the detector is somewhat below my expectations. It makes
> pretty obvious mistakes, like failing to recognize an end-of-sentence full
> stop (example below*), or failing to spot an abbreviation preceded by
> punctuation (I've filed issue OPENNLP-1163 on
> Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>
> Even though the documentation is very good, I feel it lacks some best
> practices and suggestions. For instance:
>
> * Is my sentence detection training set supposed to consist of coherent
> documents, or will a bunch of random sentences with a blank line every 20-30
> lines work?
> * Do my training examples in the openNLP native format need to be formatted
> in a special way? Will the algorithm ignore things like extra white space or
> tabs between words? Do examples with a lot of punctuation, such as quotes or
> parentheses, somehow affect the outcome?
> * How many training examples (or events) are recommended?
> * Is it better to provide a case-sensitive abbreviation dictionary or a
> case-insensitive one?
> * Is issue 1163 a known problem? I think other languages, like French,
> might be affected as well.
> * Are there examples of complete production-grade data sets in Italian or
> other languages that have been successfully used to train openNLP tools?
>
> I believe I could find answers to most of these questions just by looking
> at the code, but someone who has already been through it could perhaps
> point me in the right direction.
> Basically, I'm asking for best practices and pro tips.
>
> Thank you
>
> * failure to recognize EOS full stop:
> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
> disciplina.
> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
> 1623, grazie a Willhelm
> Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
> si arrivò a creare macchine in grado di effettuare calcoli matematici con
> numeri fino a sei cifre, anche se non in maniera autonoma.
>
>
> Gabriele Vaccari
> Dedalus SpA
>