I forgot to mention that there is an Italian Universal Dependencies corpus that 
you can use to train a sentence detector.  It is in CoNLL-U format, which is 
supported by OpenNLP.
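
If you would rather inspect or clean the data before training, here is a
rough sketch of flattening a CoNLL-U file into the plain sentence detector
training format (one sentence per line, an empty line as a document
boundary). It assumes the treebank carries the UD v2 "# text = " and
"# newdoc" comments; the function names and file paths are made up for
illustration and are not part of OpenNLP.

```python
def conllu_to_sentences(lines):
    """Yield one sentence string per '# text = ' comment.

    Yields an empty string at each '# newdoc' marker that follows at
    least one sentence, so documents end up separated by a blank line.
    Token rows and all other comments are ignored.
    """
    prefix = "# text = "
    seen_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# newdoc"):
            if seen_sentence:
                yield ""  # document boundary
        elif line.startswith(prefix):
            yield line[len(prefix):].strip()
            seen_sentence = True


def convert(src_path, dst_path):
    """Write the flattened sentences of src_path to dst_path (hypothetical paths)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for sentence in conllu_to_sentences(src):
            dst.write(sentence + "\n")
```

Calling convert("it-ud-train.conllu", "it-sent.train") (both names are
placeholders) would give you a file you can read through and spot-check
before handing it to the trainer.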
Dan


> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> 
> wrote:
> 
> Hi all,
> 
> I'm trying to use openNLP to train some models for Italian, basically to get 
> some familiarity with the API. To provide some background, I'm familiar with 
> machine learning concepts and understand what an NLP pipeline looks like, 
> however this is the first time I actually have to go ahead and put together 
> an application with all this.
> So I started with the sentence detector. I was able to train an Italian SD 
> with a corpus of sentences from http://www.corpusitaliano.it/en/. However the 
> performance of the detector is somewhat below my expectations. It makes 
> pretty obvious mistakes, like failing to recognize an end-of-sentence full 
> stop (example below*), or failing to spot an abbreviation preceded by 
> punctuation (I've filed issue 1163 on 
> Jira <https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
> 
> Even though the documentation is very good, I feel it lacks some best 
> practices and suggestions. For instance:
> 
>  *   Is my sentence detection training set supposed to consist of coherent 
> documents, or will a bunch of random sentences with a blank line every 20-30 
> sentences work?
>  *   Do my training examples in OpenNLP native format need to be formatted in 
> a special way? Will the algo ignore stuff like extra whitespace or tabs 
> between words? Do examples with a lot of punctuation like quotes or 
> parentheses somehow affect the outcome?
>  *   How many training examples (or events) are recommended?
>  *   Is it better to provide a case sensitive abbreviation dictionary vs case 
> insensitive?
>  *   Is issue 1163 a known problem? I think other languages such as French 
> might have the same thing happening.
>  *   Are there examples of complete production-grade data sets in Italian or 
> other languages that have been successfully used to train openNLP tools?
> 
> I believe I could find answers to most of these questions by just looking at 
> the code, but someone who has already been through it could perhaps point me 
> in the right direction.
> Basically, I'm asking for best practices and pro tips.
> 
> Thank you
> 
> * failure to recognize EOS full stop:
> SENT_1: Molteplici furono i passi che portarono alla nascita di questa 
> disciplina.
> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è 
> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel 
> 1623, grazie a Willhelm Sickhart, si arrivò a creare macchine in grado di 
> effettuare calcoli matematici con numeri fino a sei cifre, anche se non in 
> maniera autonoma.
> 
> 
> Gabriele Vaccari
> Dedalus SpA
> 
