I forgot to mention that there is an Italian Universal Dependencies corpus that you can use to train a sentence detector. It is in CoNLL-U format, which is supported by OpenNLP.

Dan
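
If you would rather inspect or clean the data in OpenNLP's native sentence-detector training format (one sentence per line, with an empty line marking a document boundary), a rough sketch like the one below could pull the sentences out of a CoNLL-U file. It relies on the `# text = ...` and `# newdoc` comment lines that Universal Dependencies treebanks normally carry. The function name and the sample sentences are made up for illustration, and this is not how OpenNLP itself reads CoNLL-U.

```python
def conllu_to_sentence_lines(conllu_text):
    """Return one sentence per line, with a blank line between documents.

    Hypothetical helper: it only looks at '# text =' and '# newdoc'
    comment lines and ignores the token rows entirely.
    """
    lines = []
    for raw in conllu_text.splitlines():
        line = raw.strip()
        if line.startswith("# newdoc"):
            # Blank line marks a document boundary, but never at the very top
            # and never twice in a row.
            if lines and lines[-1] != "":
                lines.append("")
        elif line.startswith("# text ="):
            # Keep everything after the first '=' as the raw sentence text.
            lines.append(line.split("=", 1)[1].strip())
    return "".join(item + "\n" for item in lines)


# Invented sample in CoNLL-U layout (token rows shortened to two rows).
SAMPLE = "\n".join([
    "# newdoc id = doc1",
    "# sent_id = 1",
    "# text = Prima frase di prova.",
    "1\tPrima\tprimo\tADJ\t_\t_\t2\tamod\t_\t_",
    "",
    "# sent_id = 2",
    "# text = Seconda frase di prova.",
    "",
    "# newdoc id = doc2",
    "# sent_id = 3",
    "# text = Terza frase, in un altro documento.",
    "1\tTerza\tterzo\tADJ\t_\t_\t2\tamod\t_\t_",
    "",
])

if __name__ == "__main__":
    print(conllu_to_sentence_lines(SAMPLE), end="")
```

The output can be written to a plain UTF-8 text file and checked by eye before training, which also makes it easy to see whether document boundaries ended up where you expect.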
> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> wrote:
>
> Hi all,
>
> I'm trying to use openNLP to train some models for Italian, basically to get
> some familiarity with the API. To provide some background, I'm familiar with
> machine learning concepts and understand what an NLP pipeline looks like,
> however this is the first time I actually have to go ahead and put together
> an application with all this.
>
> So I started with the sentence detector. I was able to train an Italian SD
> with a corpus of sentences from http://www.corpusitaliano.it/en/. However the
> performance of the detector is somewhat below my expectations. It makes
> pretty obvious mistakes, like failing to recognize an end-of-sentence full
> stop (example below*), or failing to spot an abbreviation preceded by
> punctuation (I've posted the issue 1163 on
> Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>
> Even though the documentation is very good, I feel it lacks some best
> practices and suggestions. For instance:
>
> * Is my sentence detection training set supposed to have consistent
>   documents or will a bunch of random sentences with a blank line every 20-30
>   work?
> * Do my training examples in openNLP native format need to be formatted in
>   a special way? Will the algo ignore stuff like extra white spaces or tabs
>   between words? Do examples with a lot of punctuation like quotes or
>   parenthesis somehow affect the outcome?
> * How many training examples (or events) are recommended?
> * Is it better to provide a case sensitive abbreviation dictionary vs case
>   insensitive?
> * Is the issue 1163 a known problem? I think other languages as French
>   might have the same thing happening.
> * Are there examples of complete production-grade data sets in Italian or
>   other languages that have been successfully used to train openNLP tools?
>
> I believe I could find most of these questions by just looking at the code,
> but someone who already went through it maybe could point me in the right
> direction.
> Basically, I'm asking for best practices and pro tips.
>
> Thank you
>
> * failure to recognize EOS full stop:
> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
> disciplina.
> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
> 1623, grazie a Willhelm
> Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
> si arrivò a creare macchine in grado di effettuare calcoli matematici con
> numeri fino a sei cifre, anche se non in maniera autonoma.
>
> Gabriele Vaccari
> Dedalus SpA
