Hi Dan,

Thank you for your reply. The Paisà corpus can be downloaded in a format similar to the OpenNLP one, which, as I understand it, is plain sentences, one per line, with no special markup.
Eventually I was able to train a sentence detector with Paisà and a few thousand randomly generated samples featuring the abbreviations. I haven't tested its performance against a test set yet, but empirically it appears to be working.

The Italian Universal Dependencies corpus you mention sounds like something I would want to have. I'll look into it.

Thanks again

Sent from iPhone

> On 8 Dec 2017, at 16:04, Dan Russ <[email protected]> wrote:
>
> I forgot to mention that there is an Italian Universal Dependencies corpus
> that you can use to train a sentence detector. It is in CoNLL-U format,
> which is supported by OpenNLP.
> Dan
>
>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]>
>> wrote:
>>
>> Hi all,
>>
>> I'm trying to use OpenNLP to train some models for Italian, basically to get
>> some familiarity with the API. To provide some background, I'm familiar with
>> machine learning concepts and understand what an NLP pipeline looks like;
>> however, this is the first time I actually have to go ahead and put together
>> an application with all this.
>> So I started with the sentence detector. I was able to train an Italian SD
>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However,
>> the performance of the detector is somewhat below my expectations. It makes
>> pretty obvious mistakes, like failing to recognize an end-of-sentence full
>> stop (example below*), or failing to spot an abbreviation preceded by
>> punctuation (I've posted issue 1163 on
>> Jira <https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>>
>> Even though the documentation is very good, I feel it lacks some best
>> practices and suggestions. For instance:
>>
>> * Is my sentence detection training set supposed to consist of coherent
>> documents, or will a bunch of random sentences with a blank line every 20-30
>> work?
>> * Do my training examples in OpenNLP native format need to be formatted in
>> a special way? Will the algorithm ignore things like extra white space or
>> tabs between words? Do examples with a lot of punctuation, like quotes or
>> parentheses, somehow affect the outcome?
>> * How many training examples (or events) are recommended?
>> * Is it better to provide a case-sensitive abbreviation dictionary or a
>> case-insensitive one?
>> * Is issue 1163 a known problem? I think other languages, such as French,
>> might have the same thing happening.
>> * Are there examples of complete production-grade data sets in Italian or
>> other languages that have been successfully used to train OpenNLP tools?
>>
>> I believe I could find answers to most of these questions by just looking
>> at the code, but someone who has already been through it could maybe point
>> me in the right direction.
>> Basically, I'm asking for best practices and pro tips.
>>
>> Thank you
>>
>> * failure to recognize EOS full stop:
>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
>> disciplina.
>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
>> 1623, grazie a Willhelm
>> Sickhart <https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
>> si arrivò a creare macchine in grado di effettuare calcoli matematici con
>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>
>>
>> Gabriele Vaccari
>> Dedalus SpA
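
For illustration only, here is a rough Python sketch of the kind of training data described at the top of this thread: plain sentences, one per line, mixed with randomly generated samples that feature abbreviations, and an empty line every few dozen sentences as a document boundary. The abbreviation list, names, templates, and function names are all hypothetical, and the whitespace collapsing is just a tidy-up choice of this sketch, not something OpenNLP is known to require. It does not reproduce the actual generator used for the model mentioned above.

```python
import random

# Hypothetical abbreviations, names, and templates, for illustration only.
ABBREVIATIONS = ["Sig.", "Dott.", "Prof.", "Ing.", "Avv."]
NAMES = ["Rossi", "Bianchi", "Verdi", "Russo"]
TEMPLATES = [
    "Il {abbr} {name} ha firmato il documento.",
    "Ieri ho incontrato il {abbr} {name} in ufficio.",
    "La relazione del {abbr} {name} è stata approvata.",
]


def synthetic_sentences(count, seed=0):
    """Return `count` random sentences that each contain one abbreviation."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    return [
        rng.choice(TEMPLATES).format(
            abbr=rng.choice(ABBREVIATIONS), name=rng.choice(NAMES)
        )
        for _ in range(count)
    ]


def to_training_text(sentences, doc_size=25):
    """Lay sentences out one per line, with an empty line every `doc_size` sentences."""
    lines = []
    for i, sentence in enumerate(sentences):
        if i > 0 and i % doc_size == 0:
            lines.append("")  # empty line marks a document boundary
        # Collapse stray tabs and repeated spaces (a choice made in this sketch).
        lines.append(" ".join(sentence.split()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a small sample; real corpus sentences would be mixed in the same way.
    print(to_training_text(synthetic_sentences(6), doc_size=3), end="")
```

In practice the synthetic sentences would be shuffled in with the real corpus sentences before writing the file, so the abbreviations appear in varied contexts rather than in one block.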
