That is great to hear. Can you share how you did it? Did you use the raw text?
Dan
> On Dec 8, 2017, at 11:31 AM, Gabriele Vaccari <[email protected]> wrote:
>
> Hi Dan,
> Thank you for your reply.
> The Paisa' corpus has the option to download the files in a format similar to
> the OpenNLP one, which, as I understand it, is regular sentences, each on a
> new line, without special markup.
>
> Eventually I was able to train a sentence detector with the Paisa' corpus and
> a few thousand randomly generated samples featuring the abbreviations. I
> haven't tested the performance against a test set yet, but empirically it
> appears to be working.
>
> The Italian Universal Dependencies corpus you mention sounds like something I
> would want to have. I'll look into it.
>
> Thanks again
>
> Sent from my iPhone
>
>> On Dec 8, 2017, at 4:04 PM, Dan Russ <[email protected]> wrote:
>>
>> I forgot to mention that there is an Italian Universal Dependencies corpus
>> that you can use to train a sentence detector. It is in CoNLL-U format,
>> which is supported by OpenNLP.
>> Dan
>>
>>
>>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to use OpenNLP to train some models for Italian, basically to
>>> get some familiarity with the API. For background: I'm familiar with
>>> machine learning concepts and understand what an NLP pipeline looks like,
>>> but this is the first time I have actually had to go ahead and put
>>> together an application with all of this.
>>> So I started with the sentence detector. I was able to train an Italian SD
>>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However,
>>> the performance of the detector is somewhat below my expectations. It
>>> makes fairly obvious mistakes, like failing to recognize an
>>> end-of-sentence full stop (example below*), or failing to spot an
>>> abbreviation preceded by punctuation (I've filed issue OPENNLP-1163 on
>>> Jira <https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>>>
>>> Even though the documentation is very good, I feel it lacks some best
>>> practices and suggestions. For instance:
>>>
>>> * Is my sentence detection training set supposed to consist of coherent
>>>   documents, or will a bunch of random sentences with a blank line every
>>>   20-30 lines work?
>>> * Do my training examples in the OpenNLP native format need to be
>>>   formatted in a special way? Will the algorithm ignore things like extra
>>>   white space or tabs between words? Do examples with a lot of
>>>   punctuation, like quotes or parentheses, somehow affect the outcome?
>>> * How many training examples (or events) are recommended?
>>> * Is it better to provide a case-sensitive or a case-insensitive
>>>   abbreviation dictionary?
>>> * Is issue OPENNLP-1163 a known problem? I think other languages, such as
>>>   French, might have the same thing happening.
>>> * Are there examples of complete production-grade data sets in Italian or
>>>   other languages that have been successfully used to train OpenNLP tools?
>>>
>>> I believe I could find the answers to most of these questions by just
>>> looking at the code, but someone who has already been through it could
>>> perhaps point me in the right direction.
>>> Basically, I'm asking for best practices and pro tips.
>>>
>>> Thank you
>>>
>>> * Failure to recognize an EOS full stop:
>>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
>>> disciplina.
>>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
>>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già
>>> nel 1623, grazie a Willhelm Sickhart
>>> <https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
>>> si arrivò a creare macchine in grado di effettuare calcoli matematici con
>>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>>
>>>
>>> Gabriele Vaccari
>>> Dedalus SpA
>>
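[Editor's note: the thread above describes training an Italian sentence detector from one-sentence-per-line data, optionally with an abbreviation dictionary. The following is a minimal sketch of how that can look with the OpenNLP Java API; the file names ("it-sent.train", "it-abbr.xml", "it-sent.bin") are placeholders, and it assumes the OpenNLP Tools jar is on the classpath. It is not the exact setup either poster used.]

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainItalianSentDetect {
    public static void main(String[] args) throws Exception {
        // Training data in OpenNLP's native sentence-detector format:
        // one sentence per line, with an empty line between documents.
        InputStreamFactory in =
            new MarkableFileInputStreamFactory(new File("it-sent.train"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        // Optional abbreviation dictionary in OpenNLP's XML dictionary
        // format; pass null to train without one.
        Dictionary abbrev;
        try (FileInputStream dictIn = new FileInputStream("it-abbr.xml")) {
            abbrev = new Dictionary(dictIn);
        }

        // Factory wires in the language, the abbreviation dictionary, and
        // the end-of-sentence characters to consider.
        SentenceDetectorFactory factory = SentenceDetectorFactory.create(
            null, "it", true, abbrev, new char[] {'.', '?', '!'});

        SentenceModel model = SentenceDetectorME.train(
            "it", samples, factory, TrainingParameters.defaultParams());

        // Persist the model so it can be reloaded later.
        try (OutputStream out =
                 new BufferedOutputStream(new FileOutputStream("it-sent.bin"))) {
            model.serialize(out);
        }

        // Quick smoke test on a two-sentence string.
        SentenceDetectorME detector = new SentenceDetectorME(model);
        for (String s : detector.sentDetect("Prima frase. Seconda frase.")) {
            System.out.println(s);
        }
    }
}
```

Evaluation against a held-out set (the step Gabriele mentions not having done yet) can use `opennlp.tools.sentdetect.SentenceDetectorEvaluator`, or the `opennlp SentenceDetectorEvaluator` command-line tool.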
