Re: openNLP best practices - sentence detector

Gabriele Vaccari Fri, 08 Dec 2017 08:31:56 -0800

Hi Dan,
Thank you for your reply. 
The Paisà corpus can be downloaded in a format similar to the OpenNLP one, 
which, as I understand it, is plain sentences, one per line, without special 
markup.
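
As a minimal sketch of that layout: one sentence per line, with an empty line between documents (the OpenNLP manual describes an empty line as a document boundary). The function name `to_opennlp_lines` and the sample sentences are placeholders chosen for illustration, not part of OpenNLP or of the Paisà download.

```python
from typing import Iterable, List


def to_opennlp_lines(documents: Iterable[List[str]]) -> str:
    """Render documents as one sentence per line, blank line between documents."""
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so each sentence stays on a single line.
        cleaned = [" ".join(s.split()) for s in sentences if s.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    if not blocks:
        return ""
    # An empty line separates consecutive documents.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    docs = [
        ["Molteplici furono i passi.", "Il primo è l'avvento dei calcolatori."],
        ["Questa è un'altra   frase."],
    ]
    print(to_opennlp_lines(docs), end="")
```

The whitespace collapsing is a choice made for this sketch only; it does not say anything about how the OpenNLP trainer itself treats extra spaces or tabs.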


Eventually I was able to train a sentence detector with the Paisà corpus and a 
few thousand randomly generated samples featuring the abbreviations. I haven't 
tested its performance against a test set yet, but empirically it appears to 
be working. 
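
A rough sketch of how such abbreviation samples could be generated, one sentence per line. The abbreviation list, the templates, and the name `make_samples` are all hypothetical and only illustrate the idea; they are not the actual data or script used here.

```python
import random
from typing import List

# Hypothetical title abbreviations; a real list would come from the
# abbreviation dictionary used for training.
ABBREVIATIONS = ["Sig.", "Dott.", "Prof.", "Ing.", "Avv."]

# Each template puts the abbreviation mid-sentence, so its full stop is not
# a sentence end. The second one places it right after punctuation.
TEMPLATES = [
    "Ho parlato con il {abbr} Rossi ieri mattina.",
    "Erano presenti tutti ({abbr} Bianchi incluso).",
    "Il documento è stato firmato dal {abbr} Verdi.",
]


def make_samples(n: int, seed: int = 0) -> List[str]:
    """Return n synthetic sentences, each containing one abbreviation."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    return [
        rng.choice(TEMPLATES).format(abbr=rng.choice(ABBREVIATIONS))
        for _ in range(n)
    ]


if __name__ == "__main__":
    for sentence in make_samples(5):
        print(sentence)
```

The generated lines can be appended to the corpus file as they are, since each one is a complete sentence on its own line.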

The Italian Universal Dependencies corpus you mention sounds like something I 
would want to have. I'll look into it.
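
Since OpenNLP is said to read CoNLL-U directly, a conversion should not be needed for training. The sketch below is only a way to inspect the corpus or merge it with the one-sentence-per-line data: it pulls the `# text = ...` comment lines out of a CoNLL-U file and starts a new document block at each `# newdoc` comment. The function name is hypothetical, and it assumes the file carries `# text` comments for every sentence.

```python
from typing import Iterable, List


def conllu_to_sentence_lines(lines: Iterable[str]) -> List[str]:
    """Extract sentence texts from CoNLL-U lines; "" marks a document boundary."""
    out: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# newdoc"):
            # Add a boundary only between documents, never at the very start
            # and never twice in a row.
            if out and out[-1] != "":
                out.append("")
        elif line.startswith("# text = "):
            out.append(line[len("# text = "):].strip())
        # Token rows and other comments are ignored in this sketch.
    return out


if __name__ == "__main__":
    sample = [
        "# newdoc id = d1\n",
        "# sent_id = 1\n",
        "# text = Ciao mondo.\n",
        "1\tCiao\tciao\tINTJ\t_\t_\t0\troot\t_\t_\n",
        "\n",
        "# newdoc id = d2\n",
        "# text = Bene.\n",
    ]
    print("\n".join(conllu_to_sentence_lines(sample)))
```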

Thanks again

Sent from iPhone

> On 08 Dec 2017, at 16:04, Dan Russ <[email protected]> wrote:
> 
> I forgot to mention that there is an Italian Universal Dependencies corpus 
> that you can use to train a sentence detector.  It is in CoNLL-U format, 
> which is supported by OpenNLP.
> Dan
> 
> 
>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> 
>> wrote:
>> 
>> Hi all,
>> 
>> I'm trying to use openNLP to train some models for Italian, basically to get 
>> some familiarity with the API. To provide some background, I'm familiar with 
>> machine learning concepts and understand what an NLP pipeline looks like, 
>> however this is the first time I actually have to go ahead and put together 
>> an application with all this.
>> So I started with the sentence detector. I was able to train an Italian SD 
>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However 
>> the performance of the detector is somewhat below my expectations. It makes 
>> pretty obvious mistakes, like failing to recognize an end-of-sentence full 
>> stop (example below*), or failing to spot an abbreviation preceded by 
>> punctuation (I've posted the issue 1163 on 
>> Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>> 
>> Even though the documentation is very good, I feel it lacks some best 
>> practices and suggestions. For instance:
>> 
>> *   Is my sentence detection training set supposed to have consistent 
>> documents or will a bunch of random sentences with a blank line every 20-30 
>> work?
>> *   Do my training examples in openNLP native format need to be formatted in 
>> a special way? Will the algo ignore stuff like extra white spaces or tabs 
>> between words? Do examples with a lot of punctuation like quotes or 
>> parentheses somehow affect the outcome?
>> *   How many training examples (or events) are recommended?
>> *   Is it better to provide a case sensitive abbreviation dictionary vs case 
>> insensitive?
>> *   Is the issue 1163 a known problem? I think other languages such as 
>> French might have the same thing happening.
>> *   Are there examples of complete production-grade data sets in Italian or 
>> other languages that have been successfully used to train openNLP tools?
>> 
>> I believe I could find answers to most of these questions by just looking 
>> at the code, but someone who has already been through it could maybe point 
>> me in the right direction.
>> Basically, I'm asking for best practices and pro tips.
>> 
>> Thank you
>> 
>> * failure to recognize EOS full stop:
>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa 
>> disciplina.
>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è 
>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel 
>> 1623, grazie a Willhelm 
>> Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
>>  si arrivò a creare macchine in grado di effettuare calcoli matematici con 
>> numeri fino a sei cifre, anche se non in maniera autonoma.
>> 
>> 
>> Gabriele Vaccari
>> Dedalus SpA
>> 
>
