That is great to hear. Can you share how you did it? Did you use the raw text?
Dan
> On Dec 8, 2017, at 11:31 AM, Gabriele Vaccari <[email protected]> wrote:
>
> Hi Dan,
> Thank you for your reply.
> The Paisa' corpus has the option to download the files in a format similar to
> the OpenNLP one, which, as I understand it, is regular sentences, each on a
> new line, without special markup.
>
> Eventually I was able to train a sentence detector with the Paisa' corpus and
> a few thousand randomly generated samples featuring the abbreviations. I
> haven't tested the performance against a test set yet, but empirically it
> appears to be working.
>
> The Italian Universal Dependencies corpus you mention sounds like something I
> would want to have. I'll look into it.
>
> Thanks again
>
> Sent from my iPhone
>
>> On Dec 8, 2017, at 4:04 PM, Dan Russ <[email protected]> wrote:
>>
>> I forgot to mention that there is an Italian Universal Dependencies corpus
>> that you can use to train a sentence detector. It is in CoNLL-U format,
>> which is supported by OpenNLP.
>> Dan
>>
>>
>>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to use OpenNLP to train some models for Italian, basically to
>>> get some familiarity with the API. For background: I'm familiar with
>>> machine learning concepts and understand what an NLP pipeline looks like,
>>> but this is the first time I have actually had to go ahead and put
>>> together an application with all of this.
>>> So I started with the sentence detector. I was able to train an Italian SD
>>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However,
>>> the performance of the detector is somewhat below my expectations. It
>>> makes fairly obvious mistakes, like failing to recognize an
>>> end-of-sentence full stop (example below*), or failing to spot an
>>> abbreviation preceded by punctuation (I've filed issue OPENNLP-1163 on
>>> Jira <https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>>>
>>> Even though the documentation is very good, I feel it lacks some best
>>> practices and suggestions. For instance:
>>>
>>> * Is my sentence detection training set supposed to consist of coherent
>>>   documents, or will a bunch of random sentences with a blank line every
>>>   20-30 lines work?
>>> * Do my training examples in the OpenNLP native format need to be
>>>   formatted in a special way? Will the algorithm ignore things like extra
>>>   white space or tabs between words? Do examples with a lot of
>>>   punctuation, like quotes or parentheses, somehow affect the outcome?
>>> * How many training examples (or events) are recommended?
>>> * Is it better to provide a case-sensitive or a case-insensitive
>>>   abbreviation dictionary?
>>> * Is issue OPENNLP-1163 a known problem? I think other languages, such as
>>>   French, might have the same thing happening.
>>> * Are there examples of complete production-grade data sets in Italian or
>>>   other languages that have been successfully used to train OpenNLP tools?
>>>
>>> I believe I could find the answers to most of these questions by just
>>> looking at the code, but someone who has already been through it could
>>> perhaps point me in the right direction.
>>> Basically, I'm asking for best practices and pro tips.
>>>
>>> Thank you
>>>
>>> * Failure to recognize an EOS full stop:
>>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
>>> disciplina.
>>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
>>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già
>>> nel 1623, grazie a Willhelm Sickhart
>>> <https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
>>> si arrivò a creare macchine in grado di effettuare calcoli matematici con
>>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>>
>>>
>>> Gabriele Vaccari
>>> Dedalus SpA
>>
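[Editor's note: the thread above describes training an Italian sentence detector from one-sentence-per-line data, optionally with an abbreviation dictionary. The following is a minimal sketch of how that can look with the OpenNLP Java API; the file names ("it-sent.train", "it-abbr.xml", "it-sent.bin") are placeholders, and it assumes the OpenNLP Tools jar is on the classpath. It is not the exact setup either poster used.]

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainItalianSentDetect {
    public static void main(String[] args) throws Exception {
        // Training data in OpenNLP's native sentence-detector format:
        // one sentence per line, with an empty line between documents.
        InputStreamFactory in =
            new MarkableFileInputStreamFactory(new File("it-sent.train"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        // Optional abbreviation dictionary in OpenNLP's XML dictionary
        // format; pass null to train without one.
        Dictionary abbrev;
        try (FileInputStream dictIn = new FileInputStream("it-abbr.xml")) {
            abbrev = new Dictionary(dictIn);
        }

        // Factory wires in the language, the abbreviation dictionary, and
        // the end-of-sentence characters to consider.
        SentenceDetectorFactory factory = SentenceDetectorFactory.create(
            null, "it", true, abbrev, new char[] {'.', '?', '!'});

        SentenceModel model = SentenceDetectorME.train(
            "it", samples, factory, TrainingParameters.defaultParams());

        // Persist the model so it can be reloaded later.
        try (OutputStream out =
                 new BufferedOutputStream(new FileOutputStream("it-sent.bin"))) {
            model.serialize(out);
        }

        // Quick smoke test on a two-sentence string.
        SentenceDetectorME detector = new SentenceDetectorME(model);
        for (String s : detector.sentDetect("Prima frase. Seconda frase.")) {
            System.out.println(s);
        }
    }
}
```

Evaluation against a held-out set (the step Gabriele mentions not having done yet) can use `opennlp.tools.sentdetect.SentenceDetectorEvaluator`, or the `opennlp SentenceDetectorEvaluator` command-line tool.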
