I am not sure this makes a real difference in Italian, but there are numerous
cases where the annotated tokens are converted from upper to lower case. In
text id 7000002, the sentence:
E’ l’arredamento, soprattutto, a colpire.
is annotated:
1 è essere V V num=s|per=3|mod=i|ten=p 0 ROOT
2 l' il R RD num=s|gen=n 3 det
3 arredamento arredamento S S num=s|gen=m 1 pred
4 , , F FF _ 5 punc
5 soprattutto soprattutto B B _ 3 mod
6 , , F FF _ 5 punc
7 a a E E _ 1 arg
8 colpire colpire V V mod=f 7 prep
9 . . F FS _ 1 punc
Notice that the first word is not capitalized.
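
If you want to gauge how widespread this is, a quick sketch along these lines would flag every sentence whose first token starts lowercase (the file name and the tab-separated column layout are assumptions on my part; adjust them to the actual export):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LowercaseFirstToken {
        public static void main(String[] args) throws IOException {
            boolean atSentenceStart = true;
            int lineNo = 0;
            for (String line : Files.readAllLines(Paths.get("isst_tanl.conll"))) {
                lineNo++;
                if (line.trim().isEmpty()) {          // blank line ends a sentence
                    atSentenceStart = true;
                    continue;
                }
                if (atSentenceStart) {
                    String[] cols = line.split("\t"); // column 2 holds the token form
                    if (cols.length > 1 && Character.isLowerCase(cols[1].charAt(0))) {
                        System.out.println("line " + lineNo + ": starts lowercase: " + cols[1]);
                    }
                    atSentenceStart = false;
                }
            }
        }
    }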
Dan
> On Dec 8, 2017, at 11:31 AM, Gabriele Vaccari <[email protected]>
> wrote:
>
> Hi Dan,
> Thank you for your reply.
> The Paisa' corpus offers the option to download the files in a format similar to
> the OpenNLP one, which, as I understand it, is just regular sentences, each on a
> new line, without special markup.
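> For reference, a file in that shape would look roughly like this (made-up
> sentences; one per line, with an empty line marking a document boundary,
> which is how I understand the OpenNLP trainer reads it):
>
>     Il gatto dorme sul divano.
>     Piove da tre giorni.
>
>     L'arredamento colpisce subito.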
>
> Eventually I was able to train a sentence detector with the Paisa' data and a
> few thousand randomly generated samples featuring the abbreviations. I haven't
> tested the performance against a test set yet, but empirically it appears to be
> working.
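>
> When I do test it properly, the plan is roughly this: point a
> SentenceDetectorEvaluator at a held-out file in the same one-sentence-per-line
> format (file names here are placeholders):
>
>     import java.io.File;
>     import java.nio.charset.StandardCharsets;
>     import opennlp.tools.sentdetect.*;
>     import opennlp.tools.util.*;
>
>     public class EvalSentDetect {
>         public static void main(String[] args) throws Exception {
>             // Load the trained model (path is a placeholder)
>             SentenceModel model = new SentenceModel(new File("it-sent.bin"));
>             SentenceDetectorME detector = new SentenceDetectorME(model);
>
>             // Held-out sentences, one per line, never seen during training
>             ObjectStream<String> lines = new PlainTextByLineStream(
>                     new MarkableFileInputStreamFactory(new File("it-sent.test")),
>                     StandardCharsets.UTF_8);
>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>
>             SentenceDetectorEvaluator evaluator = new SentenceDetectorEvaluator(detector);
>             evaluator.evaluate(samples);
>             System.out.println(evaluator.getFMeasure());
>         }
>     }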
>
> The Italian Universal Dependencies corpus you mention sounds like something I
> would want to have. I'll look into it.
>
> Thanks again
>
> Sent from iPhone
>
>> On Dec 8, 2017, at 4:04 PM, Dan Russ <[email protected]> wrote:
>>
>> I forgot to mention that there is an Italian Universal Dependencies corpus
>> that you can use to train a sentence detector. It is in CoNLL-U format,
>> which is supported by OpenNLP.
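>>
>> If the built-in CoNLL-U reader gives you any trouble, the UD release files
>> also carry the raw sentence in "# text = ..." comment lines, so a few lines
>> of Java will flatten them into the one-sentence-per-line format the generic
>> trainer expects (file names here are placeholders):
>>
>>     import java.io.IOException;
>>     import java.io.PrintWriter;
>>     import java.nio.file.Files;
>>     import java.nio.file.Paths;
>>
>>     public class ConlluToSentences {
>>         public static void main(String[] args) throws IOException {
>>             try (PrintWriter out = new PrintWriter("it-sent.train", "UTF-8")) {
>>                 // Keep only the raw-text comments, one sentence per line
>>                 for (String line : Files.readAllLines(Paths.get("it_isdt-ud-train.conllu"))) {
>>                     if (line.startsWith("# text = ")) {
>>                         out.println(line.substring("# text = ".length()));
>>                     }
>>                 }
>>             }
>>         }
>>     }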
>> Dan
>>
>>
>>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to use OpenNLP to train some models for Italian, basically to
>>> get some familiarity with the API. To provide some background, I'm familiar
>>> with machine learning concepts and understand what an NLP pipeline looks
>>> like; however, this is the first time I have actually had to go ahead and
>>> put together an application with all of this.
>>> So I started with the sentence detector. I was able to train an Italian SD
>>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However,
>>> the performance of the detector is somewhat below my expectations. It makes
>>> pretty obvious mistakes, like failing to recognize an end-of-sentence full
>>> stop (example below*) or failing to spot an abbreviation preceded by
>>> punctuation (I've filed issue OPENNLP-1163 on Jira,
>>> https://issues.apache.org/jira/browse/OPENNLP-1163, for this).
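>>>
>>> For context, this is more or less how I'm invoking the detector (the model
>>> path is a placeholder):
>>>
>>>     import java.io.File;
>>>     import opennlp.tools.sentdetect.SentenceDetectorME;
>>>     import opennlp.tools.sentdetect.SentenceModel;
>>>
>>>     public class DetectSentences {
>>>         public static void main(String[] args) throws Exception {
>>>             SentenceModel model = new SentenceModel(new File("it-sent.bin"));
>>>             SentenceDetectorME detector = new SentenceDetectorME(model);
>>>             String text = "Molteplici furono i passi. Il primo è l'avvento dei calcolatori.";
>>>             for (String sentence : detector.sentDetect(text)) {
>>>                 System.out.println(sentence);
>>>             }
>>>         }
>>>     }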
>>>
>>> Even though the documentation is very good, I feel it lacks some best
>>> practices and suggestions. For instance:
>>>
>>> * Is my sentence detection training set supposed to consist of coherent
>>> documents, or will a bunch of random sentences with a blank line every 20-30
>>> sentences work?
>>> * Do my training examples in OpenNLP native format need to be formatted
>>> in a special way? Will the algorithm ignore things like extra white space or
>>> tabs between words? Do examples with a lot of punctuation, like quotes or
>>> parentheses, somehow affect the outcome?
>>> * How many training examples (or events) are recommended?
>>> * Is it better to provide a case-sensitive abbreviation dictionary or a
>>> case-insensitive one? (See the sketch after this list.)
>>> * Is issue 1163 a known problem? I think other languages, such as French,
>>> might have the same thing happening.
>>> * Are there examples of complete production-grade data sets in Italian or
>>> other languages that have been successfully used to train OpenNLP tools?
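>>>
>>> To make the abbreviation-dictionary question concrete, this is the kind of
>>> thing I have in mind: a case-sensitive Dictionary passed in through
>>> SentenceDetectorFactory at training time (file names and entries are made up):
>>>
>>>     import java.io.File;
>>>     import java.io.FileOutputStream;
>>>     import java.nio.charset.StandardCharsets;
>>>     import opennlp.tools.dictionary.Dictionary;
>>>     import opennlp.tools.sentdetect.*;
>>>     import opennlp.tools.util.*;
>>>
>>>     public class TrainWithAbbrevDict {
>>>         public static void main(String[] args) throws Exception {
>>>             // true = case sensitive; the entries are just examples
>>>             Dictionary abbrevs = new Dictionary(true);
>>>             abbrevs.put(new StringList("Dott."));
>>>             abbrevs.put(new StringList("Sig."));
>>>             abbrevs.put(new StringList("ecc."));
>>>
>>>             ObjectStream<String> lines = new PlainTextByLineStream(
>>>                     new MarkableFileInputStreamFactory(new File("it-sent.train")),
>>>                     StandardCharsets.UTF_8);
>>>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>>>
>>>             SentenceDetectorFactory factory =
>>>                     new SentenceDetectorFactory("it", true, abbrevs, null);
>>>             SentenceModel model = SentenceDetectorME.train(
>>>                     "it", samples, factory, TrainingParameters.defaultParams());
>>>             try (FileOutputStream out = new FileOutputStream("it-sent.bin")) {
>>>                 model.serialize(out);
>>>             }
>>>         }
>>>     }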
>>>
>>> I believe I could answer most of these questions by just looking at the
>>> code, but someone who has already been through it could maybe point me in
>>> the right direction.
>>> Basically, I'm asking for best practices and pro tips.
>>>
>>> Thank you
>>>
>>> * failure to recognize EOS full stop:
>>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
>>> disciplina.
>>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
>>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
>>> 1623, grazie a Willhelm Sickhart
>>> (https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1),
>>> si arrivò a creare macchine in grado di effettuare calcoli matematici con
>>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>>
>>>
>>> Gabriele Vaccari
>>> Dedalus SpA
>>>
>>