The Paisa corpus has the option to download the search results as plain text, 
with every sentence on a new line. It's not "good" plain text, though: the 
sentences are generated as if you had joined the token list with spaces, so you 
get stuff like: 

"Mi piace bere caffè , tè e succo d ' arancia ." 

As you can see, every token, including punctuation marks, is separated by 
spaces. This would mess with the trainer, so I did some regex preprocessing to 
remove the extra spaces. 
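That preprocessing can be sketched with a couple of sed substitutions (a minimal version with placeholder file names; the exact rules will depend on your corpus):

```shell
# Detokenize Paisa-style output: delete the space before punctuation marks,
# then rejoin elided articles around the apostrophe ("d ' arancia" -> "d'arancia").
# "paisa_raw.txt" and "train.txt" are hypothetical file names.
sed -e "s/ \([,;:.!?]\)/\1/g" \
    -e "s/ ' /'/g" \
    paisa_raw.txt > train.txt
```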

Beyond that, you have to take care of three things:

1) make sure you have a blank line every 20-ish lines to mark the end/start of 
documents. The docs mention this too.
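If your export is just a flat list of sentences, a crude way to add those blank lines is the one-liner below (20 is an arbitrary chunk size, and the file names are placeholders):

```shell
# Insert a blank line after every 20th sentence to fake document boundaries.
awk '{ print } NR % 20 == 0 { print "" }' train.txt > train_docs.txt
```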

2) you must call the opennlp SentenceDetectorTrainer command with the 
-eosCharacters option and specify an end-of-sentence character string. I 
recommend this if your language may separate words with punctuation other than 
white space, e.g. the apostrophe (single quote) in Italian: in the phrase 
L'AMORE, L and AMORE must be parsed as two separate tokens. This is 
especially important if you have abbreviations that might be preceded by these 
characters. The sentence detector model uses the tokens before and after the 
end-of-sentence character as features to make a prediction, but when collecting 
the token before the EOS char it reads characters backwards until it finds a 
white space OR another EOS character. Yes, the EOS chars are used to detect 
both sentence AND token boundaries.
So if you don't pay attention to this when you train the model, it will fail to 
recognize abbreviations when they are preceded by a special token-delimiting 
character.
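As a sketch, the training invocation looks something like this. The file names are placeholders, and the exact option names may differ between OpenNLP releases, so check the usage printed by `opennlp SentenceDetectorTrainer` for your version:

```shell
# Train an Italian sentence detector, declaring the apostrophe as an extra
# EOS/token-delimiting character alongside the usual sentence-final punctuation.
# Requires the opennlp CLI on your PATH; train.txt, abbreviations.xml and
# it-sent.bin are placeholder file names.
opennlp SentenceDetectorTrainer \
    -lang it \
    -encoding UTF-8 \
    -data train.txt \
    -abbDict abbreviations.xml \
    -eosCharacters ".?!'" \
    -model it-sent.bin
```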

3) I added to my training set a few thousand randomly generated sentences 
which contain the punctuated abbreviations from my abbreviations.xml file. 
The sentences are just sequences of words separated by white spaces, taken 
randomly from an Italian dictionary, and each sentence contains at least one 
punctuated abbreviation. I built the dictionary by taking each unique 
word from the other samples in my training set.
Then for each such sentence I generated another 10-15 random sentences without 
the abbreviation, combined them all into documents, and added them to the 
training set. 
The sentences make no actual sense in Italian. The point is just to feed the 
model some more examples of sentences containing abbreviations so that it 
learns to recognize them.
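A minimal sketch of that generation step, with `dott.` standing in for one abbreviation from the list and placeholder file names:

```shell
# Build a word list from the unique words of the existing training data.
tr -s ' ' '\n' < train.txt | sort -u > dictionary.txt

# Generate one pseudo-sentence: 4 random words, a punctuated abbreviation,
# 4 more random words, ended by a full stop.
abbr="dott."
first=$(shuf -n 4 dictionary.txt | paste -sd' ' -)
rest=$(shuf -n 4 dictionary.txt | paste -sd' ' -)
echo "$first $abbr $rest."
```

In the real pipeline you would loop this, keep the sentences with the abbreviation, and wrap each one in 10-15 abbreviation-free sentences to form a document.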

And that's all.
As I said, I haven't rigorously tested my assumptions yet, but empirically it 
looks like it works.

Cheers,
Gabriele

Sent from my iPhone

> On 8 Dec 2017, at 17:34, Dan Russ < @gmail.com> wrote:
> 
> That is great to hear.  Can you send out how you did it?  Did you use the raw 
> text?
> Dan
> 
>> On Dec 8, 2017, at 11:31 AM, Gabriele Vaccari <[email protected]> 
>> wrote:
>> 
>> Hi Dan,
>> Thank you for your reply. 
>> The Paisa' corpus has the option to download the files in a format similar 
>> to the openNLP one, which by the way I understand is regular sentences each 
>> on a new line without special markup.
>> 
>> Eventually I was able to train a sentence detector with the Paisa and a few 
>> thousand randomly generated samples featuring the abbreviations. I haven't 
>> tested the performance against a test set yet, but empirically it 
>> appears to be working. 
>> 
>> The Italian Universal Dependencies corpus you mention sounds like something 
>> I would want to have. I'll look into it.
>> 
>> Thanks again
>> 
>> Sent from my iPhone
>> 
>>> On 8 Dec 2017, at 16:04, Dan Russ <[email protected]> 
>>> wrote:
>>> 
>>> I forgot to mention that there is an Italian Universal Dependencies corpus 
>>> that you can use to train a sentence detector.  It is in CoNLL-U format, 
>>> which is supported by OpenNLP.
>>> Dan
>>> 
>>> 
>>>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]> 
>>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I'm trying to use openNLP to train some models for Italian, basically to 
>>>> get some familiarity with the API. To provide some background, I'm 
>>>> familiar with machine learning concepts and understand what an NLP 
>>>> pipeline looks like, however this is the first time I actually have to go 
>>>> ahead and put together an application with all this.
>>>> So I started with the sentence detector. I was able to train an Italian SD 
>>>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However 
>>>> the performance of the detector is somewhat below my expectations. It 
>>>> makes pretty obvious mistakes, like failing to recognize an 
>>>> end-of-sentence full stop (example below*), or failing to spot an 
>>>> abbreviation preceded by punctuation (I've posted issue 1163 on 
>>>> Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).
>>>> 
>>>> Even though the documentation is very good, I feel it lacks some best 
>>>> practices and suggestions. For instance:
>>>> 
>>>> *   Is my sentence detection training set supposed to have consistent 
>>>> documents or will a bunch of random sentences with a blank line every 
>>>> 20-30 work?
>>>> *   Do my training examples in openNLP native format need to be formatted 
>>>> in a special way? Will the algo ignore stuff like extra white spaces or 
>>>> tabs between words? Do examples with a lot of punctuation like quotes or 
>>>> parentheses somehow affect the outcome?
>>>> *   How many training examples (or events) are recommended?
>>>> *   Is it better to provide a case sensitive abbreviation dictionary vs 
>>>> case insensitive?
>>>> *   Is issue 1163 a known problem? I think other languages such as French 
>>>> might have the same thing happening.
>>>> *   Are there examples of complete production-grade data sets in Italian 
>>>> or other languages that have been successfully used to train openNLP tools?
>>>> 
>>>> I believe I could find most of these questions by just looking at the 
>>>> code, but someone who already went through it maybe could point me in the 
>>>> right direction.
>>>> Basically, I'm asking for best practices and pro tips.
>>>> 
>>>> Thank you
>>>> 
>>>> * failure to recognize EOS full stop:
>>>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa 
>>>> disciplina.
>>>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è 
>>>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già 
>>>> nel 1623, grazie a Willhelm 
>>>> Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
>>>>  si arrivò a creare macchine in grado di effettuare calcoli matematici con 
>>>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>>> 
>>>> 
>>>> Gabriele Vaccari
>>>> Dedalus SpA
>>>> 
>>> 
> 
