I am not sure this makes a real difference in Italian, but there are numerous
cases where the annotated tokens are converted from upper to lower case. In
text id 7000002, the sentence:
E’ l’arredamento, soprattutto, a colpire.
is annotated:
1 è essere V V num=s|per=3|mod=i|ten=p 0 ROOT
2 l' il R RD num=s|gen=n 3 det
3 arredamento arredamento S S num=s|gen=m 1 pred
4 , , F FF _ 5 punc
5 soprattutto soprattutto B B _ 3 mod
6 , , F FF _ 5 punc
7 a a E E _ 1 arg
8 colpire colpire V V mod=f 7 prep
9 . . F FS _ 1 punc
Notice that the first word is not capitalized.
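
If you want to gauge how widespread this is, a quick sketch along these lines would flag every sentence whose first token starts lowercase (the file name and the tab-separated column layout are assumptions on my part; adjust them to the actual export):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LowercaseFirstToken {
        public static void main(String[] args) throws IOException {
            boolean atSentenceStart = true;
            int lineNo = 0;
            for (String line : Files.readAllLines(Paths.get("isst_tanl.conll"))) {
                lineNo++;
                if (line.trim().isEmpty()) {          // blank line ends a sentence
                    atSentenceStart = true;
                    continue;
                }
                if (atSentenceStart) {
                    String[] cols = line.split("\t"); // column 2 holds the token form
                    if (cols.length > 1 && Character.isLowerCase(cols[1].charAt(0))) {
                        System.out.println("line " + lineNo + ": starts lowercase: " + cols[1]);
                    }
                    atSentenceStart = false;
                }
            }
        }
    }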
Dan
> On Dec 8, 2017, at 11:31 AM, Gabriele Vaccari <[email protected]>
> wrote:
>
> Hi Dan,
> Thank you for your reply.
> The Paisa' corpus offers the option to download the files in a format similar to
> the OpenNLP one, which, as I understand it, is just regular sentences, each on a
> new line, without special markup.
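> For reference, a file in that shape would look roughly like this (made-up
> sentences; one per line, with an empty line marking a document boundary,
> which is how I understand the OpenNLP trainer reads it):
>
>     Il gatto dorme sul divano.
>     Piove da tre giorni.
>
>     L'arredamento colpisce subito.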
>
> Eventually I was able to train a sentence detector with the Paisa' data and a
> few thousand randomly generated samples featuring the abbreviations. I haven't
> tested the performance against a test set yet, but empirically it appears to be
> working.
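>
> When I do test it properly, the plan is roughly this: point a
> SentenceDetectorEvaluator at a held-out file in the same one-sentence-per-line
> format (file names here are placeholders):
>
>     import java.io.File;
>     import java.nio.charset.StandardCharsets;
>     import opennlp.tools.sentdetect.*;
>     import opennlp.tools.util.*;
>
>     public class EvalSentDetect {
>         public static void main(String[] args) throws Exception {
>             // Load the trained model (path is a placeholder)
>             SentenceModel model = new SentenceModel(new File("it-sent.bin"));
>             SentenceDetectorME detector = new SentenceDetectorME(model);
>
>             // Held-out sentences, one per line, never seen during training
>             ObjectStream<String> lines = new PlainTextByLineStream(
>                     new MarkableFileInputStreamFactory(new File("it-sent.test")),
>                     StandardCharsets.UTF_8);
>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>
>             SentenceDetectorEvaluator evaluator = new SentenceDetectorEvaluator(detector);
>             evaluator.evaluate(samples);
>             System.out.println(evaluator.getFMeasure());
>         }
>     }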
>
> The Italian Universal Dependencies corpus you mention sounds like something I
> would want to have. I'll look into it.
>
> Thanks again
>
> Sent from iPhone
>
>> On Dec 8, 2017, at 4:04 PM, Dan Russ <[email protected]> wrote:
>>
>> I forgot to mention that there is an Italian Universal Dependencies corpus
>> that you can use to train a sentence detector. It is in CoNLL-U format,
>> which is supported by OpenNLP.
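>>
>> If the built-in CoNLL-U reader gives you any trouble, the UD release files
>> also carry the raw sentence in "# text = ..." comment lines, so a few lines
>> of Java will flatten them into the one-sentence-per-line format the generic
>> trainer expects (file names here are placeholders):
>>
>>     import java.io.IOException;
>>     import java.io.PrintWriter;
>>     import java.nio.file.Files;
>>     import java.nio.file.Paths;
>>
>>     public class ConlluToSentences {
>>         public static void main(String[] args) throws IOException {
>>             try (PrintWriter out = new PrintWriter("it-sent.train", "UTF-8")) {
>>                 // Keep only the raw-text comments, one sentence per line
>>                 for (String line : Files.readAllLines(Paths.get("it_isdt-ud-train.conllu"))) {
>>                     if (line.startsWith("# text = ")) {
>>                         out.println(line.substring("# text = ".length()));
>>                     }
>>                 }
>>             }
>>         }
>>     }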
>> Dan
>>
>>
>>> On Dec 1, 2017, at 7:02 AM, Gabriele Vaccari <[email protected]>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to use OpenNLP to train some models for Italian, basically to
>>> get some familiarity with the API. To provide some background, I'm familiar
>>> with machine learning concepts and understand what an NLP pipeline looks
>>> like; however, this is the first time I have actually had to go ahead and
>>> put together an application with all of this.
>>> So I started with the sentence detector. I was able to train an Italian SD
>>> with a corpus of sentences from http://www.corpusitaliano.it/en/. However,
>>> the performance of the detector is somewhat below my expectations. It makes
>>> pretty obvious mistakes, like failing to recognize an end-of-sentence full
>>> stop (example below*) or failing to spot an abbreviation preceded by
>>> punctuation (I've filed issue OPENNLP-1163 on Jira,
>>> https://issues.apache.org/jira/browse/OPENNLP-1163, for this).
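>>>
>>> For context, this is more or less how I'm invoking the detector (the model
>>> path is a placeholder):
>>>
>>>     import java.io.File;
>>>     import opennlp.tools.sentdetect.SentenceDetectorME;
>>>     import opennlp.tools.sentdetect.SentenceModel;
>>>
>>>     public class DetectSentences {
>>>         public static void main(String[] args) throws Exception {
>>>             SentenceModel model = new SentenceModel(new File("it-sent.bin"));
>>>             SentenceDetectorME detector = new SentenceDetectorME(model);
>>>             String text = "Molteplici furono i passi. Il primo è l'avvento dei calcolatori.";
>>>             for (String sentence : detector.sentDetect(text)) {
>>>                 System.out.println(sentence);
>>>             }
>>>         }
>>>     }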
>>>
>>> Even though the documentation is very good, I feel it lacks some best
>>> practices and suggestions. For instance:
>>>
>>> * Is my sentence detection training set supposed to consist of coherent
>>> documents, or will a bunch of random sentences with a blank line every 20-30
>>> sentences work?
>>> * Do my training examples in OpenNLP native format need to be formatted
>>> in a special way? Will the algorithm ignore things like extra white space or
>>> tabs between words? Do examples with a lot of punctuation, like quotes or
>>> parentheses, somehow affect the outcome?
>>> * How many training examples (or events) are recommended?
>>> * Is it better to provide a case-sensitive abbreviation dictionary or a
>>> case-insensitive one? (See the sketch after this list.)
>>> * Is issue 1163 a known problem? I think other languages, such as French,
>>> might have the same thing happening.
>>> * Are there examples of complete production-grade data sets in Italian or
>>> other languages that have been successfully used to train OpenNLP tools?
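>>>
>>> To make the abbreviation-dictionary question concrete, this is the kind of
>>> thing I have in mind: a case-sensitive Dictionary passed in through
>>> SentenceDetectorFactory at training time (file names and entries are made up):
>>>
>>>     import java.io.File;
>>>     import java.io.FileOutputStream;
>>>     import java.nio.charset.StandardCharsets;
>>>     import opennlp.tools.dictionary.Dictionary;
>>>     import opennlp.tools.sentdetect.*;
>>>     import opennlp.tools.util.*;
>>>
>>>     public class TrainWithAbbrevDict {
>>>         public static void main(String[] args) throws Exception {
>>>             // true = case sensitive; the entries are just examples
>>>             Dictionary abbrevs = new Dictionary(true);
>>>             abbrevs.put(new StringList("Dott."));
>>>             abbrevs.put(new StringList("Sig."));
>>>             abbrevs.put(new StringList("ecc."));
>>>
>>>             ObjectStream<String> lines = new PlainTextByLineStream(
>>>                     new MarkableFileInputStreamFactory(new File("it-sent.train")),
>>>                     StandardCharsets.UTF_8);
>>>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>>>
>>>             SentenceDetectorFactory factory =
>>>                     new SentenceDetectorFactory("it", true, abbrevs, null);
>>>             SentenceModel model = SentenceDetectorME.train(
>>>                     "it", samples, factory, TrainingParameters.defaultParams());
>>>             try (FileOutputStream out = new FileOutputStream("it-sent.bin")) {
>>>                 model.serialize(out);
>>>             }
>>>         }
>>>     }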
>>>
>>> I believe I could answer most of these questions by just looking at the
>>> code, but someone who has already been through it could maybe point me in
>>> the right direction.
>>> Basically, I'm asking for best practices and pro tips.
>>>
>>> Thank you
>>>
>>> * failure to recognize EOS full stop:
>>> SENT_1: Molteplici furono i passi che portarono alla nascita di questa
>>> disciplina.
>>> SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
>>> l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
>>> 1623, grazie a Willhelm Sickhart
>>> (https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1),
>>> si arrivò a creare macchine in grado di effettuare calcoli matematici con
>>> numeri fino a sei cifre, anche se non in maniera autonoma.
>>>
>>>
>>> Gabriele Vaccari
>>> Dedalus SpA
>>>
>>