I have 1966 very short documents (as in the previous example), which are
split into 2525 segments.
As you can guess, many documents should not be split at all.
For evaluation I've simply split the data-set into two equal parts.
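
For reference, this is roughly how I compute the figures quoted below.
It's untested as pasted here, and the model and file names are just
placeholders:

import java.io.FileInputStream;
import java.io.InputStreamReader;

import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class EvalSD {
    public static void main(String[] args) throws Exception {
        // "it-sent.bin" and "validation.txt" are placeholder names;
        // validation.txt is the held-out half, in the same
        // one-segment-per-line format as the training half
        SentenceModel model = new SentenceModel(new FileInputStream("it-sent.bin"));
        SentenceDetectorEvaluator evaluator =
                new SentenceDetectorEvaluator(new SentenceDetectorME(model));

        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(new FileInputStream("validation.txt"), "UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        evaluator.evaluate(samples);
        // FMeasure's toString prints Precision, Recall and F-Measure
        System.out.println(evaluator.getFMeasure());
        samples.close();
    }
}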
Is the Sentence Detector also able to split on non-dot characters? In my
case other characters should also delimit the end of a segment, such as
the colon (:), the dash (-), and various kinds of quotation marks (", `, ',
...).
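
Looking at the javadocs of newer builds, there seems to be a
SentenceDetectorFactory that accepts a custom eosCharacters array; I'm
not sure whether 1.5.2 already supports it. This is the kind of thing I
mean (just a sketch, untested, with placeholder file names):

import java.io.FileInputStream;
import java.io.InputStreamReader;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSD {
    public static void main(String[] args) throws Exception {
        // candidate end-of-segment characters for my corpus; I'm assuming
        // the factory actually honors this array during training
        char[] eos = { '.', ':', '-', '"', '\'' };
        SentenceDetectorFactory factory =
                new SentenceDetectorFactory("it", true, null, eos);

        // training.txt: one segment per line, blank line between documents
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(new FileInputStream("training.txt"), "UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        SentenceModel model = SentenceDetectorME.train("it", samples, factory, params);
        samples.close();
    }
}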
The other pitfall is that I shouldn't split on every dot.
For example:
"Hello, my name is Riccardo. I've studied computer science in 2002 - he
said - and I finished in 2009." Then he began to type something on his
keyboard. It was binary code!
It should be segmented as:
"Hello, my name is Riccardo. I've studied computer science in 2002
- he said -
and I finished in 2009."
Then he began to type something on his keyboard. It was binary code!
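
For completeness, this is how I'd apply the trained model (continuing
from the training sketch above, so model is the SentenceModel it
produced):

// apply the trained model to the raw document
SentenceDetectorME detector = new SentenceDetectorME(model);
String[] segments = detector.sentDetect(
        "\"Hello, my name is Riccardo. I've studied computer science in 2002"
        + " - he said - and I finished in 2009.\" Then he began to type"
        + " something on his keyboard. It was binary code!");
for (String segment : segments) {
    System.out.println(segment);  // ideally prints the four segments listed above
}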
Cheers,
Riccardo
2013/3/26 James Kosin <[email protected]>
> On 3/25/2013 11:31 AM, Riccardo Tasso wrote:
>
>> Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences
>> (without abbreviations) which represent speeches.
>>
>> I have a fairly big data-set annotated by human experts, in which each
>> document is a line of text, segmented into one or more pieces depending on
>> our needs.
>>
>> To better understand my case, if the line is the following:
>> I'm not able to play tennis - he said - You're right - replied his wife
>>
>> The right segmentation should be:
>> I'm not able to play tennis
>> - he said -
>> You're right
>> - replied his wife
>>
>> I decided to try a statistical approach to segment my text, and the
>> SentenceDetector seems like the right choice to me.
>>
>> I've built the training set in the format specified in
>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
>> which is:
>>
>> - one segment per line
>> - a blank line to separate two documents
>>
>>
>> To evaluate performance I've divided my dataset into one part for training
>> and one for validation, but the performance was quite low:
>> Precision: 0.4485549132947977
>> Recall: 0.3038371182458888
>> F-Measure: 0.3622782446311859
>>
>> Since I've used default values, I guess there should be some way to obtain
>> better results... or maybe I need another model?
>>
>> Thanks,
>> Riccardo
>>
> Riccardo,
>
> How many sentences and documents are in your training set?
>
> James
>