Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences (with no abbreviations) that represent speech.
I have a fairly large dataset annotated by human experts, in which each document is a line of text, segmented into one or more pieces depending on our needs. To better understand my case, if the line is the following:

    I'm not able to play tennis - he said - You're right - replied his wife

the right segmentation should be:

    I'm not able to play tennis
    - he said -
    You're right
    - replied his wife

I decided to try a statistical approach to segment my text, and the SentenceDetector seems to be the right choice. I built the training set in the format specified in http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training which is:
- one segment per line
- a blank line to separate two documents

To evaluate performance I split my dataset into a training part and a validation part (the exact calls I used are sketched in the P.S. below), but the results were quite low:

    Precision: 0.4485549132947977
    Recall: 0.3038371182458888
    F-Measure: 0.3622782446311859

Since I trained with the default values, I guess there should be some way to obtain better results... or do I need a different kind of model?

Thanks,
Riccardo
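P.S. In case it's useful, this is roughly what my training code looks like. It is a minimal sketch following the 1.5.2 manual linked above; "it-sent.train" and "it-sent.bin" are placeholder file names:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            Charset charset = Charset.forName("UTF-8");
            // Training file: one segment per line, blank line between documents
            ObjectStream<String> lineStream =
                    new PlainTextByLineStream(new FileInputStream("it-sent.train"), charset);
            ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

            SentenceModel model;
            try {
                // "it" = language code, true = use token end,
                // null = no abbreviation dictionary, default training parameters
                model = SentenceDetectorME.train("it", sampleStream, true, null,
                        TrainingParameters.defaultParams());
            } finally {
                sampleStream.close();
            }

            // Persist the model so it can be reused for evaluation
            OutputStream modelOut = new FileOutputStream("it-sent.bin");
            try {
                model.serialize(modelOut);
            } finally {
                modelOut.close();
            }
        }
    }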

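And this is the evaluation on the held-out part (again a sketch; "it-sent.eval" is a placeholder for the validation file, in the same one-segment-per-line format):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvaluateSentenceDetector {
        public static void main(String[] args) throws Exception {
            // Load the trained model and wrap it in an evaluator
            SentenceModel model = new SentenceModel(new FileInputStream("it-sent.bin"));
            SentenceDetectorEvaluator evaluator =
                    new SentenceDetectorEvaluator(new SentenceDetectorME(model));

            Charset charset = Charset.forName("UTF-8");
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                    new PlainTextByLineStream(new FileInputStream("it-sent.eval"), charset));
            try {
                evaluator.evaluate(samples);
            } finally {
                samples.close();
            }

            // Prints Precision/Recall/F-Measure in the format quoted above
            System.out.println(evaluator.getFMeasure());
        }
    }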