OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of OpenNLP, how quickly can we get a re-trained model into cTAKES? I heard from a researcher at AMIA who tried cTAKES and, because of this bug in the way we handle sentences, was trying to find an outside sentence detector to run as a preprocessing step before cTAKES, and frankly that is insane. We should be able to get something this simple right. And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.

James, I believe you are usually the one who rebuilds the models? What would be the best way to incorporate the data I have that contains some instances of non-sentence-terminating newlines?

Tim


On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
On 01/26/2014 11:29 PM, Miller, Timothy wrote:
Yes, this fixes the whitespace sentence issue, but the evaluation issue
remains. I believe the problem is in SentenceSampleStream: in the
following block, the whitespace trim happens before the <LF> tag is
replaced with the \n character, so test sentences that ended with <LF>
will be one character longer than they should be.

>       sentence = sentence.trim();
>       sentence = replaceNewLineEscapeTags(sentence);
>       sentencesString.append(sentence);
>       int end = sentencesString.length();
>       sentenceSpans.add(new Span(begin, end));
>       sentencesString.append(' ');
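
To make the off-by-one concrete, here is a tiny standalone illustration;
the helper below is just my understanding of what replaceNewLineEscapeTags
does, and the class name is made up:

public class LfTrimOrder {

    // My understanding of replaceNewLineEscapeTags: expand the escape
    // tags back into real control characters.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    public static void main(String[] args) {
        String raw = "The patient was discharged.<LF>";

        // Current order: trim first, then expand the tag, so the trailing
        // '\n' survives and the training span is one character too long.
        String current = replaceNewLineEscapeTags(raw.trim());

        // Reversed order: expand the tag first, then trim.
        String reversed = replaceNewLineEscapeTags(raw).trim();

        System.out.println(current.length());  // 28
        System.out.println(reversed.length()); // 27
    }
}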

Yes, that must be the issue. During training the new line is included in the span, and during detection the white space remover creates a span without the new line char.

I suggest that the evaluator just ignore white space differences between sentences. My test case then has the expected performance numbers.
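
Something along these lines, just as a sketch; the class and method names
below are made up and this is not necessarily what I committed:

import opennlp.tools.util.Span;

public class WhitespaceTolerantMatch {

    // Shrink a span so it covers no leading or trailing white space.
    static Span trimSpan(CharSequence text, Span s) {
        int start = s.getStart();
        int end = s.getEnd();
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new Span(start, end);
    }

    // Two sentence spans count as a match if they are equal after trimming,
    // so a reference span that still ends with '\n' is no longer an error.
    static boolean matches(CharSequence text, Span reference, Span predicted) {
        return trimSpan(text, reference).equals(trimSpan(text, predicted));
    }
}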

What do you think?

Anyway, I committed the change. Please give it a try.

Jörn
