OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of OpenNLP, how quickly can we get a re-trained model into cTAKES? I heard from a researcher at AMIA who tried cTAKES and, because of this bug in the way we handle sentences, was trying to find an outside sentence detector to run as a preprocessing step before cTAKES, and frankly that is insane. We should be able to get something this simple right. And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.

James, I believe you are usually the one who rebuilds the models? What would be the best way to incorporate the data I have that contains some instances of non-sentence-terminating newlines?

Tim


On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
On 01/26/2014 11:29 PM, Miller, Timothy wrote:
Yes, this fixes the whitespace sentence issue, but the evaluation issue
remains. I believe the problem is in SentenceSampleStream: in the
following block, the whitespace trim happens before the <LF> tag is
replaced with the \n character, so test sentences that ended with <LF>
will be one character longer than they should be.

>       sentence = sentence.trim();
>       sentence = replaceNewLineEscapeTags(sentence);
>       sentencesString.append(sentence);
>       int end = sentencesString.length();
>       sentenceSpans.add(new Span(begin, end));
>       sentencesString.append(' ');
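
To make the off-by-one concrete, here is a tiny standalone illustration;
the helper below is just my understanding of what replaceNewLineEscapeTags
does, and the class name is made up:

public class LfTrimOrder {

    // My understanding of replaceNewLineEscapeTags: expand the escape
    // tags back into real control characters.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    public static void main(String[] args) {
        String raw = "The patient was discharged.<LF>";

        // Current order: trim first, then expand the tag, so the trailing
        // '\n' survives and the training span is one character too long.
        String current = replaceNewLineEscapeTags(raw.trim());

        // Reversed order: expand the tag first, then trim.
        String reversed = replaceNewLineEscapeTags(raw).trim();

        System.out.println(current.length());  // 28
        System.out.println(reversed.length()); // 27
    }
}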

Yes, that must be the issue. During training the new line is included in the span, and during detection the white space remover creates a span without the new line char.

I suggest that the evaluator just ignore white space differences between sentences. My test case then has the expected performance numbers.
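
Something along these lines, just as a sketch; the class and method names
below are made up and this is not necessarily what I committed:

import opennlp.tools.util.Span;

public class WhitespaceTolerantMatch {

    // Shrink a span so it covers no leading or trailing white space.
    static Span trimSpan(CharSequence text, Span s) {
        int start = s.getStart();
        int end = s.getEnd();
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new Span(start, end);
    }

    // Two sentence spans count as a match if they are equal after trimming,
    // so a reference span that still ends with '\n' is no longer an error.
    static boolean matches(CharSequence text, Span reference, Span predicted) {
        return trimSpan(text, reference).equals(trimSpan(text, predicted));
    }
}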

What do you think?

Anyway, I committed the change. Please give it a try.

Jörn
