On 01/27/2014 02:35 PM, Masanz, James J. wrote:
Tim, is the training data something you can share publicly? Or privately? I
can't publicly share the data that has been used to train the sentence
detector; I can only share the models that get built. And you can't build a
model from an existing model plus more data; you need all the training data
together.
It is from the MIMIC corpus which I definitely can't share publicly, but
it's worth looking into whether I could share it privately with another
person who has a signed data use agreement.
Regarding how quickly we can get this out there, I can train a new sentence
detector in a day or two. But that's just the first step - to really
incorporate this, I would suggest this be a point release. We would need a
release manager for that. Right now I don't have time for that. I haven't
heard a consensus saying whether this should be the new behavior.
Yeah I suppose this is subject to the scale of the changes we make.
From what I remember, we are going to need code changes to make the code
that splits at line breaks optional. Or was your test replacing the existing
cTAKES sentence detector and just using OpenNLP directly?
That is a good point, and something I was wondering about. Having now
looked at both the cTAKES and OpenNLP code for the sentence splitter, it
seems like there is a lot of overlap. I would've thought it was just a
matter of converting annotations into our type system. So I'm curious if
there is some justification for why there seems to be duplication (or if
I'm hallucinating it).
Tim
-- James
-----Original Message-----
From: Tim Miller [mailto:[email protected]]
Sent: Monday, January 27, 2014 8:52 AM
To: [email protected]
Subject: Re: sentence detector newline behavior
OK, with the most recent version I am able to replicate the performance
I was getting before. Thanks a lot Jörn!
Assuming this is in the next incremental release of OpenNLP, how quickly
can we get a re-trained model into cTAKES? I heard from a researcher at
AMIA who tried cTAKES and, because of this bug in the way we handle
sentences, was trying to find an outside sentence detector to run as a
preprocessing step before cTAKES, and frankly that is insane. We should be able to
get something this simple right. And I think this is the kind of thing
that can leave new users scratching their heads and doubting our overall
competence.
James, I believe you are usually the one who rebuilds the models? What
would be the best way to incorporate the data I have that has some
instances of non-sentence terminating newlines?
Tim
On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
On 01/26/2014 11:29 PM, Miller, Timothy wrote:
Yes, this fixes the whitespace sentence issue but the evaluation issue
remains. I believe the problem is in SentenceSampleStream, where in the
following block the whitespace trim happens before the <LF> character is
replaced with the \n character. So test sentences that ended with <LF>
will be one character longer than they should be.
sentence = sentence.trim();
sentence = replaceNewLineEscapeTags(sentence);
sentencesString.append(sentence);
int end = sentencesString.length();
sentenceSpans.add(new Span(begin, end));
sentencesString.append(' ');
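To make the off-by-one concrete, here is a minimal sketch of the ordering problem described above. It is not the OpenNLP code: `replaceLf` is a hypothetical stand-in for `replaceNewLineEscapeTags` that handles only the `<LF>` tag, and the class and method names are made up for illustration.

```java
class TrimOrderDemo {
    // Hypothetical stand-in for the escape-tag replacement; handles only <LF>.
    static String replaceLf(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order used in the quoted block: trim first, then replace the tag.
    // The literal "<LF>" is not whitespace, so trim() leaves it alone and
    // the resulting newline survives at the end of the sentence.
    static int lengthTrimFirst(String raw) {
        return replaceLf(raw.trim()).length();
    }

    // Reversed order: replace the tag first, so trim() removes the newline.
    static int lengthReplaceFirst(String raw) {
        return replaceLf(raw).trim().length();
    }

    public static void main(String[] args) {
        String raw = "Hello world.<LF>";
        System.out.println(lengthTrimFirst(raw));    // prints 13
        System.out.println(lengthReplaceFirst(raw)); // prints 12
    }
}
```

With trim-then-replace, a sentence ending in `<LF>` comes out one character longer, which matches the span mismatch seen in evaluation.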
Yes, that must be the issue. During training the newline is included
in the span, and during
detection the whitespace remover creates a span without the newline
char.
I suggest that the evaluator simply ignore whitespace differences
between sentences. With that change, my test case
has the expected performance numbers.
What do you think?
Anyway, I committed the change. Please give it a try.
Jörn
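For readers following along, one way an evaluator could ignore whitespace differences is to shrink both the reference and predicted spans to their non-whitespace extent before comparing them. The sketch below is only an illustration under that assumption; it is not the committed OpenNLP change, and the class name, method names, and the `int[]{begin, end}` span representation are all hypothetical.

```java
import java.util.Arrays;

class WhitespaceTolerantSpans {
    // Shrinks the half-open range [begin, end) so that it neither starts
    // nor ends on a whitespace character of the underlying text.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans count as equal if they match after whitespace trimming.
    static boolean sameIgnoringWhitespace(CharSequence text, int[] a, int[] b) {
        return Arrays.equals(trimSpan(text, a[0], a[1]),
                             trimSpan(text, b[0], b[1]));
    }

    public static void main(String[] args) {
        String text = "First sentence.\nSecond sentence.";
        int[] reference = {0, 16}; // includes the trailing newline
        int[] predicted = {0, 15}; // newline already stripped
        System.out.println(sameIgnoringWhitespace(text, reference, predicted)); // prints true
    }
}
```

Under this comparison, a reference span that carries a trailing newline and a predicted span that does not are scored as the same sentence.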