I didn't write the cTAKES sentence detector, so I can't answer definitively, but I do know it was originally written using what is now a pretty old version of OpenNLP and needed some things you couldn't get from the out-of-the-box OpenNLP at the time.

From what I remember the things specific to it were
- the list of end of sentence candidate characters
- and the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:[email protected]]
Sent: Monday, January 27, 2014 1:45 PM
To: [email protected]
Subject: Re: sentence detector newline behavior

On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately? I
> can't publicly share the data that has been used to train the sentence
> detector, I can only share the models that get built. And you can't build a
> model from an existing model + more data, you need all the training data
> together.

It is from the MIMIC corpus which I definitely can't share publicly, but
it's worth looking into whether I could share it privately with another
person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence
> detector in a day or two. But that's just the first step - to really
> incorporate this, I would suggest this be a point release. We would need a
> release manager for that. Right now I don't have time for that. I haven't
> heard a consensus saying whether this should be the new behavior.

Yeah I suppose this is subject to the scale of the changes we make.

> From what I remember we are going to need code changes to make optional the
> code that splits at line breaks, or was your test replacing the existing
> cTAKES sentence detector and just using OpenNLP directly.

That is a good point, and something I was wondering about. Having now
looked at both the ctakes and opennlp code for the sentence splitter it
seems like there is a lot of overlap. I would've thought it was just a
matter of converting annotations into our type system. So I'm curious if
there is some justification for why there seems to be duplication (or if
I'm hallucinating it).

Tim

>
> -- James
>
> -----Original Message-----
> From: Tim Miller [mailto:[email protected]]
> Sent: Monday, January 27, 2014 8:52 AM
> To: [email protected]
> Subject: Re: sentence detector newline behavior
>
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
>
> Assuming this is in the next incremental release of opennlp, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and because of this bug in the way we handle
> sentences was trying to find an outside sentence detector as a
> preprocess to cTAKES, and frankly that is insane. We should be able to
> get something this simple right. And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
>
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
>
> Tim
>
>
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>>> remains. I believe the problem is in SentenceSampleStream, where in the
>>> following block the whitespace trim happens before the <LF> character is
>>> replaced with the \n character. So test sentences that ended with <LF>
>>> will be one character longer than they should be.
>>>
>>>>> sentence = sentence.trim();
>>>>> sentence = replaceNewLineEscapeTags(sentence);
>>>>> sentencesString.append(sentence);
>>>>> int end = sentencesString.length();
>>>>> sentenceSpans.add(new Span(begin, end));
>>>>> sentencesString.append(' ');
>> Yes, that must be the issue. During training the new line is included
>> in the span, and during
>> detection the white space remover creates a span without the new line
>> char.
>>
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>>
>> What do you think?
>>
>> Anyway, I committed the change. Please give it a try.
>>
>> Jörn
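
For anyone reading this thread later, the ordering issue Tim describes can be reproduced without OpenNLP. The sketch below is only an illustration: the class name and both helper methods are made up for this example, and the stand-in for replaceNewLineEscapeTags assumes the training format escapes a newline as the literal tag <LF>, as the quoted message suggests. It shows that trimming before unescaping leaves the newline at the end of the sentence, so the resulting span is one character longer than when the order is reversed.

```java
public class TrimOrderDemo {

    // Hypothetical stand-in for replaceNewLineEscapeTags; assumes a
    // newline is escaped in the training data as the literal tag "<LF>".
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order quoted in the thread: trim first, then unescape.
    // A trailing "<LF>" is not whitespace yet, so trim() leaves it alone
    // and the sentence ends up with a trailing newline.
    static String trimThenReplace(String sentence) {
        return replaceNewLineEscapeTags(sentence.trim());
    }

    // Reversed order, shown for comparison only: unescape first, then
    // trim, which strips the trailing newline.
    static String replaceThenTrim(String sentence) {
        return replaceNewLineEscapeTags(sentence).trim();
    }

    public static void main(String[] args) {
        String raw = "Patient denies chest pain.<LF>";
        int first = trimThenReplace(raw).length();
        int second = replaceThenTrim(raw).length();
        System.out.println("trim then replace: " + first);
        System.out.println("replace then trim: " + second);
        System.out.println("difference: " + (first - second));
    }
}
```

Reversing the order is just one way to make the two lengths agree. The change Jörn describes above takes a different route and has the evaluator ignore whitespace differences between sentences.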
