For clarity, I'd like to stress that the opennlp sentence model distributed with ctakes today does 'work' with sentences that span newlines - as I understand it, this model ignores newline tokens (or newlines are not provided as features to that model).
I believe the improvements Tim and others are suggesting are for a new sentence model + feature representation that takes advantage of newlines as features.

Whatever we do, I believe we need backwards compatibility - those who are using the current sentence model may need to continue using it. To that end:

* If we upgrade to the newest version of opennlp, will the old model work (and produce the same results)?
* If a contributor trains a new model that uses a different feature representation, I believe that should go into a new Sentence Detector AnalysisEngine (or the same AE but with different configuration parameters), so users have a choice between the old and the new.

-vj

On Mon, Jan 27, 2014 at 1:09 PM, digital paula <cybersat...@hotmail.com> wrote:

> Tim,
>
> I just had to chime in on a comment you made. My deadline has been extended a bit on my pressing issue but I do intend to get back to testing per VJ's fix or maybe another fix is in the works based on latest emails...I need to read them again since a lot has been stated on the issue.
>
> Okay, as a new user (working w/cTAKES since October) I have never thought what you had stated:
>
> "And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence."
>
> Yeah, the sentence-spanning-newline issue was a problem so I just brought attention to it by my post of inquiry earlier this month on VJ's fix from last month and worked around it with treating narrative as one string.
>
> Anyone who's looked at the code would appreciate and acknowledge that cTAKES is a powerful and complex application. I'm overall impressed with it and I intend to continue to use it, improve it, and grow with it. I've been delving deeper into cTAKES on the machine learning aspect...I'm struggling a bit with it and if anything I scratch my head and doubt my competence.
> ;-)
>
> Regards,
> Paula
>
> > Date: Mon, 27 Jan 2014 09:52:00 -0500
> > From: timothy.mil...@childrens.harvard.edu
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn!
> >
> > Assuming this is in the next incremental release of opennlp, how quickly can we get a re-trained model into cTAKES? I heard from a researcher at AMIA who tried cTAKES and because of this bug in the way we handle sentences was trying to find an outside sentence detector as a preprocess to cTAKES, and frankly that is insane. We should be able to get something this simple right. And I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.
> >
> > James, I believe you are usually the one who rebuilds the models? What would be the best way to incorporate the data I have that has some instances of non-sentence terminating newlines?
> >
> > Tim
> >
> > On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> > > On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> > >> Yes, this fixes the whitespace sentence issue but the evaluation issue remains. I believe the problem is in SentenceSampleStream, where in the following block the whitespace trim happens before the <LF> character is replaced with the \n character. So test sentences that ended with <LF> will be one character longer than they should be.
> > >>
> > >>> > sentence = sentence.trim();
> > >>> > sentence = replaceNewLineEscapeTags(sentence);
> > >>> > sentencesString.append(sentence);
> > >>> > int end = sentencesString.length();
> > >>> > sentenceSpans.add(new Span(begin, end));
> > >>> > sentencesString.append(' ');
> > >
> > > Yes, that must be the issue.
> > > During training the new line is included in the span, and during detection the white space remover creates a span without the new line char.
> > >
> > > I suggest that the evaluator just ignores white space differences between sentences. My test case then has the expected performance numbers.
> > >
> > > What do you think?
> > >
> > > Anyway, I committed the change. Please give it a try.
> > >
> > > Jörn
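
For anyone following the thread, here is a minimal sketch of the off-by-one Tim describes in the quoted block. It is not the OpenNLP source: the class name, the method names, and the sample sentence are made up for illustration, and replaceNewLineEscapeTags below is a simplified stand-in that only maps the <LF> tag to '\n'. It only shows why trimming before the tag replacement leaves a trailing newline inside the span.

```java
// Illustrative only: a stand-in for the trim/replace ordering discussed above,
// not the actual SentenceSampleStream code.
class TrimOrderDemo {

    // Simplified stand-in (assumption): map the <LF> escape tag to a newline.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order from the quoted block: trim first, then replace the tag.
    // trim() cannot strip "<LF>" because it is not whitespace yet,
    // so the newline it becomes stays at the end of the sentence.
    static int spanLengthTrimFirst(String sentence) {
        sentence = sentence.trim();
        sentence = replaceNewLineEscapeTags(sentence);
        return sentence.length();
    }

    // Swapped order: replace the tag first, then trim.
    // The trailing '\n' is now whitespace and gets removed.
    static int spanLengthReplaceFirst(String sentence) {
        sentence = replaceNewLineEscapeTags(sentence);
        sentence = sentence.trim();
        return sentence.length();
    }

    public static void main(String[] args) {
        String sample = "Patient is stable.<LF>"; // hypothetical training line
        System.out.println("trim first:    " + spanLengthTrimFirst(sample));
        System.out.println("replace first: " + spanLengthReplaceFirst(sample));
    }
}
```

With the trim-first order the span is one character longer than the visible sentence, which matches the mismatch between training spans and detected spans described above. Swapping the two calls is only one possible remedy; Jörn's committed change instead has the evaluator ignore white space differences.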