RE: sentence detector newline behavior

digital paula Mon, 27 Jan 2014 10:13:06 -0800

Tim,
 
I just had to chime in on a comment you made. My deadline on my pressing issue 
has been extended a bit, but I do intend to get back to testing VJ's fix, or 
maybe another fix that is in the works based on the latest emails. I need to 
read them again, since a lot has been said on the issue. 
 
Okay, as a new user (working with cTAKES since October), I have never thought 
what you stated:
 
 "And I think this is the kind of thing that can leave new users scratching 
their heads and doubting our overall competence."  
 
Yeah, the sentence-spanning-newline issue was a problem, so I brought attention 
to it with my post earlier this month asking about VJ's fix from last month, 
and I worked around it by treating the narrative as one string.  
 
Anyone who's looked at the code would appreciate and acknowledge that cTAKES is 
a powerful and complex application.  I'm overall impressed with it and I intend 
to continue to use it, improve it, and grow with it.  I've been delving deeper 
into cTAKES on the machine learning aspect...I'm struggling a bit with it and 
if anything I scratch my head and doubt my competence. ;-)  
 
Regards,
Paula
 
> Date: Mon, 27 Jan 2014 09:52:00 -0500
> From: [email protected]
> To: [email protected]
> Subject: Re: sentence detector newline behavior
> 
> OK, with the most recent version I am able to replicate the performance 
> I was getting before. Thanks a lot Jörn!
> 
> Assuming this is in the next incremental release of opennlp, how quickly 
> can we get a re-trained model into cTAKES? I heard from a researcher at 
> AMIA who tried cTAKES and because of this bug in the way we handle 
> sentences was trying to find an outside sentence detector as a 
> preprocess to cTAKES, and frankly that is insane. We should be able to 
> get something this simple right. And I think this is the kind of thing 
> that can leave new users scratching their heads and doubting our overall 
> competence.
> 
> James, I believe you are usually the one who rebuilds the models? What 
> would be the best way to incorporate the data I have that has some 
> instances of non-sentence terminating newlines?
> 
> Tim
> 
> 
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> > On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> >> Yes, this fixes the whitespace sentence issue but the evaluation issue
> >> remains. I believe the problem is in SentenceSampleStream, where in the
> >> following block the whitespace trim happens before the <LF> character is
> >> replaced with the \n character. So test sentences that ended with <LF>
> >> will be one character longer than they should be.
> >>
> >>> >       sentence = sentence.trim();
> >>> >       sentence = replaceNewLineEscapeTags(sentence);
> >>> >       sentencesString.append(sentence);
> >>> >       int end = sentencesString.length();
> >>> >       sentenceSpans.add(new Span(begin, end));
> >>> >       sentencesString.append(' ');
> >
> > Yes, that must be the issue. During training the new line is included 
> > in the span, and during
> > detection the white space remover creates a span without the new line 
> > char.
> >
> > I suggest that the evaluator just ignores white space differences 
> > between sentences. My test case then
> > has the expected performance numbers.
> >
> > What do you think?
> >
> > Anyway, I committed the change. Please give it a try.
> >
> > Jörn
>
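
For readers following the thread, here is a minimal sketch of the ordering
issue Tim describes in the quoted SentenceSampleStream block. It is not the
OpenNLP source: the class name, the sample text, and the local
replaceNewLineEscapeTags helper are hypothetical stand-ins, and the sketch
assumes the training format marks a line feed with a literal <LF> tag. It only
shows why trimming before the tag is expanded leaves the sentence one character
longer than trimming afterwards.

```java
public class NewlineTrimDemo {

    // Hypothetical stand-in for the helper named in the thread; assumes a
    // literal "<LF>" tag marks a line feed in the training data.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order quoted in the thread: trim() runs first, so it cannot see the
    // newline that the tag expands into afterwards.
    static String trimThenReplace(String s) {
        return replaceNewLineEscapeTags(s.trim());
    }

    // Alternative order: expand the tag first, so trim() strips the
    // trailing newline along with other surrounding whitespace.
    static String replaceThenTrim(String s) {
        return replaceNewLineEscapeTags(s).trim();
    }

    public static void main(String[] args) {
        String sample = "Patient denies chest pain.<LF>";
        System.out.println("trim then replace: " + trimThenReplace(sample).length());
        System.out.println("replace then trim: " + replaceThenTrim(sample).length());
    }
}
```

Swapping the order is only one possible remedy. The change Jörn describes in
the thread takes a different route and has the evaluator ignore whitespace
differences between sentences.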