For clarity, I'd like to stress that the OpenNLP sentence model distributed
with cTAKES today does 'work' with sentences that span newlines. As I
understand it, this model ignores newline tokens (or newlines are not
provided as features to that model).

I believe the improvements Tim and others are suggesting are for a new
sentence model + feature representation that takes advantage of newlines as
features.

Whatever we do, I believe we need backwards compatibility - those who are
using the current sentence model may need to continue using it.  To that
end:
* If we upgrade to the newest version of OpenNLP, will the old model still
work (and produce the same results)?
* If a contributor trains a new model that uses a different feature
representation, I believe that should go into a new Sentence Detector
AnalysisEngine (or the same AE but with different configuration
parameters), so users have a choice between the old and the new.
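
As an aside on the evaluation issue quoted further down (the whitespace trim
happening before the <LF> tag is replaced), here is a minimal Java sketch of
the off-by-one effect and of a whitespace-insensitive span comparison like
the one Jörn suggests for the evaluator. This is only an illustration under
assumptions: TrimOrderDemo and all of its methods are hypothetical names,
the replaceNewLineEscapeTags below is a stand-in and not OpenNLP's own
method, and nothing here is the actual SentenceSampleStream or evaluator
code.

```java
import java.util.Arrays;

// Hypothetical, self-contained illustration; not OpenNLP or cTAKES code.
final class TrimOrderDemo {

    // Stand-in for the escape-tag replacement (assumed tags: <LF>, <CR>).
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    // Order from the quoted block: trim first, then replace the tags.
    // A trailing <LF> survives the trim and becomes a trailing '\n'.
    static int spanLengthTrimFirst(String raw) {
        String sentence = raw.trim();
        sentence = replaceNewLineEscapeTags(sentence);
        return sentence.length();
    }

    // Swapped order: replace the tags first, then trim, so the trailing
    // newline is removed along with other whitespace.
    static int spanLengthReplaceFirst(String raw) {
        String sentence = replaceNewLineEscapeTags(raw);
        return sentence.trim().length();
    }

    // Shrink [begin, end) so it neither starts nor ends on whitespace.
    static int[] shrinkToNonWhitespace(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {begin, end};
    }

    // Two spans match if they cover the same non-whitespace extent.
    static boolean sameSpanIgnoringWhitespace(CharSequence text,
                                              int b1, int e1, int b2, int e2) {
        return Arrays.equals(shrinkToNonWhitespace(text, b1, e1),
                             shrinkToNonWhitespace(text, b2, e2));
    }

    public static void main(String[] args) {
        String raw = "He was admitted.<LF>";
        System.out.println(spanLengthTrimFirst(raw));    // prints 17
        System.out.println(spanLengthReplaceFirst(raw)); // prints 16

        String text = "He was admitted.\nNext";
        // Reference span includes the newline, predicted span does not.
        System.out.println(sameSpanIgnoringWhitespace(text, 0, 17, 0, 16));
    }
}
```

With the trim-first order the span comes out one character longer whenever a
sentence ends in <LF>, which matches the mismatch described in the quoted
messages. Either reordering the two calls or comparing spans after shrinking
them to their non-whitespace extent would hide that difference.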

-vj


On Mon, Jan 27, 2014 at 1:09 PM, digital paula <cybersat...@hotmail.com> wrote:

>
> Tim,
>
> I just had to chime in on a comment you made. My deadline on my pressing
> issue has been extended a bit, but I do intend to get back to testing
> VJ's fix, or maybe another fix is in the works based on the latest
> emails...I need to read them again since a lot has been said on the issue.
>
> Okay, as a new user (working w/cTAKES since October) I have never thought
> what you stated:
>
>  "And I think this is the kind of thing that can leave new users
> scratching their heads and doubting our overall competence."
>
> Yeah, the sentence-spanning-newline issue was a problem, so I brought
> attention to it with my inquiry earlier this month about VJ's fix from
> last month, and worked around it by treating the narrative as one string.
>
> Anyone who's looked at the code would appreciate and acknowledge that
> cTAKES is a powerful and complex application.  I'm overall impressed with
> it and I intend to continue to use it, improve it, and grow with it.  I've
> been delving deeper into cTAKES on the machine learning aspect...I'm
> struggling a bit with it and if anything I scratch my head and doubt my
> competence. ;-)
>
> Regards,
> Paula
>
> > Date: Mon, 27 Jan 2014 09:52:00 -0500
> > From: timothy.mil...@childrens.harvard.edu
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > OK, with the most recent version I am able to replicate the performance
> > I was getting before. Thanks a lot Jörn!
> >
> > Assuming this is in the next incremental release of opennlp, how quickly
> > can we get a re-trained model into cTAKES? I heard from a researcher at
> > AMIA who tried cTAKES and because of this bug in the way we handle
> > sentences was trying to find an outside sentence detector as a
> > preprocess to cTAKES, and frankly that is insane. We should be able to
> > get something this simple right. And I think this is the kind of thing
> > that can leave new users scratching their heads and doubting our overall
> > competence.
> >
> > James, I believe you are usually the one who rebuilds the models? What
> > would be the best way to incorporate the data I have that has some
> > instances of non-sentence terminating newlines?
> >
> > Tim
> >
> >
> > On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> > > On 01/26/2014 11:29 PM, Miller, Timothy wrote:
> > >> Yes, this fixes the whitespace sentence issue but the evaluation issue
> > >> remains. I believe the problem is in SentenceSampleStream, where in the
> > >> following block the whitespace trim happens before the <LF> character is
> > >> replaced with the \n character. So test sentences that ended with <LF>
> > >> will be one character longer than they should be.
> > >>
> > >>> >       sentence = sentence.trim();
> > >>> >       sentence = replaceNewLineEscapeTags(sentence);
> > >>> >       sentencesString.append(sentence);
> > >>> >       int end = sentencesString.length();
> > >>> >       sentenceSpans.add(new Span(begin, end));
> > >>> >       sentencesString.append(' ');
> > >
> > > Yes, that must be the issue. During training the newline is included
> > > in the span, and during detection the whitespace remover creates a
> > > span without the newline char.
> > >
> > > I suggest that the evaluator just ignores white space differences
> > > between sentences. My test case then
> > > has the expected performance numbers.
> > >
> > > What do you think?
> > >
> > > Anyway, I committed the change. Please give it a try.
> > >
> > > Jörn
> >
>
>
>
