I didn't write the cTAKES sentence detector, so I can't answer definitively, but 
I do know it was originally written using what is now a pretty old version of 
OpenNLP and needed some things you couldn't get from the out-of-the-box OpenNLP 
at the time. From what I remember, the things specific to it were:
- the list of end-of-sentence candidate characters
- the handling of newlines

-- James

-----Original Message-----
From: Tim Miller [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Monday, January 27, 2014 1:45 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior


On 01/27/2014 02:35 PM, Masanz, James J. wrote:
> Tim, is the training data something you can share publicly? Or privately?  I 
> can't publicly share the data that has been used to train the sentence 
> detector, I can only share the models that get built. And you can't build a 
> model from an existing model + more data, you need all the training data 
> together.

It is from the MIMIC corpus which I definitely can't share publicly, but 
it's worth looking into whether I could share it privately with another 
person who has a signed data use agreement.

> Regarding how quickly we can get this out there, I can train a new sentence 
> detector in a day or two. But that's just the first step - to really 
> incorporate this, I would suggest this be a point release.   We would need a 
> release manager for that.  Right now I don't have time for that.  I haven't 
> heard a consensus saying whether this should be the new behavior.

Yeah, I suppose this is subject to the scale of the changes we make.

> From what I remember, we are going to need code changes to make optional the 
> code that splits at line breaks. Or was your test replacing the existing 
> cTAKES sentence detector and just using OpenNLP directly?

That is a good point, and something I was wondering about. Having now 
looked at both the cTAKES and OpenNLP code for the sentence splitter, it 
seems like there is a lot of overlap. I would've thought it was just a 
matter of converting annotations into our type system. So I'm curious if 
there is some justification for why there seems to be duplication (or if 
I'm hallucinating it).

Tim


>
> -- James
>
> -----Original Message-----
> From: Tim Miller [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Monday, January 27, 2014 8:52 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> OK, with the most recent version I am able to replicate the performance
> I was getting before. Thanks a lot Jörn!
>
> Assuming this is in the next incremental release of OpenNLP, how quickly
> can we get a re-trained model into cTAKES? I heard from a researcher at
> AMIA who tried cTAKES and, because of this bug in the way we handle
> sentences, was trying to find an outside sentence detector as a
> preprocessing step for cTAKES, and frankly that is insane. We should be able to
> get something this simple right. And I think this is the kind of thing
> that can leave new users scratching their heads and doubting our overall
> competence.
>
> James, I believe you are usually the one who rebuilds the models? What
> would be the best way to incorporate the data I have that has some
> instances of non-sentence terminating newlines?
>
> Tim
>
>
> On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
>> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>>> remains. I believe the problem is in SentenceSampleStream, where in the
>>> following block the whitespace trim happens before the <LF> character is
>>> replaced with the \n character. So test sentences that ended with <LF>
>>> will be one character longer than they should be.
>>>
>>>>>        sentence = sentence.trim();
>>>>>        sentence = replaceNewLineEscapeTags(sentence);
>>>>>        sentencesString.append(sentence);
>>>>>        int end = sentencesString.length();
>>>>>        sentenceSpans.add(new Span(begin, end));
>>>>>        sentencesString.append(' ');
>> Yes, that must be the issue. During training the newline is included
>> in the span, and during
>> detection the whitespace remover creates a span without the newline
>> char.
>>
>> I suggest that the evaluator just ignores white space differences
>> between sentences. My test case then
>> has the expected performance numbers.
>>
>> What do you think?
>>
>> Anyway, I committed the change. Please give it a try.
>>
>> Jörn
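
The ordering issue discussed in the quoted SentenceSampleStream block above can be illustrated with a small sketch. This is not the OpenNLP code: the class name, the sample text, and the stand-in replaceNewLineEscapeTags helper below are assumptions made for illustration, and the helper only handles the <LF> and <CR> tags. It shows that trimming before the tag is replaced leaves a trailing newline in the sentence, so its span ends up one character longer than when the tag is replaced first and the result is then trimmed.

```java
class TrimOrderDemo {
    // Hypothetical stand-in for the escape-tag replacement mentioned in the
    // thread; the real OpenNLP helper may behave differently.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n").replace("<CR>", "\r");
    }

    // Order described in the thread: trim first, then replace the tag.
    // trim() cannot remove "<LF>" because it is not whitespace yet, so the
    // newline produced afterwards stays at the end of the sentence.
    static String trimThenReplace(String s) {
        return replaceNewLineEscapeTags(s.trim());
    }

    // Alternative order: replace the tag first, then trim, so the trailing
    // newline is stripped along with other surrounding whitespace.
    static String replaceThenTrim(String s) {
        return replaceNewLineEscapeTags(s).trim();
    }

    public static void main(String[] args) {
        String raw = "Patient denies chest pain.<LF>";
        int trimFirst = trimThenReplace(raw).length();
        int replaceFirst = replaceThenTrim(raw).length();
        System.out.println("trim then replace: span length " + trimFirst);
        System.out.println("replace then trim: span length " + replaceFirst);
        System.out.println("difference: " + (trimFirst - replaceFirst));
    }
}
```

Whether the right fix is to reorder these calls or, as suggested in the thread, to have the evaluator ignore whitespace differences between spans is a design choice this sketch does not settle; it only shows where the extra character comes from.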
