So perhaps we could re-train it to disambiguate newline characters as well?

Steve

On May 21, 2013, at 11:33 AM, "Savova, Guergana" 
<guergana.sav...@childrens.harvard.edu> wrote:

> The model is trained to disambiguate punctuation characters which in most 
> cases is the period.
> --Guergana
> 
> -----Original Message-----
> From: Steven Bethard [mailto:steven.beth...@colorado.edu] 
> Sent: Tuesday, May 21, 2013 12:07 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> 
> On May 21, 2013, at 9:53 AM, "Savova, Guergana" 
> <guergana.sav...@childrens.harvard.edu> wrote:
>> The OpenNLP sentence segmenter is trained on clinical data (cannot remember 
>> exactly how many sentences were in the training corpus). This is the model 
>> distributed with cTAKES. The only hard rule is the new line.
> 
> If it's trained on clinical data, why does it need a hard rule for that? Why 
> isn't the model able to learn when to break on a newline or not?
> 
> Steve
> 
>> --Guergana
>> 
>> -----Original Message-----
>> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
>> Sent: Tuesday, May 21, 2013 11:38 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector newline behavior
>> 
>> On May 21, 2013, at 9:02 AM, Tim Miller 
>> <timothy.mil...@childrens.harvard.edu> wrote:
>>> I think the whole reason to use a machine learning approach for 
>>> sentence detection should be to help weigh evidence with these cases 
>>> where hard rules cause problems, mainly 1) when a period does not end 
>>> a sentence, but also 2) where a newline does and does not mean end of 
>>> sentence.
>> 
>> Perhaps we should consider re-training the OpenNLP sentence segmenter on 
>> some clinical data? Presumably we can get sentences from the TreeBank 
>> annotations.
>> 
>> I don't know much about the OpenNLP sentence segmenter though. Does it only 
>> classify on periods? We'd want to classify all periods and newlines. And 
>> we'd want to add features that capture patterns like "XXX: YYY".
>> 
>> Steve
>> 
>>> It is of course bad that in your example if you don't put a sentence 
>>> break you will think that "extravascular findings" is negated. But it 
>>> is also bad if you put a sentence break immediately after the word 
>>> "and" at the end of a line and then you find that your language model 
>>> thinks that "and <eos>" is a good bigram.
>>> 
>>> I will create a jira for the parameter thing, and try to implement it 
>>> and see if it gets ok results with the existing model.
>>> Tim
>>> 
>>> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>>>> +1 for adding a boolean parameter, or perhaps instead a list of 
>>>> section IDs
>>>> 
>>>> The sentence detector model was trained on data that always breaks at 
>>>> carriage returns.
>>>> 
>>>> This is important for text that is a list, something like this:
>>>> 
>>>> Heart Rate: normal
>>>> ENT: negative
>>>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>>>> 
>>>> And without breaking on the line ending, the word negative would 
>>>> negate extravascular findings
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
>>>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On 
>>>> Behalf Of Miller, Timothy
>>>> Sent: Tuesday, May 21, 2013 7:07 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: sentence detector newline behavior
>>>> 
>>>> The sentence detector always ends a sentence where there are newlines.
>>>> This is a problem for some notes (e.g. MIMIC radiology notes) where 
>>>> a line can wrap in the middle of a sentence at specified character 
>>>> offsets. In the comments for SentenceDetector, it seems to be split 
>>>> up very logically in that it first runs the opennlp sentence 
>>>> detector, then breaks any detected sentence wherever there is a newline. 
>>>> Questions:
>>>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>>>> 2) If that section was removed/avoided, does the opennlp sentence 
>>>> detector give good results given our model? Or is the model trained 
>>>> on text that always breaks at carriage returns?
>>>> 
>>>> Tim
>>> 
>> 
> 
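For reference, the boolean parameter proposed at the start of the thread could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name NewlineSplitSketch, the method postProcess, and the flag breakOnNewlines are all hypothetical, and the code does not use or reproduce the actual cTAKES SentenceDetector or the OpenNLP API. It takes whatever sentences the statistical model proposed and applies the newline rule only when the flag is set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual cTAKES SentenceDetector.
final class NewlineSplitSketch {

    /**
     * Post-processes sentences proposed by a statistical segmenter.
     * When breakOnNewlines is true, every sentence is split again at line
     * breaks (the hard rule discussed in the thread). When it is false,
     * the model's sentences are returned unchanged, so wrapped lines stay
     * together.
     */
    static List<String> postProcess(List<String> modelSentences, boolean breakOnNewlines) {
        List<String> result = new ArrayList<>();
        for (String sentence : modelSentences) {
            if (!breakOnNewlines) {
                result.add(sentence);
                continue;
            }
            // Split on any run of carriage returns and/or line feeds.
            for (String piece : sentence.split("[\\r\\n]+")) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    result.add(trimmed);
                }
            }
        }
        return result;
    }
}
```

With the flag off, a hard-wrapped sentence such as "a line can wrap in the\nmiddle of a sentence." stays a single sentence. With the flag on, a list such as "ENT: negative\nEXTRAVASCULAR FINDINGS: Severe prostatic enlargement." is split into two entries, so "negative" cannot reach the next line. Whether the existing model gives acceptable results with the flag off is the open question from the thread; this sketch does not answer it.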
