Re: sentence detector newline behavior

Steven Bethard Tue, 21 May 2013 08:37:55 -0700

On May 21, 2013, at 9:02 AM, Tim Miller <timothy.mil...@childrens.harvard.edu> 
wrote:
> I think the whole reason to use a machine learning approach for sentence 
> detection should be to help weigh evidence with these cases where hard 
> rules cause problems, mainly 1) when a period does not end a sentence, 
> but also 2) where a newline does and does not mean end of sentence.


Perhaps we should consider re-training the OpenNLP sentence segmenter on some 
clinical data? Presumably we can get sentences from the TreeBank annotations.

I don't know much about the OpenNLP sentence segmenter though. Does it only 
classify on periods? We'd want to classify all periods and newlines. And we'd 
want to add features that capture patterns like "XXX: YYY".
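
As a rough sketch of what that could look like (Python for brevity; nothing here 
is OpenNLP or cTAKES API, and the names boundary_candidates, features and 
HEADING_RE are made up for illustration), every period and newline becomes a 
candidate boundary, and each candidate gets a feature for whether the line 
around it looks like "XXX: YYY":

```python
import re

# Assumption: a "heading" line is a short label, a colon, then some content,
# e.g. "Heart Rate: normal". This regex is illustrative, not tuned on real notes.
HEADING_RE = re.compile(r"^[A-Za-z][A-Za-z /]*:\s+\S")


def boundary_candidates(text):
    """Return the offsets of every period and newline in the text."""
    return [i for i, ch in enumerate(text) if ch in ".\n"]


def features(text, i):
    """Build a small feature dict for the candidate boundary at offset i."""
    line_start = text.rfind("\n", 0, i) + 1
    next_end = text.find("\n", i + 1)
    if next_end == -1:
        next_end = len(text)
    current_line = text[line_start:i]
    # Only a newline candidate has a "next line" that starts right after it.
    next_line = text[i + 1:next_end] if text[i] == "\n" else ""
    prev_tokens = current_line.split()
    return {
        "char": "newline" if text[i] == "\n" else "period",
        "prev_token": prev_tokens[-1].lower() if prev_tokens else "",
        "line_is_heading": bool(HEADING_RE.match(current_line)),
        "next_line_is_heading": bool(HEADING_RE.match(next_line)),
    }
```

The feature dicts would then go to whatever classifier we train; a newline after 
"and" and a newline between two "XXX: YYY" lines end up with clearly different 
features.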

Steve

> It is of course bad that in your example if you don't put a sentence break 
> you will think that "extravascular findings" is negated. But it is also 
> bad if you put a sentence break immediately after the word "and" at the 
> end of a line and then you find that your language model thinks that 
> "and <eos>" is a good bigram.
> 
> I will create a jira for the parameter thing, and try to implement it 
> and see if it gets ok results with the existing model.
> Tim
> 
> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>> +1 for adding a boolean parameter, or perhaps instead a list of section IDs
>> 
>> The sentence detector model was trained on data that always breaks at 
>> carriage returns.
>> 
>> It is important for text that is a list, something like this:
>> 
>> Heart Rate: normal
>> ENT: negative
>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>> 
>> Without a break at the line ending, the word "negative" would negate 
>> "extravascular findings".
>> 
>> 
>> -----Original Message-----
>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org 
>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On Behalf 
>> Of Miller, Timothy
>> Sent: Tuesday, May 21, 2013 7:07 AM
>> To: dev@ctakes.apache.org
>> Subject: sentence detector newline behavior
>> 
>> The sentence detector always ends a sentence where there are newlines.
>> This is a problem for some notes (e.g. MIMIC radiology notes) where a
>> line can wrap in the middle of a sentence at specified character
>> offsets. In the comments for SentenceDetector, it seems to be split up
>> very logically in that it first runs the opennlp sentence detector, then
>> breaks any detected sentence wherever there is a newline. Questions:
>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>> 2) If that section was removed/avoided, does the opennlp sentence
>> detector give good results given our model? Or is the model trained on
>> text that always breaks at carriage returns?
>> 
>> Tim
>
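
For reference, the two-step behavior described in the original question (run the 
detector, then break each detected sentence at newlines) together with the 
proposed boolean parameter could be sketched roughly as below. This is an 
illustration in Python, not the cTAKES SentenceDetector code: the function name 
split_on_newlines and the parameter name break_on_newlines are hypothetical, and 
the spans are assumed to be (begin, end) character offsets from some upstream 
detector. It only looks at "\n", so "\r\n" line endings would need extra handling.

```python
def split_on_newlines(text, spans, break_on_newlines=True):
    """Post-process (begin, end) sentence spans from an upstream detector.

    With break_on_newlines=True, every span is further split at each newline.
    With break_on_newlines=False, the detector's spans are returned unchanged.
    """
    if not break_on_newlines:
        return list(spans)
    result = []
    for begin, end in spans:
        start = begin
        for i in range(begin, end):
            if text[i] == "\n":
                # Skip pieces that contain only whitespace.
                if text[start:i].strip():
                    result.append((start, i))
                start = i + 1
        if text[start:end].strip():
            result.append((start, end))
    return result
```

With the flag on, "ENT: negative" and the "EXTRAVASCULAR FINDINGS" line come out 
as separate sentences; with it off, a line-wrapped sentence stays in one piece 
and the quality then depends entirely on the model.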
