So perhaps we could re-train it to disambiguate newline characters as well?
Steve

On May 21, 2013, at 11:33 AM, "Savova, Guergana" <guergana.sav...@childrens.harvard.edu> wrote:

> The model is trained to disambiguate punctuation characters which in most
> cases is the period.
> --Guergana
>
> -----Original Message-----
> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
> Sent: Tuesday, May 21, 2013 12:07 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> On May 21, 2013, at 9:53 AM, "Savova, Guergana"
> <guergana.sav...@childrens.harvard.edu> wrote:
>> The OpenNLP sentence segmenter is trained on clinical data (cannot remember
>> exactly how many sentences were in the training corpus). This is the model
>> distributed with cTAKES. The only hard rule is the new line.
>
> If it's trained on clinical data, why does it need a hard rule for that? Why
> isn't the model able to learn when to break on a newline or not?
>
> Steve
>
>> --Guergana
>>
>> -----Original Message-----
>> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
>> Sent: Tuesday, May 21, 2013 11:38 AM
>> To: dev@ctakes.apache.org
>> Subject: Re: sentence detector newline behavior
>>
>> On May 21, 2013, at 9:02 AM, Tim Miller
>> <timothy.mil...@childrens.harvard.edu> wrote:
>>> I think the whole reason to use a machine learning approach for
>>> sentence detection should be to help weigh evidence with these cases
>>> where hard rules cause problems, mainly 1) when a period does not end
>>> a sentence, but also 2) where a newline does and does not mean end of
>>> sentence.
>>
>> Perhaps we should consider re-training the OpenNLP sentence segmenter on
>> some clinical data? Presumably we can get sentences from the TreeBank
>> annotations.
>>
>> I don't know much about the OpenNLP sentence segmenter though. Does it only
>> classify on periods? We'd want to classify all periods and newlines. And
>> we'd want to add features that capture patterns like "XXX: YYY".
>>
>> Steve
>>
>>> It is of course bad that in your example if you don't put a sentence
>>> break you will think that "extravascular findings" is negated. But it
>>> is also bad if you put a sentence break immediately after the word
>>> "and" at the end of a line and then you find that your language model
>>> thinks that "and <eos>" is a good bigram.
>>>
>>> I will create a jira for the parameter thing, and try to implement it
>>> and see if it gets ok results with the existing model.
>>> Tim
>>>
>>> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>>>> +1 for adding a boolean parameter, or perhaps instead a list of
>>>> +section IDs
>>>>
>>>> The sentence detector model was trained on data that always breaks at
>>>> carriage returns.
>>>>
>>>> It is important for text that is a list something like this:
>>>>
>>>> Heart Rate: normal
>>>> ENT: negative
>>>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>>>>
>>>> And without breaking on the line ending, the word negative would
>>>> negate extravascular findings
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
>>>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On
>>>> Behalf Of Miller, Timothy
>>>> Sent: Tuesday, May 21, 2013 7:07 AM
>>>> To: dev@ctakes.apache.org
>>>> Subject: sentence detector newline behavior
>>>>
>>>> The sentence detector always ends a sentence where there are newlines.
>>>> This is a problem for some notes (e.g. MIMIC radiology notes) where
>>>> a line can wrap in the middle of a sentence at specified character
>>>> offsets. In the comments for SentenceDetector, it seems to be split
>>>> up very logically in that it first runs the opennlp sentence
>>>> detector, then breaks any detected sentence wherever there is a newline.
>>>> Questions:
>>>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>>>> 2) If that section was removed/avoided, does the opennlp sentence
>>>> detector give good results given our model? Or is the model trained
>>>> on text that always breaks at carriage returns?
>>>>
>>>> Tim
>>>
>>
>
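
For anyone following the thread, the two-stage behavior Tim describes (run the model first, then break each detected sentence at newlines) and the proposed boolean parameter can be sketched roughly as below. This is only an illustration in Python, not the cTAKES SentenceDetector code: `model_split` is a crude regex stand-in for the trained OpenNLP model, and the names `detect_sentences` and `break_on_newlines` are hypothetical. For readability the sketch also collapses whitespace when newlines are not treated as breaks, which a real annotator that tracks character offsets would not do.

```python
import re
from typing import List


def model_split(text: str) -> List[str]:
    """Stand-in for the trained model (assumption): split after '.', '!'
    or '?' when followed by whitespace. The real model is a classifier."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p.strip()]


def detect_sentences(text: str, break_on_newlines: bool = True) -> List[str]:
    """Two-stage segmentation: model first, then an optional hard newline rule.

    break_on_newlines is the hypothetical boolean parameter from the thread.
    """
    sentences: List[str] = []
    for candidate in model_split(text):
        if break_on_newlines:
            # Hard rule: every newline inside a model sentence ends a sentence.
            pieces = candidate.split("\n")
        else:
            # Trust the model alone; rejoin lines that wrapped mid-sentence.
            pieces = [" ".join(candidate.split())]
        sentences.extend(p.strip() for p in pieces if p.strip())
    return sentences


if __name__ == "__main__":
    # James's list example: the hard rule keeps "negative" away from the next line.
    listing = (
        "Heart Rate: normal\n"
        "ENT: negative\n"
        "EXTRAVASCULAR FINDINGS: Severe prostatic enlargement."
    )
    # Tim's case: a line that wraps in the middle of a sentence.
    wrapped = "The lungs are clear and\nthe heart is normal in size. No effusion."

    for flag in (True, False):
        print(flag, detect_sentences(listing, break_on_newlines=flag))
        print(flag, detect_sentences(wrapped, break_on_newlines=flag))
```

With the flag on, the list example yields three sentences but the wrapped note is cut after "and"; with it off, the wrapped note is segmented sensibly but the list collapses into one sentence, so "negative" shares a sentence with "EXTRAVASCULAR FINDINGS". Neither setting handles both inputs, which is the argument for letting a re-trained model classify newlines as well as periods.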