One method I use for finding headings is to match a "term" followed by either two or more whitespace characters, or a symbol (colon, comma, dash) followed by one or more whitespace characters. It's really naive, but it works well because the "term" comes from a controlled set. That's not much help in your first example above unless those sections can be predefined.
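To make that concrete, here is a rough Python sketch of the idea. The term list, the function names and the exact regex are made up for illustration (in practice the terms would come from whatever controlled section vocabulary you have), so treat it as a sketch under those assumptions and not as the code I actually run:

```python
import re

# Hypothetical controlled vocabulary; a real list would come from your own
# section dictionary.
HEADING_TERMS = ["PE", "Lymphnodes", "Lungs", "CV", "Breast", "Abdomen"]


def build_heading_pattern(terms):
    """Compile a regex for: term, then 2+ whitespace OR one of : , - plus 1+ whitespace."""
    # Try longer terms first so a long term wins over one of its prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternation + r")(?:\s{2,}|[:,\-]\s+)")


def find_headings(text, terms=HEADING_TERMS):
    """Return (term, start_offset) pairs for every heading-like match in text."""
    pattern = build_heading_pattern(terms)
    return [(m.group(1), m.start()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    note = ("PE: Lymphnodes: neck and axilla without adenopathy "
            "Lungs: normal and clear to auscultation CV: regular rate")
    for term, offset in find_headings(note):
        print(offset, term)
```

The match is case-sensitive on purpose in this sketch, so "breast" in running text is not picked up as the "Breast" heading. Each start offset is a candidate place to put a sentence break.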
Your second example seems a lot harder, especially when there are valid number/period patterns at the end of a line, such as "Patient presented with fever of 102." or other measurements.

On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> I'm annotating some oncology notes from SHARP right now, and they are
> basically a nightmare for our current sentence segmentation model. Mainly
> because they eschew explicit markers between sentences. I thought I'd ping
> the list with some interesting examples just in case it stimulates ideas.
> But it seems to me that at some point we'll have to augment the opennlp
> module (preferable) or roll our own to handle cases like these.
>
> In this example a bunch of background is on one line with no punctuation
> between logical breaks:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear
> to auscultation CV: regular rate and rhythm without murmur or gallop , S1,
> S2 normal, no murmur, click, rub or gal*, chest is clear without rales or
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
> findings right/left breast with mild swelling, warmth, mild erythema,
> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the
> sections, so the first two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...
>
> but without any candidate characters to split the sentence I don't think
> it is possible.
>
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1
> with RN chemo teach 3. S U parent study
>
> Our model will break on the period after the number, so we'd probably get:
> 1.
> Baseline labwork including tumor markers 2.
> Start DD.... 3.
> S U parent study
>
> So the number is going in exactly the wrong place. Here it would be
> preferable to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study
>
> Anyways, just something to think about! The problem is much more complex
> in clinical data than in edited text, but I'm sure we all knew that already
> :)
>
> Tim
>
> ________________________________________
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Monday, July 28, 2014 2:38 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Yes, you're right about that Britt. I've been doing some annotations side
> by side with a treebank viewer and think I have a pretty good handle on the
> actual rules.
>
> Basically, if a header or list identifier is followed by a period or a
> newline it is considered a sentence break and otherwise it is part of the
> sentence.
>
> e.g.
>
> 1. 20 mg flomax
>
> is two sentences, while:
>
> 1 - 20 mg flomax
>
> is one sentence.
>
> For headings:
>
> Allergies: Pt is allergic to aspirin.
>
> is one sentence, while:
>
> Allergies:
> Pt is allergic to aspirin.
>
> is two sentences.
>
> I'm planning to follow these guidelines.
>
> Tim
>
> On 07/28/2014 01:53 PM, britt fitch wrote:
>
> Thanks for the document, Tim. It seems to not be explicit about how to
> handle sentences occurring in lists.
>
> Are you still considering having the list number as outside of the
> sentence?
>
> Thanks
>
> Britt
>
> On Jul 25, 2014, at 7:09 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
>
> Checking with Guergana and other colleagues here the advice is to have the
> sentence segmenter follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
>
> They are a bit light on detail but fortunately we have some treebanked
> data so I will use that for the training data and hopefully that will
> illuminate the tricky cases.
>
> Tim
>
> ________________________________________
> From: Masanz, James J. [masanz.ja...@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: question about sentence segmentation
>
> Sorry, I don't know if there was a reason.
>
> If you haven't checked with Guergana, you might want to ask her if she had
> a reason or if it was just the way it had been since that corpus was
> created.
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 3:34 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Thanks James, I was hoping to hear from you. I'll probably go ahead and
> change the data to split sentences between the list header and list
> element.
>
> You don't happen to know if there is any principled reason for the
> original style or whether it was just an arbitrary convention? The only
> thing I can think of is it might be hard to learn when to separate when
> there is no period after the list header (as in your examples). I think
> it's worth empirically checking on that point, but there might be other
> reasons that I'm not thinking of.
>
> Thanks
> Tim
>
> On 07/15/2014 03:27 PM, Masanz, James J. wrote:
>
> I don't have an opinion about how it should work.
>
> But I can verify that the clinical notes from Mayo Clinic that were used
> in the initial cTAKES sentence detector model had the list markers included
> in the first sentence, so, for example, the following would be two
> sentences, with each line a separate sentence.
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> -- James
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 6:04 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
>
> > My preference is to treat the list row number as outside of the sentence of
> > interest. Or if it is necessary to be included in a sentence, have it be a
> > sentence on its own.
>
> I can get behind this, I think it makes the issue a bit cleaner, to either
> have the list header as non-sentential or it's own sentence. As far as I
> can tell, this is not the current default behavior. At least in my runs the
> list header seems to get attached to the first following sentence, even in
> cases where it starts with a digit and a period ("3. Magnesium oxide 400 mg
> p.o. daily." is all one sentence).
> This behavior is probably strongly dependent on the annotations we give
> the sentence detector so as I'm prepping new training data I should have a
> default in mind.
>
> Does anyone have any objections to changing the sentence detector behavior
> to break list headers (things like "3." or "A " or "#5") as their own
> sentence?
>
> Tim
>
> ________________________________________
> From: Britt Fitch [britt.fi...@gmail.com]
> Sent: Monday, July 14, 2014 8:29 AM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> My preference is to treat the list row number as outside of the sentence of
> interest.
> Or if it is necessary to be included in a sentence, have it be a sentence
> on its own.
> That won't be as straightforward as splitting on a period in cases
> like "2. Magnesium oxide 400 mg p.o. daily."
> In cases where there are more than 1 written sentence like your example in
> the original email, I'd prefer those were each a sentence rather than
> making the entire list line a single sentence.
> My feeling is that each line without terminating punctuation would be a
> single sentence and would exclude the list number.
>
> As an aside, I have encountered several issues with numbered lists being
> interpreted differently depending on
> 1. what number is included at the start
> for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
> oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
> line starting with "12. Magnesium" is identified as starting with chunks
> [O, O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
> appear to be correct)
> 2. whether there is a period at the end of a list
> for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
> chunker though which produces [O,O] in the first case and [B-VP, B-NP, O]
> in the second.
>
> Cheers,
>
> Britt
>
> On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
>
> Just curious about an edge case regarding headers/lists and wondering what
> people think the correct behavior and annotation are.
>
> In cases like this:
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> my intuition is that each whole line is one sentence. But then there are
> cases where the number may be followed by multiple sentences on one line.
> 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.
>
> For this example my intuition is not as clear. Should there be a break
> after the "1." or should the first sentence be "1. EGD as a complex
> procedure."? Again, my intuition leans towards the latter but it seems a
> bit odd since the "1." kind of distributes over all the following sentences
> (i.e. it's like a paragraph descriptor.)
>
> Does the period after the 1 matter? The number of sentences after the list
> header? The fact that it's all on one line? Anything else?
>
> Tim
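
For what it's worth, the treebank rule Tim describes in the quoted thread (a list identifier followed by a period, or a heading followed by a newline, is its own sentence; otherwise it stays attached) could be prototyped along these lines. This is only a sketch: the two patterns and the function name are my own assumptions, it only looks at the very start of the text, it does not split the remainder into further sentences, and it is not how the current cTAKES sentence detector works.

```python
import re

# Assumed pattern: a numeric list identifier (optionally "#"-prefixed)
# immediately followed by a period, then whitespace and more text.
LIST_BREAK = re.compile(r"^\s*(#?\d+\.)\s+(?=\S)")

# Assumed pattern: a heading ("Word:" or "Two Words:") followed by a newline.
HEADER_BREAK = re.compile(r"^\s*([A-Za-z][A-Za-z ]*:)[ \t]*\n\s*(?=\S)")


def split_leading_marker(text):
    """Split off a leading list identifier or heading when the rule calls for a break.

    Returns [marker, rest] if a break applies, otherwise [text] unchanged
    apart from surrounding whitespace.
    """
    for pattern in (LIST_BREAK, HEADER_BREAK):
        match = pattern.match(text)
        if match:
            return [match.group(1), text[match.end():].strip()]
    return [text.strip()]


if __name__ == "__main__":
    examples = [
        "1. 20 mg flomax",                          # marker + period: break
        "1 - 20 mg flomax",                         # no period: no break
        "Allergies: Pt is allergic to aspirin.",    # same line: no break
        "Allergies:\nPt is allergic to aspirin.",   # newline: break
    ]
    for example in examples:
        print(split_leading_marker(example))
```

On the four examples from Tim's message this gives two segments for "1. 20 mg flomax" and for the heading followed by a newline, and one segment for the other two.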