One method I use for finding headings is to look for a "term" followed by
either two or more whitespace characters, or a symbol (colon, comma, dash)
followed by one or more whitespace characters.
It's really naive, but it works well because the "term" comes from a
controlled set. That's not much help with your first example above unless
those section names can be predefined.
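
To make the rule concrete, here is a minimal sketch of it in Python. The term list, the function names, and the decision to split right before each matched term are all assumptions for illustration; this is not code from cTAKES or from my own pipeline.

```python
import re

# Hypothetical controlled set; a real one would be curated per site or note type.
SECTION_TERMS = ["PE", "Lymphnodes", "Lungs", "CV", "Breast", "Abdomen"]


def build_heading_pattern(terms):
    # Try longer terms first so a longer term wins over a shorter prefix.
    alternation = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    # The term must not continue a preceding word, and must be followed by
    # either 2+ whitespace characters, or one of : , - and then 1+ whitespace.
    return re.compile(
        r"(?<!\w)(?P<term>" + alternation + r")(?:\s{2,}|[:,\-]\s+)"
    )


def find_headings(text, terms=SECTION_TERMS):
    """Return (offset, term) for every heading candidate in text."""
    pattern = build_heading_pattern(terms)
    return [(m.start(), m.group("term")) for m in pattern.finditer(text)]


def split_on_headings(text, terms=SECTION_TERMS):
    """Split text so that each segment starts at a heading candidate."""
    starts = [pos for pos, _ in find_headings(text, terms)]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    bounds = starts + [len(text)]
    segments = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [s for s in segments if s]
```

On the start of your first example this yields "PE:" as its own segment, followed by "Lymphnodes: ...", "Lungs: ..." and "CV: ...", so nested headings like "PE: Lymphnodes:" would still need a merge step if you want them in one sentence. The second "Abdomen" in "Abdomen: Abdomen soft" is not treated as a heading because it is followed by a single space.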

Your second example seems a lot harder, especially when there are valid
number/period patterns at the end of a line, such as "Patient presented
with fever of 102." or other measurements.
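
One idea, offered only as a rough sketch and not something I have tested on real notes: accept a "number + period" as a list marker only if it is followed by more text and continues the sequence 1, 2, 3, ... The pattern, the two-digit limit, and the function name below are my own assumptions.

```python
import re

# Candidate marker: 1-2 digits and a period, not glued to a preceding word,
# digit, or period, and followed by whitespace plus more text.
MARKER = re.compile(r"(?<![\w.])(\d{1,2})\.(?=\s+\S)")


def split_numbered_items(text):
    """Return (marker, body) pairs; marker is "" for text outside any item."""
    markers = []
    expected = 1
    for m in MARKER.finditer(text):
        # Keep a candidate only if it is the next number in the sequence.
        if int(m.group(1)) == expected:
            markers.append(m)
            expected += 1
    if not markers:
        return [("", text.strip())]

    items = []
    preamble = text[:markers[0].start()].strip()
    if preamble:
        items.append(("", preamble))
    for cur, nxt in zip(markers, markers[1:] + [None]):
        end = nxt.start() if nxt is not None else len(text)
        items.append((cur.group(0), text[cur.end():end].strip()))
    return items
```

With this, "fever of 102." is left alone because nothing follows the period and "102" is not the next expected item number, while the "1. ... 2. ... 3. ..." line splits into marker/body pairs. It would still be fooled by something like "pain score of 1. Patient denies ...", so it is a filter on candidates, not a full solution.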


On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy <
timothy.mil...@childrens.harvard.edu> wrote:

> I'm annotating some oncology notes from SHARP right now, and they are
> basically a nightmare for our current sentence segmentation model. Mainly
> because they eschew explicit markers between sentences. I thought I'd ping
> the list with some interesting examples just in case it stimulates ideas.
> But it seems to me that at some point we'll have to augment the opennlp
> module (preferable) or roll our own to handle cases like these.
>
> In this example a bunch of background is on one line with no punctuation
> between logical breaks:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear
> to auscultation CV: regular rate and rhythm without murmur or gallop , S1,
> S2 normal, no murmur, click, rub or gal*, chest is clear without rales or
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
> findings right/left breast with mild swelling, warmth, mild erythema,
> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the
> sections, so the first two sentences would be:
>
> 1) PE: Lymphnodes...
> 2) Lungs: normal...
>
> but without any candidate characters to split the sentence I don't think
> it is possible.
>
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1
> with RN chemo teach  3. S U parent study
>
> Our model will break on the period after the number, so we'd probably get:
> 1.
> Baseline labwork including tumor markers 2.
> Start DD.... 3.
> S U parent study
>
> So the number is going in exactly the wrong place. Here it would be
> preferable to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study
>
> Anyways, just something to think about! The problem is much more complex
> in clinical data than in edited text, but I'm sure we all knew that already
> :)
>
> Tim
>
>
> ________________________________________
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Monday, July 28, 2014 2:38 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Yes, you're right about that Britt. I've been doing some annotations side
> by side with a treebank viewer and think I have a pretty good handle on the
> actual rules.
>
> Basically, if a header or list identifier is followed by a period or a
> newline it is considered a sentence break and otherwise it is part of the
> sentence.
>
> e.g.
>
> 1. 20 mg flomax
>
> is two sentences, while:
>
> 1 - 20 mg flomax
>
> is one sentence.
>
> For headings:
>
> Allergies: Pt is allergic to aspirin.
>
> is one sentence, while:
>
> Allergies:
> Pt is allergic to aspirin.
>
> is two sentences.
>
> I'm planning to follow these guidelines.
>
> Tim
>
> On 07/28/2014 01:53 PM, britt fitch wrote:
>
> Thanks for the document, Tim. It seems to not be explicit about how to
> handle sentences occurring in lists.
>
> Are you still considering having the list number as outside of the
> sentence?
>
> Thanks
>
> Britt
>
> On Jul 25, 2014, at 7:09 AM, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
>
>
>
> Checking with Guergana and other colleagues here the advice is to have the
> sentence segmenter follow the treebank guidelines for sentence segmentation:
> http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
>
> They are a bit light on detail but fortunately we have some treebanked
> data so I will use that for the training data and hopefully that will
> illuminate the tricky cases.
>
> Tim
>
> ________________________________________
> From: Masanz, James J. [masanz.ja...@mayo.edu]
> Sent: Tuesday, July 15, 2014 4:39 PM
> To: 'dev@ctakes.apache.org'
> Subject: RE: question about sentence segmentation
>
> Sorry, I don't know if there was a reason.
>
> If you haven't checked with Guergana, you might want to ask her if she had
> a reason or if it was just the way it had been since that corpus was
> created.
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 3:34 PM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> Thanks James, I was hoping to hear from you. I'll probably go ahead and
> change the data to split sentences between the list header and list
> element.
>
> You don't happen to know if there is any principled reason for the
> original style or whether it was just an arbitrary convention? The only
> thing I can think of is it might be hard to learn when to separate when
> there is no period after the list header (as in your examples). I think
> it's worth empirically checking on that point, but there might be other
> reasons that I'm not thinking of.
>
> Thanks
> Tim
>
> On 07/15/2014 03:27 PM, Masanz, James J. wrote:
>
>
> I don't have an opinion about how it should work.
>
> But I can verify that the clinical notes from Mayo Clinic that were used
> in the initial cTAKES sentence detector model had the list markers included
> in the first sentence, so, for example, the following would be two
> sentences, with each line a separate sentence.
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> -- James
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, July 15, 2014 6:04 AM
> To: dev@ctakes.apache.org
> Subject: RE: question about sentence segmentation
>
>
>
> > My preference is to treat the list row number as outside of the sentence
> > of interest. Or if it is necessary to be included in a sentence, have it
> > be a sentence on its own.
>
> I can get behind this, I think it makes the issue a bit cleaner, to either
> have the list header as non-sentential or its own sentence. As far as I
> can tell, this is not the current default behavior. At least in my runs the
> list header seems to get attached to the first following sentence, even in
> cases where it starts with a digit and a period ("3. Magnesium oxide 400 mg
> p.o. daily." is all one sentence).
> This behavior is probably strongly dependent on the annotations we give
> the sentence detector so as I'm prepping new training data I should have a
> default in mind.
>
> Does anyone have any objections to changing the sentence detector behavior
> to break list headers (things like "3." or "A " or "#5") as their own
> sentence?
>
> Tim
>
>
> ________________________________________
> From: Britt Fitch [britt.fi...@gmail.com]
> Sent: Monday, July 14, 2014 8:29 AM
> To: dev@ctakes.apache.org
> Subject: Re: question about sentence segmentation
>
> My preference is to treat the list row number as outside of the sentence of
> interest.
> Or if it is necessary to be included in a sentence, have it be a sentence
> on its own.
> That won't be as straightforward as splitting on a period in cases like
> "2. Magnesium oxide 400 mg p.o. daily."
> In cases where there is more than one written sentence, like your example
> in the original email, I'd prefer those were each a sentence rather than
> making the entire list line a single sentence.
> My feeling is that each line without terminating punctuation would be a
> single sentence and would exclude the list number.
>
> As an aside, I have encountered several issues with numbered lists being
> interpreted differently depending on:
> 1. what number is included at the start,
> for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
> oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
> line starting with "12. Magnesium" is identified as starting with chunks
> [O, O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
> appear to be correct.)
> 2. whether there is a period at the end of a list,
> for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
> chunker though, which produces [O, O] in the first case and
> [B-VP, B-NP, O] in the second.)
>
> Cheers,
>
> Britt
>
>
>
> On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
>
>
>
> Just curious about an edge case regarding headers/lists and wondering what
> people think the correct behavior and annotation are.
>
> In cases like this:
>
> #1 Dilated esophagus.
> #2 Adenocarcinoma
>
> my intuition is that each whole line is one sentence. But then there are
> cases where the number may be followed by multiple sentences on one line.
> 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.
>
> For this example my intuition is not as clear. Should there be a break
> after the "1." or should the first sentence be "1. EGD as a complex
> procedure."? Again, my intuition leans towards the latter but it seems a
> bit odd since the "1." kind of distributes over all the following sentences
> (i.e. it's like a paragraph descriptor.)
>
> Does the period after the 1 matter? The number of sentences after the list
> header? The fact that it's all on one line? Anything else?
>
> Tim