Re: question about sentence segmentation

2014-08-04 Thread Miller, Timothy
Very pleased to see so many people offer suggestions! Comparing some of
these different methods might make an interesting student project.

Sean:
 Just an fyi.  Does that make sense?  Haven't had my coffee ...
Makes perfect sense; the downside is that it requires some kind of
higher-level understanding during sentence segmentation to know what the
hierarchy is. You could imagine something that looks similar but has a
different logical structure. Long term, some big joint model that does
all things simultaneously is definitely something I'm interested in.

Steve:
 Seems like rather than specifying a set of candidate characters, we
 want to specify a candidate boundary regular expression.
This might be something that would be possible with minimal changes to
the model.


John:
 why not just split sentences with regexes off a small list of defined onc
 physical exam terms?
My preference for vanilla ctakes is always to do basic linguistic things
like tokenization and sentence segmentation without reference to
context-specific rules, just because it makes them less portable.
Obviously for specific use cases or applications (like what Britt is
probably doing) you will use whatever information makes sense for your
domain. But I think we could get maybe 75% of the remaining cases (which
are probably only 5% of the total # of cases) by using smarter boundary
conditions like Steve suggested.

Thanks again,
Tim


On 08/02/2014 01:26 PM, John Green wrote:
 I was thinking the same thing as Steve. That's a pretty regular onc physical 
 exam, why not just split sentences with regexes off a small list of defined 
 onc physical exam terms? The interesting case would be breast, as this term 
 may appear in the body of a sentence (rather than just as a term), but you 
 could use a regex sub-match where you conditionally match breast first, then 
 one or more key physical findings, to correctly identify THAT breast word 
 token as the term, e.g. at the beginning of the sentence. I would recommend 
 red flag physical findings as they are more likely to always be in the body 
 of the sentence, for example, Breast: no lumps or masses palpable.


 I have a few other ideas if thats barking up the right tree.




 JG

 On Sat, Aug 2, 2014 at 8:58 AM, Steven Bethard steven.beth...@gmail.com
 wrote:

 On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
 timothy.mil...@childrens.harvard.edu wrote:
 PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear 
 to auscultation CV: regular rate and rhythm without murmur or gallop , S1, 
 S2 normal, no murmur, click, rub or gal*, chest is clear without rales or 
 wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative 
 findings right/left breast with mild swelling, warmth, mild erythema, 
 slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.

 It would be preferable to me to put sentence breaks in between the 
 sections, so the first two sentences would be:

 1) PE: Lymphnodes...
 2) Lungs: normal...
 [snip]
 Another example that breaks our model in a different way (truncated):
 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 
 with RN chemo teach  3. S U parent study
 [snip]
 Here it would be preferable to get:
 1.
 Baseline labwork...
 2.
 Start DD...
 3.
 S U parent study
 Seems like rather than specifying a set of candidate characters, we
 want to specify a candidate boundary regular expression. Something
 like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
 sentence boundaries may appear at punctuation marks, at uppercase
 letters after word boundaries, and at numbers after word boundaries.
 Steve



Re: question about sentence segmentation

2014-08-02 Thread Britt Fitch
One method I use for finding headings is a term followed by either 2 or
more whitespace characters, or by a symbol (colon, comma, dash) followed by
1 or more whitespace characters.
It's really naive but works well because the term is from a controlled
set. That's not super helpful in your first example above unless those
sections can be predefined.
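
A minimal sketch of the heading heuristic described above, in Python. The term list is a hypothetical controlled set (borrowed from the section names in Tim's example), and the function names are made up for illustration; none of this is cTAKES code.

```python
import re

# Hypothetical controlled set of section terms, for illustration only.
HEADING_TERMS = ["PE", "Lymphnodes", "Lungs", "CV", "Breast", "Abdomen"]

def heading_pattern(terms):
    # Longest terms first so a longer term wins over a shorter prefix.
    alternation = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    # Term, then either 2+ whitespace characters, or one of ":", ",", "-"
    # followed by 1+ whitespace characters.
    return re.compile(r"\b(" + alternation + r")(?:\s{2,}|[:,-]\s+)")

def find_headings(text, terms=HEADING_TERMS):
    """Return (offset, term) pairs for every heading-like match."""
    return [(m.start(), m.group(1)) for m in heading_pattern(terms).finditer(text)]
```

Because the match is case-sensitive and needs the trailing symbol or double space, "right/left breast with" is not picked up as a heading. A body mention such as "pain in the Abdomen, mild" still matches, which is the naive part.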

Your second example seems a lot harder, especially when there are valid
number/period patterns at the end of the line, e.g. "Patient presented with
fever of 102." or other measurements.


On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:

 I'm annotating some oncology notes from SHARP right now, and they are
 basically a nightmare for our current sentence segmentation model. Mainly
 because they eschew explicit markers between sentences. I thought I'd ping
 the list with some interesting examples just in case it stimulates ideas.
 But it seems to me that at some point we'll have to augment the opennlp
 module (preferable) or roll our own to handle cases like these.

 In this example a bunch of background is on one line with no punctuation
 between logical breaks:
 PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear
 to auscultation CV: regular rate and rhythm without murmur or gallop , S1,
 S2 normal, no murmur, click, rub or gal*, chest is clear without rales or
 wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
 findings right/left breast with mild swelling, warmth, mild erythema,
 slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.

 It would be preferable to me to put sentence breaks in between the
 sections, so the first two sentences would be:

 1) PE: Lymphnodes...
 2) Lungs: normal...

 but without any candidate characters to split the sentence I don't think
 it is possible.

 Another example that breaks our model in a different way (truncated):
 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1
 with RN chemo teach  3. S U parent study

 Our model will break on the period after the number, so we'd probably get:
 1.
 Baseline labwork including tumor markers 2.
 Start DD 3.
 S U parent study

 So the number is going in exactly the wrong place. Here it would be
 preferable to get:
 1.
 Baseline labwork...
 2.
 Start DD...
 3.
 S U parent study

 Anyways, just something to think about! The problem is much more complex
 in clinical data than in edited text, but I'm sure we all knew that already
 :)

 Tim
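
To make the desired output for the numbered-list example concrete, here is a rough Python sketch of a rule that pulls each list marker out as its own segment. It is a hand-written rule, not how the OpenNLP model behaves, and the marker shape (one or two digits plus a period, with whitespace on both sides) is an assumption chosen for this example.

```python
import re

# Assumed list-marker shape: 1-2 digits and a period, at the start of the
# text or after whitespace, and followed by whitespace.
MARKER = re.compile(r"(?:(?<=\s)|^)(\d{1,2}\.)(?=\s)")

def split_numbered_items(text):
    """Split so that each list marker and each item body is its own segment."""
    # The capturing group makes re.split keep the markers in the result.
    parts = MARKER.split(text)
    return [p.strip() for p in parts if p.strip()]
```

The two-digit limit keeps "fever of 102." in one piece, but a measurement such as "fever of 99." followed by more text would still be split, so the hard case raised earlier in the thread remains.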


 

RE: question about sentence segmentation

2014-08-02 Thread Finan, Sean
Hi Tim,

 It would be preferable to me to put sentence breaks in between the sections,
 so the first two sentences would be:

 1) PE: Lymphnodes...
 2) Lungs: normal...

The punctuation is (always) after the logical break, being "Term:" for a 
Term:Definition list.  I think that the first three sentences should be
1) PE:
2) Lymphnodes: neck and ...
3) Lungs: normal and ...
where the first line is an overarching "Term:" sentence (tree root), because each 
Term:Definition line that follows is within the physical exam.

Just an fyi.  Does that make sense?  Haven't had my coffee ...
Sean
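
A small Python sketch of the split described above, under the assumption that a "Term:" is any capitalized alphabetic token directly followed by a colon. The pattern and the function name are illustrative only. The result is a flat list; it does not build the tree, so knowing that "PE:" is the root of what follows still needs the higher-level understanding of the note.

```python
import re

# Assumption: a "Term:" is a capitalized alphabetic token followed by a colon.
TERM = re.compile(r"\b[A-Z][A-Za-z]*:")

def split_at_terms(text):
    """Start a new sentence at every "Term:" occurrence."""
    starts = [m.start() for m in TERM.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first term
    bounds = starts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]
```

"PE:" comes out as its own sentence only because it is immediately followed by another term.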


Re: question about sentence segmentation

2014-08-02 Thread Steven Bethard
On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
timothy.mil...@childrens.harvard.edu wrote:
 PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to 
 auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2 
 normal, no murmur, click, rub or gal*, chest is clear without rales or 
 wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative 
 findings right/left breast with mild swelling, warmth, mild erythema, 
 slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.

 It would be preferable to me to put sentence breaks in between the sections, 
 so the first two sentences would be:

 1) PE: Lymphnodes...
 2) Lungs: normal...
[snip]
 Another example that breaks our model in a different way (truncated):
 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 
 with RN chemo teach  3. S U parent study
[snip]
 Here it would be preferable to get:
 1.
 Baseline labwork...
 2.
 Start DD...
 3.
 S U parent study

Seems like rather than specifying a set of candidate characters, we
want to specify a candidate boundary regular expression. Something
like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
sentence boundaries may appear at punctuation marks, at uppercase
letters after word boundaries, and at numbers after word boundaries.

Steve
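
A rough Python sketch of what the candidate-boundary expression would select. Python's standard `re` module does not support `\p{...}` classes, so this walks the characters and uses `unicodedata` categories instead, and it approximates `\b` as "the previous character is not alphanumeric or underscore". The function names are made up, and the offsets are only candidate positions that a classifier would still have to accept or reject.

```python
import unicodedata

def is_word_char(ch):
    # Approximation of the regex \w class.
    return ch.isalnum() or ch == "_"

def candidate_boundaries(text):
    """Offsets matching, approximately, \\p{P}|\\b\\p{Lu}|\\b\\p{N}."""
    offsets = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        at_word_start = i == 0 or not is_word_char(text[i - 1])
        if cat.startswith("P"):
            offsets.append(i)  # any punctuation mark
        elif at_word_start and (cat == "Lu" or cat.startswith("N")):
            offsets.append(i)  # uppercase letter or number starting a token
    return offsets
```

On "PE: Lymphnodes: neck Lungs: normal" this proposes the colons and the starts of "PE", "Lymphnodes" and "Lungs", but nothing inside "neck" or "normal".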


Re: question about sentence segmentation

2014-07-28 Thread britt fitch
Thanks for the document, Tim. It seems to not be explicit about how to handle 
sentences occurring in lists. 

Are you still considering having the list number as outside of the sentence? 

Thanks

Britt

On Jul 25, 2014, at 7:09 AM, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:

 Checking with Guergana and other colleagues here the advice is to have the 
 sentence segmenter follow the treebank guidelines for sentence segmentation:
 http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
 
 They are a bit light on detail but fortunately we have some treebanked data 
 so I will use that for the training data and hopefully that will illuminate 
 the tricky cases.
 
 Tim
 
 

Re: question about sentence segmentation

2014-07-28 Thread Miller, Timothy
Yes, you're right about that Britt. I've been doing some annotations side by 
side with a treebank viewer and think I have a pretty good handle on the actual 
rules.

Basically, if a header or list identifier is followed by a period or a newline 
it is considered a sentence break and otherwise it is part of the sentence.

e.g.

1. 20 mg flomax

is two sentences, while:

1 - 20 mg flomax

is one sentence.

For headings:

Allergies: Pt is allergic to aspirin.

is one sentence, while:

Allergies:
Pt is allergic to aspirin.

is two sentences.

I'm planning to follow these guidelines.

Tim
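
The rule above can be sketched in Python as follows. The identifier shapes (digits with an optional "#", or a run of letters ending in a colon) are assumptions for illustration, and the function only decides whether to break after the leading header; it does not segment the rest of the text.

```python
import re

# Break after a leading header/list identifier only when it is followed by
# a period or a newline. Identifier shapes are assumed, not from the guidelines.
BREAK_AFTER = re.compile(
    r"^(?:"
    r"#?\d+(?:\.(?=\s|$)|[ \t]*\n)"   # list id + period, or list id + newline
    r"|[A-Za-z][A-Za-z ]*:[ \t]*\n"   # heading + newline
    r")"
)

def split_header(text):
    """Return [header, rest] if the header is its own sentence, else [text]."""
    m = BREAK_AFTER.match(text)
    if m is None:
        return [text.strip()]
    head, rest = text[:m.end()].strip(), text[m.end():].strip()
    return [head, rest] if rest else [head]
```

With this, "1. 20 mg flomax" gives two segments and "1 - 20 mg flomax" gives one, matching the examples above; the same holds for the two "Allergies:" cases.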

On 07/28/2014 01:53 PM, britt fitch wrote:

Thanks for the document, Tim. It seems to not be explicit about how to handle 
sentences occurring in lists.

Are you still considering having the list number as outside of the sentence?

Thanks

Britt


RE: question about sentence segmentation

2014-07-15 Thread Masanz, James J.
I don't have an opinion about how it should work.

But I can verify that the clinical notes from Mayo Clinic that were used in the 
initial cTAKES sentence detector model had the list markers included in the 
first sentence, so, for example, the following would be two sentences, with 
each line a separate sentence.

#1 Dilated esophagus.
#2 Adenocarcinoma

-- James

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, July 15, 2014 6:04 AM
To: dev@ctakes.apache.org
Subject: RE: question about sentence segmentation

 My preference is to treat the list row number as outside of the sentence of
 interest. Or if it is necessary to be included in a sentence, have it be a
 sentence on its own.

I can get behind this. I think it makes the issue a bit cleaner to either have 
the list header as non-sentential or its own sentence. As far as I can tell, 
this is not the current default behavior. At least in my runs the list header 
seems to get attached to the first following sentence, even in cases where it 
starts with a digit and a period ("3. Magnesium oxide 400 mg p.o. daily." is 
all one sentence).
This behavior is probably strongly dependent on the annotations we give the 
sentence detector, so as I'm prepping new training data I should have a default 
in mind.

Does anyone have any objections to changing the sentence detector behavior to 
break list headers (things like "3." or "A" or "#5") as their own sentence?

Tim



From: Britt Fitch [britt.fi...@gmail.com]
Sent: Monday, July 14, 2014 8:29 AM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

My preference is to treat the list row number as outside of the sentence of
interest.
Or if it is necessary to be included in a sentence, have it be a sentence
on its own.
That won't be as straightforward as splitting on a period in cases
like "2. Magnesium oxide 400 mg p.o. daily."
In cases where there is more than 1 written sentence, like your example in
the original email, I'd prefer those were each a sentence rather than
making the entire list line a single sentence.
My feeling is that each line without terminating punctuation would be a
single sentence and would exclude the list number.

As an aside, I have encountered several issues with numbered lists being
interpreted differently depending on
1. what number is included at the start
for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
line starting with "12. Magnesium" is identified as starting with chunks [O,
O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
appear to be correct.)
2. whether there is a period at the end of a list
for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
chunker though, which produces [O,O] in the first case and [B-VP, B-NP, O]
in the second.)

Cheers,

Britt



On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy 
timothy.mil...@childrens.harvard.edu wrote:

 Just curious about an edge case regarding headers/lists and wondering what
 people think the correct behavior and annotation are.

 In cases like this:

 #1 Dilated esophagus.
 #2 Adenocarcinoma

 my intuition is that each whole line is one sentence. But then there are
 cases where the number may be followed by multiple sentences on one line.
 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.

 For this example my intuition is not as clear. Should there be a break
 after the "1." or should the first sentence be "1. EGD as a complex
 procedure."? Again, my intuition leans towards the latter but it seems a
 bit odd since the "1." kind of distributes over all the following sentences
 (i.e. it's like a paragraph descriptor).

 Does the period after the 1 matter? The number of sentences after the list
 header? The fact that it's all on one line? Anything else?

 Tim



Re: question about sentence segmentation

2014-07-15 Thread Miller, Timothy
Thanks James, I was hoping to hear from you. I'll probably go ahead and
change the data to split sentences between the list header and list element.

You don't happen to know if there is any principled reason for the
original style or whether it was just an arbitrary convention? The only
thing I can think of is it might be hard to learn when to separate when
there is no period after the list header (as in your examples). I think
it's worth empirically checking on that point, but there might be other
reasons that I'm not thinking of.

Thanks
Tim

On 07/15/2014 03:27 PM, Masanz, James J. wrote:
 I don't have an opinion about how it should work.

 But I can verify that the clinical notes from Mayo Clinic that were used in 
 the initial cTAKES sentence detector model had the list markers included in 
 the first sentence, so, for example, the following would be two sentences, 
 with each line a separate sentence.

 #1 Dilated esophagus.
 #2 Adenocarcinoma

 -- James

 -Original Message-
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
 Sent: Tuesday, July 15, 2014 6:04 AM
 To: dev@ctakes.apache.org
 Subject: RE: question about sentence segmentation

 My preference is to treat the list row number as outside of the sentence of
 interest. Or if it is necessary to be included in a sentence, have it be a 
 sentence
 on its own.

 I can get behind this, I think it makes the issue a bit cleaner, to either 
 have the list header as non-sentential or it's own sentence. As far as I can 
 tell, this is not the current default behavior. At least in my runs the list 
 header seems to get attached to the first following sentence, even in cases 
 where it starts with a digit and a period (3. Magnesium oxide 400 mg p.o. 
 daily. is all one sentence).
 This behavior is probably strongly dependent on the annotations we give the 
 sentence detector so as I'm prepping new training data I should have a 
 default in mind.

 Does anyone have any objections to changing the sentence detector behavior to 
 break list headers (things like 3. or A  or #5) as their own sentence?

 Tim


 
 From: Britt Fitch [britt.fi...@gmail.com]
 Sent: Monday, July 14, 2014 8:29 AM
 To: dev@ctakes.apache.org
 Subject: Re: question about sentence segmentation

 My preference is to treat the list row number as outside of the sentence of
 interest.
 Or if it is necessary to be included in a sentence, have it be a sentence
 on its own.
 That won't be as straightforward as splitting on a period in cases
 like 2. Magnesium
 oxide 400 mg p.o. daily.
 In cases where there are more than 1 written sentence like your example in
 the original email, I'd prefer those were each a sentence rather than
 making the entire list line a single sentence.
 My feeling is that each line without terminating punctuation would be a
 single sentence and would exclude the list number.

 As an aside, I have encountered several issues with numbered lists being
 interpreted differently depending on:
 1. what number is included at the start
 for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
 oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
 line starting with "12. Magnesium" is identified as starting with chunks [O,
 O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
 appear to be correct.)
 2. whether there is a period at the end of a list
 for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
 chunker, though, which produces [O, O] in the first case and [B-VP, B-NP, O]
 in the second.)

 Cheers,

 Britt



 On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy 
 timothy.mil...@childrens.harvard.edu wrote:

 Just curious about an edge case regarding headers/lists and wondering what
 people think the correct behavior and annotation are.

 In cases like this:

 #1 Dilated esophagus.
 #2 Adenocarcinoma

 my intuition is that each whole line is one sentence. But then there are
 cases where the number may be followed by multiple sentences on one line.
 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.

 For this example my intuition is not as clear. Should there be a break
 after the "1." or should the first sentence be "1. EGD as a complex
 procedure."? Again, my intuition leans towards the latter but it seems a
 bit odd since the "1." kind of distributes over all the following sentences
 (i.e. it's like a paragraph descriptor).

 Does the period after the 1 matter? The number of sentences after the list
 header? The fact that it's all on one line? Anything else?

 Tim




RE: question about sentence segmentation

2014-07-15 Thread Masanz, James J.
Sorry, I don't know if there was a reason.

If you haven't checked with Guergana, you might want to ask her if she had a 
reason or if it was just the way it had been since that corpus was created.

-----Original Message-----
From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
Sent: Tuesday, July 15, 2014 3:34 PM
To: dev@ctakes.apache.org
Subject: Re: question about sentence segmentation

Thanks James, I was hoping to hear from you. I'll probably go ahead and
change the data to split sentences between the list header and list element.

You don't happen to know if there is any principled reason for the
original style or whether it was just an arbitrary convention? The only
thing I can think of is it might be hard to learn when to separate when
there is no period after the list header (as in your examples). I think
it's worth empirically checking on that point, but there might be other
reasons that I'm not thinking of.

Thanks
Tim
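
For illustration only, here is a minimal sketch of the split discussed above,
where the list header becomes its own segment and the remaining text on the
line is broken into sentences. It is not cTAKES code and not the trained
sentence detector; the names LIST_HEADER, SENTENCE_BREAK and segment_line are
hypothetical, and the patterns are assumptions that only cover headers like
"#1" and "3." plus a naive period-then-capital boundary.

```python
import re

# Assumed header shapes: "#5", "3.", "12." at the start of a line.
LIST_HEADER = re.compile(r"^\s*(#\d+|\d+\.)\s+")

# Naive boundary: sentence-final punctuation, whitespace, then an uppercase
# letter. Abbreviations such as "p.o. daily" are not split because the next
# word is lowercase.
SENTENCE_BREAK = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def segment_line(line):
    """Return the list header (if any) followed by the sentences on the line."""
    segments = []
    match = LIST_HEADER.match(line)
    if match:
        # The header is emitted as its own segment, with or without a period.
        segments.append(match.group(1))
        line = line[match.end():]
    segments.extend(s for s in SENTENCE_BREAK.split(line.strip()) if s)
    return segments


if __name__ == "__main__":
    for text in [
        "#1 Dilated esophagus.",
        "#2 Adenocarcinoma",
        "1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.",
        "3. Magnesium oxide 400 mg p.o. daily.",
    ]:
        print(segment_line(text))
```

A rule like this handles the header with no period ("#2 Adenocarcinoma")
trivially, which is the case that may be harder for a learned model, so it
could serve as a baseline when checking that point empirically.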

On 07/15/2014 03:27 PM, Masanz, James J. wrote:
 I don't have an opinion about how it should work.

 But I can verify that the clinical notes from Mayo Clinic that were used in 
 the initial cTAKES sentence detector model had the list markers included in 
 the first sentence, so, for example, the following would be two sentences, 
 with each line a separate sentence.

 #1 Dilated esophagus.
 #2 Adenocarcinoma

 -- James

 -----Original Message-----
 From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
 Sent: Tuesday, July 15, 2014 6:04 AM
 To: dev@ctakes.apache.org
 Subject: RE: question about sentence segmentation

 My preference is to treat the list row number as outside of the sentence of
 interest. Or if it is necessary to be included in a sentence, have it be a
 sentence on its own.

 I can get behind this. I think it makes the issue a bit cleaner to either 
 have the list header as non-sentential or its own sentence. As far as I can 
 tell, this is not the current default behavior. At least in my runs the list 
 header seems to get attached to the first following sentence, even in cases 
 where it starts with a digit and a period ("3. Magnesium oxide 400 mg p.o. 
 daily." is all one sentence).
 This behavior is probably strongly dependent on the annotations we give the 
 sentence detector, so as I'm prepping new training data I should have a 
 default in mind.

 Does anyone have any objections to changing the sentence detector behavior to 
 break list headers (things like 3. or A  or #5) as their own sentence?

 Tim


 
 From: Britt Fitch [britt.fi...@gmail.com]
 Sent: Monday, July 14, 2014 8:29 AM
 To: dev@ctakes.apache.org
 Subject: Re: question about sentence segmentation

 My preference is to treat the list row number as outside of the sentence of
 interest.
 Or if it is necessary to be included in a sentence, have it be a sentence
 on its own.
 That won't be as straightforward as splitting on a period in cases like
 "2. Magnesium oxide 400 mg p.o. daily."
 In cases where there is more than one written sentence, like your example in
 the original email, I'd prefer those were each a sentence rather than
 making the entire list line a single sentence.
 My feeling is that each line without terminating punctuation would be a
 single sentence and would exclude the list number.

 As an aside, I have encountered several issues with numbered lists being
 interpreted differently depending on:
 1. what number is included at the start
 for example: "2. Magnesium oxide 400 mg p.o. daily." vs "12. Magnesium
 oxide 400 mg p.o. daily." (This appears to be a chunking issue where the
 line starting with "12. Magnesium" is identified as starting with chunks [O,
 O, B-NP, B-NP, I-NP, B-NP, B-ADVP, O] even though the parts of speech
 appear to be correct.)
 2. whether there is a period at the end of a list
 for example: "4. CHF" vs "4. CHF." (This appears to be an issue with the
 chunker, though, which produces [O, O] in the first case and [B-VP, B-NP, O]
 in the second.)

 Cheers,

 Britt



 On Mon, Jul 14, 2014 at 7:50 AM, Miller, Timothy 
 timothy.mil...@childrens.harvard.edu wrote:

 Just curious about an edge case regarding headers/lists and wondering what
 people think the correct behavior and annotation are.

 In cases like this:

 #1 Dilated esophagus.
 #2 Adenocarcinoma

 my intuition is that each whole line is one sentence. But then there are
 cases where the number may be followed by multiple sentences on one line.
 1. EGD as a complex procedure. If there is an abnormality, obtain biopsies.

 For this example my intuition is not as clear. Should there be a break
 after the "1." or should the first sentence be "1. EGD as a complex
 procedure."? Again, my intuition leans towards the latter but it seems a
 bit odd since the "1." kind of distributes over all the following sentences
 (i.e. it's like a paragraph descriptor).

 Does the period after the 1 matter? The number of sentences after the list
 header? The fact that it's all