Re: question about sentence segmentation

Steven Bethard Sat, 02 Aug 2014 05:59:15 -0700

On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
<timothy.mil...@childrens.harvard.edu> wrote:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to 
> auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2 
> normal, no murmur, click, rub or gal*, chest is clear without rales or 
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative 
> findings right/left breast with mild swelling, warmth, mild erythema, 
> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the sections, 
> so the first two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...
[snip]
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers  2. Start DD AC on Friday 8/1 
> with RN chemo teach  3. S U parent study
[snip]
> Here it would be preferable to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study


Seems like rather than specifying a set of "candidate characters", we
want to specify a candidate boundary regular expression. Something
like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
sentence boundaries may appear at punctuation marks, at uppercase
letters after word boundaries, and at numbers after word boundaries.
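
As a rough illustration of that idea, here is a minimal sketch using plain java.util.regex. The class and method names (CandidateBoundaries, candidateOffsets) are made up for this example and are not part of any existing sentence detector. It only lists the offsets where the expression matches; deciding which candidates are real sentence boundaries would still be left to the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class CandidateBoundaries {
    // Punctuation anywhere, or an uppercase letter or a number
    // that directly follows a word boundary.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\p{P}|\\b\\p{Lu}|\\b\\p{N}");

    // Returns the character offsets at which the candidate expression matches.
    static List<Integer> candidateOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Candidates: 'L' in "Lungs", ':', '2', '.', and 'S' in "Start".
        System.out.println(candidateOffsets("clear Lungs: ok 2. Start"));
        // prints [6, 11, 16, 17, 19]
    }
}
```

Note that in an all-caps token such as "JVD" only the first letter is a candidate, since the later letters do not follow a word boundary.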

Steve
