On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear to
> auscultation CV: regular rate and rhythm without murmur or gallop , S1, S2
> normal, no murmur, click, rub or gal*, chest is clear without rales or
> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
> findings right/left breast with mild swelling, warmth, mild erythema,
> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>
> It would be preferable to me to put sentence breaks in between the sections,
> so the first two sentences would be:
>
> 1) PE: Lymphonodes...
> 2) Lungs: normal...
[snip]
> Another example that breaks our model in a different way (truncated):
> 1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1
> with RN chemo teach 3. S U parent study
[snip]
> Here it would be preferable to get:
> 1.
> Baseline labwork...
> 2.
> Start DD...
> 3.
> S U parent study

Seems like rather than specifying a set of "candidate characters", we want to specify a candidate boundary regular expression. Something like \p{P}|\b\p{Lu}|\b\p{N} should cover all of the above cases: sentence boundaries may appear at punctuation marks, at uppercase letters that follow a word boundary, and at numbers that follow a word boundary.

Steve
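
A minimal sketch of how the candidate boundary expression could be applied, assuming java.util.regex (which supports the \p{P}, \p{Lu} and \p{N} classes). The class name BoundaryCandidates and the method candidateOffsets are hypothetical, made up for illustration, and are not part of any existing sentence detector. The sketch only enumerates candidate offsets; deciding which candidates are real sentence boundaries would still be the model's job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class BoundaryCandidates {

    // Punctuation, or an uppercase letter or a number that directly
    // follows a word boundary.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\p{P}|\\b\\p{Lu}|\\b\\p{N}");

    // Returns the character offset of every candidate boundary in text.
    static List<Integer> candidateOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }
}
```

For example, candidateOffsets("ok. Go 2") would return the offsets of ".", "G" and "2", so the section headers ("Lungs:", "CV:") and the list numbers ("2.", "3.") in the examples above would all show up as candidates.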