The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn't.
The training data used for the recently rebuilt model contains only 7 lines that end with an apostrophe (single quote). It has >100K occurrences of 's and >175K occurrences of the ' character in all.

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test. The word "Broca's" used to have a ContractionToken, but since a sentence now ends on the apostrophe, the apostrophe is getting annotated as a PunctuationToken.

Since I don't see anything obviously wrong with the training data, I'm pondering the idea of a rule that would run after the sentence detector model and rejoin any sentence split that occurs at an ' when it is immediately followed by any letter (not just an s) and preceded by any non-whitespace character.

Some examples that currently split wrong, using a vertical bar to show where the sentence detector splits them:

  The patient also was concerned about a small lesion in his Broca'|s area|
  Broca'|s|
  Isn'|t|
  The pain isn'|t preventing Don'|s daily walks.|

Some examples that currently split correctly:

  The aspirin isn't stopping Don's pain.|

Anyone have any other suggestions?

-- James
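
For discussion, here is a minimal sketch of the rejoin rule described above, written in Python rather than against the cTAKES type system. It assumes sentences are available as (begin, end) character offsets into the document text; the function names `merge_apostrophe_splits` and `spans_from_bars` are hypothetical and exist only for this illustration. A real implementation would operate on the sentence annotations produced by the detector.

```python
def merge_apostrophe_splits(text, spans):
    """Rejoin adjacent sentence spans that were split right after an apostrophe.

    spans: list of (begin, end) character offsets (end exclusive), in document order.
    Hypothetical helper, not part of cTAKES.
    """
    merged = []
    for begin, end in spans:
        if merged:
            prev_begin, prev_end = merged[-1]
            if (
                prev_end == begin                       # no gap between the two sentences
                and prev_end >= 2                       # there is a character before the apostrophe
                and text[prev_end - 1] == "'"           # previous sentence ends on an apostrophe
                and not text[prev_end - 2].isspace()    # apostrophe is preceded by non-whitespace
                and begin < end
                and text[begin].isalpha()               # apostrophe is followed by a letter
            ):
                # Extend the previous sentence to cover this one.
                merged[-1] = (prev_begin, end)
                continue
        merged.append((begin, end))
    return merged


def spans_from_bars(marked):
    """Turn a bar-marked example such as "Broca'|s|" into (text, spans).

    Each "|" marks the end of a detected sentence. Leading whitespace of a
    piece is left outside its span. Hypothetical helper for the examples only.
    """
    text = marked.replace("|", "")
    spans, pos = [], 0
    for piece in marked.split("|"):
        begin = pos + (len(piece) - len(piece.lstrip()))
        end = pos + len(piece)
        if piece.strip():
            spans.append((begin, end))
        pos = end
    return text, spans


if __name__ == "__main__":
    text, spans = spans_from_bars("The pain isn'|t preventing Don'|s daily walks.|")
    for begin, end in merge_apostrophe_splits(text, spans):
        print(repr(text[begin:end]))
```

Because the rule requires the next sentence to start immediately after the apostrophe, a sentence that legitimately ends in a closing single quote and is followed by whitespace is left alone, as is an opening quote preceded by a space.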
