apostrophe and sentence detector

Masanz, James J. Mon, 26 Aug 2013 09:06:42 -0700

The recently rebuilt sentence detector (currently in trunk and the 3.1.0 
branch) is sometimes taking the apostrophe as a sentence break where the 
ctakes-3.0.0-incubating model didn't.


The training data used for the recently rebuilt model contains only 7 
lines that end with an apostrophe (single quote).

It has >100K occurrences of 's

It has >175K occurrences of the ' character in all.

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

The word "Broca's" used to have a ContractionToken but since a sentence is now 
ending on the apostrophe, the apostrophe is getting annotated as a 
PunctuationToken.

Since I don't see anything obviously wrong with the training data, I'm 
pondering the idea of a rule that would run after the sentence detector 
model and rejoin any sentence split that occurs at an ' when the ' is 
immediately followed by any letter (not just an s) and preceded by any 
non-whitespace character.
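
To make the proposed rule concrete, here is a rough sketch in Python (not the cTAKES code itself, which is Java). It assumes the sentence detector's output is available as a list of (begin, end) character offsets into the document text; the function names rejoin_apostrophe_splits and _is_bad_split are made up for illustration.

```python
def _is_bad_split(text, prev_end, next_begin):
    """True if a sentence boundary falls right after an apostrophe that is
    preceded by a non-whitespace character and followed by a letter."""
    return (
        prev_end == next_begin              # nothing between the two sentences
        and prev_end >= 2                   # room for "<non-space>'" before the split
        and text[prev_end - 1] == "'"       # previous sentence ends on an apostrophe
        and not text[prev_end - 2].isspace()
        and next_begin < len(text)
        and text[next_begin].isalpha()      # any letter, not just "s"
    )


def rejoin_apostrophe_splits(text, spans):
    """Merge adjacent sentence spans that were split at an apostrophe.

    `spans` is assumed to be a list of (begin, end) offsets in document
    order, with `end` exclusive.
    """
    merged = []
    for begin, end in spans:
        if merged and _is_bad_split(text, merged[-1][1], begin):
            # Extend the previous sentence instead of starting a new one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((begin, end))
    return merged
```

Because each span is compared against the already-merged previous span, a sentence that was split at more than one apostrophe collapses back into a single span in one pass.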

Some examples that currently split incorrectly, using a vertical bar to show 
where the sentence detector splits them:
The patient also was concerned about a small lesion in his Broca'|s area|
Broca'|s|
Isn'|t|
The pain isn'|t preventing Don'|s daily walks.|

An example that currently splits correctly:
The aspirin isn't stopping Don's pain.|

Anyone have any other suggestions?

-- James
