On 01/25/2014 03:03 PM, Miller, Timothy wrote:
I'm running into one issue: it gets tripped up on sentences with
line-ending spaces. I could easily remove them with a script, but by
default they are in there. It happens when a sentence example ends:
...BILAT HEMATOMAS. <LF>
(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root because when I fix this example to be .<LF>
it gets tripped up in another place instead (with the same error). The
specific error I get is this:
What probably happens here is that two sentences are detected: the
detector wants to split on the dot and again on the <LF>.
The sentence detector classifies every end-of-sentence (eos) character
as either a split or not a split. The user, on the other hand, expects
to get one span (with begin and end offset) per sentence, so the code
which computes the spans tries to trim the white space from each one.
Trimming the white space from a whitespace-only sentence is what causes
the exception you are seeing.
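To make the failure mode concrete, here is a minimal Python sketch of a span computation that trims white space and skips candidates that turn out to be whitespace-only. This is not the actual OpenNLP code; the function name sentence_spans and the split offsets are made up for the example, and skipping the empty candidate is only one of the possible answers to the questions below.

```python
def sentence_spans(text, split_points):
    """Return (begin, end) offsets of whitespace-trimmed sentences.

    split_points holds the offsets just after each end-of-sentence
    character that was classified as a split (hypothetical input).
    """
    spans = []
    start = 0
    for end in list(split_points) + [len(text)]:
        begin, stop = start, end
        # Trim leading and trailing white space by moving the offsets inward.
        while begin < stop and text[begin].isspace():
            begin += 1
        while stop > begin and text[stop - 1].isspace():
            stop -= 1
        # A whitespace-only candidate collapses to an empty span.
        # Here it is skipped instead of raising an error.
        if begin < stop:
            spans.append((begin, stop))
        start = end
    return spans


text = "BILAT HEMATOMAS.  \nNext sentence."
# Assumed split decisions: after the period (16) and after the line feed (19).
print(sentence_spans(text, [16, 19]))  # [(0, 16), (19, 33)]
```

With these inputs the two spaces and the line feed between offsets 16 and 19 form a whitespace-only candidate, which is dropped, so neither returned span includes the new line char.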
Which response would you expect from the sentence detector? Should a
whitespace-only sentence be returned?
If a sentence is terminated by a new line, should the new line char
be included in the sentence span or not?
Jörn