On 01/25/2014 03:03 PM, Miller, Timothy wrote:
I'm running into one issue: it gets tripped up on sentences with
line-ending spaces.  I could easily remove them with a script, but by
default they are in there. It happens when a sentence in the example ends like this:

...BILAT HEMATOMAS.  <LF>

(There is a period, then 2 spaces, then the line feed character.) I am
pretty sure this is the root cause, because when I fix this example to be .<LF>
it gets tripped up in another place instead (with the same error). The
specific error I get is this:


What probably happens here is that two sentences are detected: the detector wants to split on
the period and again on the <LF>.

The sentence detector classifies every end-of-sentence (eos) character as either a split or not a split. The user, on the other hand, expects to get one span (with begin and end offsets) per sentence. The code which computes
the spans tries to trim the white space from them.

Trimming the white space from a whitespace-only sentence is what causes the exception you are seeing.
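
To make the failure mode concrete, here is a minimal sketch. It is not the actual OpenNLP code; the class SpanTrimSketch and the methods trim and sentenceSpans are hypothetical names made up for illustration, and the split offsets are assumed to come from the eos classifier. It shows one possible way to trim a candidate span and simply drop it when nothing but white space is left, instead of failing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the OpenNLP implementation.
class SpanTrimSketch {

    // Trims leading and trailing white space from the candidate [begin, end).
    // Returns {begin, end} of the trimmed span, or null if the candidate
    // contains only white space (assumption: such candidates are dropped).
    static int[] trim(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return begin < end ? new int[] {begin, end} : null;
    }

    // Turns split offsets (the position right after each accepted eos char)
    // into trimmed sentence spans, skipping whitespace-only candidates.
    static List<int[]> sentenceSpans(CharSequence text, int[] splits) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        for (int split : splits) {
            int[] span = trim(text, begin, split);
            if (span != null) {
                spans.add(span);
            }
            begin = split;
        }
        // Whatever follows the last split is the final candidate.
        int[] last = trim(text, begin, text.length());
        if (last != null) {
            spans.add(last);
        }
        return spans;
    }

    public static void main(String[] args) {
        // Period, two spaces, line feed, then another sentence.
        String text = "BILAT HEMATOMAS.  \nNext sentence.";
        // Assumed classifier output: split after the period and after the <LF>.
        int[] splits = {16, 19};
        for (int[] span : sentenceSpans(text, splits)) {
            System.out.println("[" + span[0] + ", " + span[1] + ") "
                    + text.substring(span[0], span[1]));
        }
    }
}
```

With this input the candidate between the two splits, [16, 19), holds only two spaces and the line feed, so it is dropped and two spans remain: [0, 16) and [19, 33). Whether dropping it is the right behaviour is exactly the open question.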

Which response would you expect from the sentence detector? Should a whitespace-only sentence be returned?

When a sentence is terminated by a new line, should the new line character be included in the sentence span or not?

Jörn
