Hi! I'm currently working on a project in which we are trying to build a Chinese tokenizer using OpenNLP 1.5.3.
We've started by training and using the SentenceDetectorME with Chinese, but we can't get it to find the correct spans, and we are trying to figure out what we are doing wrong. For instance, with the input:

    婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。

we get the following two "sentences":

    婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下
    。

This happens regardless of whether we use the MAXENT or the PERCEPTRON algorithm. If we set useTokenEnd to *true*, the sentence detector finds only one sentence (i.e. it returns the input unchanged).

We are training the SentenceDetector model according to the documentation (https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training.api). To be clear, the training data is a text file with one sentence per line and an empty line between documents, about 20,000 lines in total.

Stepping through the SentenceDetectorME class in the debugger, I can see that the sentence-end candidates (enders) are found correctly, but all except the very last candidate are skipped by this if statement (lines 179-182):

    int fws = getFirstWS(s, cint + 1);
    if (i + 1 < end && enders.get(i + 1) < fws) {
      continue;
    }

Chinese does not use whitespace as a word delimiter, so getFirstWS runs to the end of the string, and the condition above makes the surrounding for loop skip straight to the last candidate.

Am I doing something wrong, or is this a bug?

Best regards,
Jens Östlund

--
Jens Östlund
[email protected]
+46 (0)76 882 84 32
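P.S. To make the skipping behavior concrete, here is a minimal self-contained sketch (not the actual OpenNLP source; WsSkipDemo and this simplified getFirstWS are stand-ins I wrote for illustration). Because the Chinese input contains no whitespace, getFirstWS always returns the string length, so the guard enders.get(i + 1) < fws holds for every candidate except the last:

```java
import java.util.ArrayList;
import java.util.List;

public class WsSkipDemo {
    // Simplified stand-in for SentenceDetectorME's getFirstWS:
    // index of the first whitespace char at or after pos, or s.length() if none.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。";

        // Collect candidate sentence enders: positions of every '。'.
        List<Integer> enders = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '。') {
                enders.add(i);
            }
        }

        int end = enders.size();
        for (int i = 0; i < end; i++) {
            int cint = enders.get(i);
            // No whitespace anywhere, so fws == s.length() for every candidate.
            int fws = getFirstWS(s, cint + 1);
            // The guard from SentenceDetectorME: true for all but the last ender.
            boolean skipped = i + 1 < end && enders.get(i + 1) < fws;
            System.out.println("ender at " + cint + " skipped=" + skipped);
        }
    }
}
```

Running this prints skipped=true for the first '。' and skipped=false only for the final one, which matches the two spans we observe ("everything up to the last 。" plus "。" alone).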
