Hi! I'm currently working on a project in which we are trying to build a Chinese tokenizer using OpenNLP 1.5.3.
We've started by training and using the SentenceDetectorME with Chinese, but we can't get it to find the correct spans, and we are trying to figure out what we are doing wrong. For instance, with the input:

    婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。

we get the following two "sentences":

    婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下
    。

This happens regardless of whether we use the MAXENT or the PERCEPTRON algorithm. If we set useTokenEnd to *true*, the sentence detector finds only one sentence (i.e. it returns the input unchanged).

We are training the SentenceDetector model according to the documentation (https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training.api). To be clear, the training data is a text file with one sentence per line and an empty line between documents, about 20,000 lines in total.

Stepping through the SentenceDetectorME class in the debugger, I can see that the sentence-end candidates (enders) are found correctly, but all except the very last candidate are skipped by this if statement (lines 179-182):

    int fws = getFirstWS(s, cint + 1);
    if (i + 1 < end && enders.get(i + 1) < fws) {
      continue;
    }

Chinese does not use whitespace as a word delimiter, so getFirstWS runs to the end of the string, and the condition above makes the surrounding for loop skip straight to the last candidate.

Am I doing something wrong, or is this a bug?

Best regards,
Jens Östlund

--
Jens Östlund
[email protected]
+46 (0)76 882 84 32
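P.S. To make the skipping behavior concrete, here is a minimal self-contained sketch (not the actual OpenNLP source; WsSkipDemo and this simplified getFirstWS are stand-ins I wrote for illustration). Because the Chinese input contains no whitespace, getFirstWS always returns the string length, so the guard enders.get(i + 1) < fws holds for every candidate except the last:

```java
import java.util.ArrayList;
import java.util.List;

public class WsSkipDemo {
    // Simplified stand-in for SentenceDetectorME's getFirstWS:
    // index of the first whitespace char at or after pos, or s.length() if none.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。";

        // Collect candidate sentence enders: positions of every '。'.
        List<Integer> enders = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '。') {
                enders.add(i);
            }
        }

        int end = enders.size();
        for (int i = 0; i < end; i++) {
            int cint = enders.get(i);
            // No whitespace anywhere, so fws == s.length() for every candidate.
            int fws = getFirstWS(s, cint + 1);
            // The guard from SentenceDetectorME: true for all but the last ender.
            boolean skipped = i + 1 < end && enders.get(i + 1) < fws;
            System.out.println("ender at " + cint + " skipped=" + skipped);
        }
    }
}
```

Running this prints skipped=true for the first '。' and skipped=false only for the final one, which matches the two spans we observe ("everything up to the last 。" plus "。" alone).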
