Hi Jens, I do see two sentences in your input, delimited by the dot :)

On Feb 17, 2015 9:03 AM, "Jens Östlund" <[email protected]> wrote:
> Hi!
>
> I'm currently in a project where we are trying to build a Chinese
> tokenizer using OpenNLP 1.5.3.
>
> We've started by trying to train and use the SentenceDetectorME with
> Chinese, but we can't get it to find the correct spans. We are trying to
> figure out what we are doing wrong.
>
> For instance, with the input:
> "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。"
> we are getting the following two sentences:
> "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下"
> "。"
>
> This happens regardless of whether we use the MAXENT or PERCEPTRON
> algorithm. If we set useTokenEnd to *true*, the sentence detector only
> finds one sentence (i.e. it returns the input back).
>
> We are training the SentenceDetector model according to the documentation (
> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training.api
> ).
> Just to be clear, the input data consists of a text file with one sentence
> per line, and an empty line between documents. The total number of rows is
> about 20000.
>
> When running the debugger on the SentenceDetectorME class, I noticed that
> the sentence-end candidates (enders) are found correctly, but all but the
> very last candidate are skipped due to this if statement (lines 179-182):
>
>     int fws = getFirstWS(s, cint + 1);
>     if (i + 1 < end && enders.get(i + 1) < fws) {
>       continue;
>     }
>
> Chinese does not use whitespace as a word delimiter, so the above statement
> makes the surrounding for loop jump to the end of the text. Am I doing
> something wrong, or is this a bug?
>
> Best regards,
>
> Jens Östlund
>
> --
> Jens Östlund
> [email protected]
> +46 (0)76 882 84 32
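For anyone following along, the skipping behavior you describe can be reproduced outside OpenNLP with a short, self-contained sketch. The getFirstWS below is a simplified stand-in modeled on the behavior you observed (first whitespace at or after a position, or the string length if there is none), not the library's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class EnderSkipDemo {

    // Simplified stand-in (an assumption) for SentenceDetectorME's getFirstWS:
    // index of the first whitespace at or after pos, or s.length() if none.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Replay the quoted guard over every "。" candidate in s and
    // return the indices of the candidates that survive it.
    static List<Integer> keptEnders(String s) {
        List<Integer> enders = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '。') {
                enders.add(i);
            }
        }
        int end = enders.size();
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < end; i++) {
            int cint = enders.get(i);
            int fws = getFirstWS(s, cint + 1);
            // With no whitespace anywhere in the text, fws == s.length(),
            // so every candidate except the very last one is skipped here.
            if (i + 1 < end && enders.get(i + 1) < fws) {
                continue;
            }
            kept.add(cint);
        }
        return kept;
    }

    public static void main(String[] args) {
        String s = "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。";
        // Prints only the final 。 (index 31), matching the reported split:
        // everything up to the last 。 becomes one span, then "。" alone.
        System.out.println("kept ender indices: " + keptEnders(s));
    }
}
```

Running this on your example keeps only the last candidate, which is consistent with the two "sentences" you are getting back.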
