Hi Jens, I do see two sentences in your input, delimited by the full stop (。) :)
On Feb 17, 2015 9:03 AM, "Jens Östlund" <[email protected]> wrote:

> Hi!
>
> I'm currently in project where we are trying to build a Chinese tokenizer
> using OpenNLP 1.5.3.
>
> We've started by trying to train and use the SentenceDetectorME with
> Chinese, but we can't get it to find the correct spans. We are trying to
> figure out what we are doing wrong.
>
> For instance, with the input:
>    "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。"
> We are getting the following two sentences:
>   "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下"
>   "。"
>
> This happens regardless of whether we use the MAXENT or PERCEPTRON
> algorithm. If we set useTokenEnd to *true*, the sentence detector only
> finds one sentence (i.e. it returns the entire input as a single span).
>
> We are training the SentenceDetector model according to the documentation (
>
> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.sentdetect.training.api
> ).
> Just to be clear, the training data consists of a text file with one
> sentence per line and an empty line between documents, about 20,000 lines
> in total.
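A minimal fragment of such a file, for illustration (the sentences are from the example above, the last line is a made-up second document; the blank line marks a document boundary):

```
婆婆前,媳妇后,闹新房的人后边一拉溜。
一进地,婆婆气吁吁先坐下。

下一个文档的第一句。
```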
>
> When stepping through the SentenceDetectorME class in a debugger, I
> noticed that the sentence-end candidates (enders) are found correctly,
> but all except the very last candidate are skipped by this if statement
> (lines 179-182):
>
> int fws = getFirstWS(s, cint + 1);
> if (i + 1 < end && enders.get(i + 1) < fws) {
>     continue;
> }
>
> Chinese does not use whitespace as a word delimiter, so getFirstWS never
> finds a whitespace before the end of the text, and the check above makes
> the surrounding for loop skip every candidate except the last. Am I doing
> something wrong, or is this a bug?
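To make the failure mode concrete outside of OpenNLP, here is a small self-contained Java sketch of that check (my own simulation; `getFirstWS` and `survivingEnders` mirror the quoted code but are not OpenNLP's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class EnderSkipDemo {
    // Mirrors the role of SentenceDetectorME.getFirstWS: index of the
    // first whitespace at or after pos, or s.length() if there is none.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos)))
            pos++;
        return pos;
    }

    // Applies the quoted skip condition to a list of ender candidates.
    static List<Integer> survivingEnders(String s, List<Integer> enders) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < enders.size(); i++) {
            int cint = enders.get(i);
            int fws = getFirstWS(s, cint + 1);
            // With no whitespace in s, fws == s.length(), so every
            // candidate except the last satisfies this and is skipped.
            if (i + 1 < enders.size() && enders.get(i + 1) < fws)
                continue;
            kept.add(cint);
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "婆婆前,媳妇后,闹新房的人后边一拉溜。一进地,婆婆气吁吁先坐下。";
        List<Integer> enders = new ArrayList<>();
        for (int i = 0; i < text.length(); i++)
            if (text.charAt(i) == '。') enders.add(i);
        // Only the final full stop survives the check.
        System.out.println(survivingEnders(text, enders));
    }
}
```

With a whitespace-delimited text (e.g. English), `getFirstWS` returns the position of the space after each sentence, the condition is false, and every candidate survives; with the whitespace-free Chinese input, only the last one does.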
>
> Best regards,
>
> Jens Östlund
>
> --
> Jens Östlund
> [email protected]
> +46 (0)76 882 84 32
>
