[ 
https://issues.apache.org/jira/browse/LUCENENET-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Digy closed LUCENENET-5.
------------------------

    Resolution: Fixed
      Assignee: Digy

Not supported version.

> CJK Tokenizer in NLS fails to stop at end of input buffer.
> ----------------------------------------------------------
>
>                 Key: LUCENENET-5
>                 URL: https://issues.apache.org/jira/browse/LUCENENET-5
>             Project: Lucene.Net
>          Issue Type: Bug
>         Environment: lucene.net.nls.1.3.2.2 on .NET 1.1 SP1
>            Reporter: Ben Tregenna
>            Assignee: Digy
>            Priority: Minor
>
> When using the CJKTokenizer from the National Language Support Pack to 
> tokenize simple Japanese text, the tokenizer fails to indicate EOS correctly. 
> Example code snippet (suitable for use as an NUnit test):
> public void SimpleTokenization()
> {
>       TextReader tr = new StringReader("???");
>       CJKTokenizer tokenizer = new CJKTokenizer(tr);
>       Assert.AreEqual("??", tokenizer.Next().TermText(), "First Token is correct");
>       Assert.AreEqual("??", tokenizer.Next().TermText(), "Second Token is correct");
>       Assert.AreEqual(string.Empty, tokenizer.Next().TermText(), "Returns empty string as final token");
>       Assert.IsNull(tokenizer.Next(), "Returns null after end of string");
> }
> The current code treats the final buffer as circular, so it returns "??" as a 
> third token and then keeps returning these three tokens cyclically. The 
> problem comes from the condition for checking EOS from the TextReader input. 
> In Java, the buffer-filling Reader.read(char[]) returns -1 at EOS, but in .NET 
> the buffer-filling TextReader.Read(char[], int, int) returns 0 at EOS (only 
> the parameterless TextReader.Read() returns -1), so the terminating condition 
> needs altering. 
> The diff to fix is pretty trivial:
> CJKTokenizer.cs: 162c162
> <                               if (dataLen == -1)
> ---
> >                               if (dataLen == 0)
> As a final note to the unwary: the comment at the start of 
> CJKTokenizer.Next() seems to indicate that null will be returned immediately 
> at EOS ("Returns the next token in the stream, or null at EOS."). However, I 
> always get an empty token and then null, as indicated in the snippet above. 
> The logic now seems to reflect the lucene-java logic exactly, so whether this 
> is a bug, a feature or a poor method summary remains unclear to me.
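
For reference, the end-of-input behaviour described in the report can be checked with a small standalone sketch. It is not taken from CJKTokenizer.cs: the EosDemo class and its CountReads helper are hypothetical names used only for illustration, and the loop tests `dataLen <= 0` as a defensive variation on the `dataLen == 0` check in the diff above.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Lucene.Net: it only illustrates the
// end-of-input return values of TextReader.
public static class EosDemo
{
    // Fills a buffer repeatedly until Read reports that no characters were
    // read, and returns how many non-empty reads happened.
    public static int CountReads(TextReader input, int bufferSize)
    {
        char[] ioBuffer = new char[bufferSize];
        int reads = 0;
        while (true)
        {
            int dataLen = input.Read(ioBuffer, 0, ioBuffer.Length);
            if (dataLen <= 0)
            {
                // End of input: the buffer overload reports 0 here, not -1.
                break;
            }
            reads++;
        }
        return reads;
    }

    public static void Main()
    {
        TextReader reader = new StringReader("abc");
        char[] buffer = new char[16];

        Console.WriteLine(reader.Read(buffer, 0, buffer.Length)); // 3 characters read
        Console.WriteLine(reader.Read(buffer, 0, buffer.Length)); // 0 at end of input
        Console.WriteLine(reader.Read());                         // -1 at end of input

        Console.WriteLine(CountReads(new StringReader("abcdef"), 4)); // 2 non-empty reads
    }
}
```

A loop that only tests for -1 after a buffer read would never see its exit condition with a .NET TextReader, which is consistent with the cyclic tokens described in the report.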

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
