[
https://issues.apache.org/jira/browse/LUCENENET-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Digy closed LUCENENET-5.
------------------------
Resolution: Fixed
Assignee: Digy
Not supported version.
> CJK Tokenizer in NLS fails to stop at end of input buffer.
> ----------------------------------------------------------
>
> Key: LUCENENET-5
> URL: https://issues.apache.org/jira/browse/LUCENENET-5
> Project: Lucene.Net
> Issue Type: Bug
> Environment: lucene.net.nls.1.3.2.2 on .NET 1.1 SP1
> Reporter: Ben Tregenna
> Assignee: Digy
> Priority: Minor
>
> When using the CJKTokenizer from the National Language Support Pack to
> tokenize simple Japanese text, the tokenizer fails to indicate EOS correctly.
> Example code snippet (suitable for use as an nUnit test):
> public void SimpleTokenization()
> {
>     TextReader tr = new StringReader("???");
>     CJKTokenizer tokenizer = new CJKTokenizer(tr);
>     Assert.AreEqual("??", tokenizer.Next().TermText(), "First Token is correct");
>     Assert.AreEqual("??", tokenizer.Next().TermText(), "Second Token is correct");
>     Assert.AreEqual(string.Empty, tokenizer.Next().TermText(), "Returns empty string as final token");
>     Assert.IsNull(tokenizer.Next(), "Returns null after end of string");
> }
> The current code treats the final buffer as circular and so returns "??" as a
> third token and then keeps returning these three tokens cyclically. The
> problem comes from the condition for checking EOS from the TextReader input.
> In Java, the buffer-filling Reader.read(char[], int, int) returns -1 on EOS,
> but in .NET TextReader.Read(char[], int, int) returns 0 on EOS, so the
> terminating condition needs altering.
> The diff to fix is pretty trivial:
> CJKTokenizer.cs: 162c162
> < if (dataLen == -1)
> ---
> > if (dataLen == 0)
> As a final note to the unwary: the comment at the start of
> CJKTokenizer.Next() seems to indicate that null will be returned immediately
> at EOS ("Returns the next token in the stream, or null at EOS."). However, I
> always get an empty token and then null, as indicated in the snippet above.
> The logic now seems to reflect the lucene-java logic exactly, so whether this
> is a bug, a feature or a poor method summary remains unclear to me.
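
For readers checking the EOS behaviour described in the report, here is a minimal standalone sketch. It does not use Lucene.Net; the class name EosDemo and the helper Drain are hypothetical and exist only for illustration. It assumes the tokenizer fills its buffer through the TextReader.Read(char[], int, int) overload, which reports 0 at end of input, whereas the single-character TextReader.Read() reports -1.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of Lucene.Net.
public static class EosDemo
{
    // Drains a reader in fixed-size chunks and stops when the buffer
    // overload of Read reports 0 characters, the .NET end-of-input signal.
    public static string Drain(TextReader reader, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        StringBuilder result = new StringBuilder();
        while (true)
        {
            int dataLen = reader.Read(buffer, 0, buffer.Length);
            if (dataLen == 0) // a Java-style "== -1" check would never be true here
            {
                break;
            }
            result.Append(buffer, 0, dataLen);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Drain(new StringReader("abc"), 2));            // abc
        Console.WriteLine(new StringReader("").Read());                  // -1
        Console.WriteLine(new StringReader("").Read(new char[2], 0, 2)); // 0
    }
}
```

With a "dataLen == -1" test in place of "dataLen == 0", the loop in Drain would never terminate, which is consistent with the endless token cycle the reporter describes.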