[ https://issues.apache.org/jira/browse/LUCENE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477115 ]

Doron Cohen commented on LUCENE-759:
------------------------------------

I have two comments/questions on the n-gram tokenizers:

(1) It seems that only the first 1024 characters of the input are handled and the 
rest is ignored (and I think that, as a result, the input stream is left open 
without being closed). 

If you add this test case:

    /**
     * Test that no ngrams are lost, even for really long inputs
     * @throws Exception
     */
    public void testLongerInput() throws Exception {
      int expectedNumTokens = 1024;
      int ngramLength = 2;
      // prepare long string
      StringBuffer sb = new StringBuffer();
      while (sb.length()<expectedNumTokens+ngramLength-1) 
        sb.append('a');
      
      StringReader longStringReader = new StringReader(sb.toString());
      NGramTokenizer tokenizer =
          new NGramTokenizer(longStringReader, ngramLength, ngramLength);
      int numTokens = 0;
      Token token;
      while ((token = tokenizer.next())!=null) {
        numTokens++;
        assertEquals("aa",token.termText());
      }
      assertEquals("wrong number of tokens",expectedNumTokens,numTokens);
    }

With expectedNumTokens = 1023 it would pass, but any larger number would fail. 
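One possible fix (just a sketch on my side, not tested against the patch) would be 
to keep reading until the reader is exhausted, instead of relying on a single 
fixed-size read, along these lines:

            // sketch: accumulate the whole input instead of one 1024-char read
            StringBuffer buffer = new StringBuffer();
            char[] chars = new char[1024];
            int n;
            while ((n = input.read(chars)) != -1) {
              buffer.append(chars, 0, n);
            }
            String inStr = buffer.toString();
            // the reader can then be closed (e.g. from Tokenizer.close()),
            // so the stream is not left dangling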

(2) It seems safer to read the characters like this:
            int n = input.read(chars);
            inStr = new String(chars, 0, n);
(This way we are not relying on String.trim(), which does work, but still worries me.)
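
One more small thing there: read() returns -1 when the reader is empty or already 
exhausted, so the length probably needs a guard before constructing the String, e.g.:

            int n = input.read(chars);
            if (n < 0) n = 0;   // read() returns -1 at end of stream
            inStr = new String(chars, 0, n);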



> Add n-gram tokenizers to contrib/analyzers
> ------------------------------------------
>
>                 Key: LUCENE-759
>                 URL: https://issues.apache.org/jira/browse/LUCENE-759
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Otis Gospodnetic
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-759.patch, LUCENE-759.patch, LUCENE-759.patch
>
>
> It would be nice to have some n-gram-capable tokenizers in contrib/analyzers. 
>  Patch coming shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

