[ https://issues.apache.org/jira/browse/LUCENE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477115 ]
Doron Cohen commented on LUCENE-759:
------------------------------------

I have two comments/questions on the n-gram tokenizers:

(1) It seems that only the first 1024 characters of the input are handled, and the rest is ignored (and I think that as a result the input stream would be left dangling open).

If you add this test case:

    /**
     * Test that no ngrams are lost, even for really long inputs
     * @throws Exception
     */
    public void testLongerInput() throws Exception {
      int expectedNumTokens = 1024;
      int ngramLength = 2;
      // prepare long string
      StringBuffer sb = new StringBuffer();
      while (sb.length() < expectedNumTokens + ngramLength - 1)
        sb.append('a');

      StringReader longStringReader = new StringReader(sb.toString());
      NGramTokenizer tokenizer = new NGramTokenizer(longStringReader, ngramLength, ngramLength);
      int numTokens = 0;
      Token token;
      while ((token = tokenizer.next()) != null) {
        numTokens++;
        assertEquals("aa", token.termText());
      }
      assertEquals("wrong number of tokens", expectedNumTokens, numTokens);
    }

With expectedNumTokens = 1023 it would pass, but any larger number would fail.

(2) It seems safer to read the characters like this:

    int n = input.read(chars);
    inStr = new String(chars, 0, n);

(This way we do not count on String.trim() to strip the unused part of the buffer. That does work, but it worries me.)

> Add n-gram tokenizers to contrib/analyzers
> ------------------------------------------
>
>                 Key: LUCENE-759
>                 URL: https://issues.apache.org/jira/browse/LUCENE-759
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Otis Gospodnetic
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-759.patch, LUCENE-759.patch, LUCENE-759.patch
>
>
> It would be nice to have some n-gram-capable tokenizers in contrib/analyzers.
> Patch coming shortly.
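To illustrate both points from the comment, here is a minimal sketch that drains a Reader in a loop (so nothing beyond the first 1024 characters is dropped) and uses the count returned by read() instead of String.trim(). It is not the code from the LUCENE-759 patch: the class and method names (NGramReadSketch, readAll, ngrams) are hypothetical, it uses plain java.io rather than Lucene's Tokenizer API, and buffering the whole input in memory is a simplifying assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or of the attached patch.
final class NGramReadSketch {

    /** Reads everything from the reader, trusting only the count returned by read(). */
    static String readAll(Reader input) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chars = new char[1024];
        int n;
        // read() returns -1 at end of stream; otherwise n is the number of chars filled.
        while ((n = input.read(chars)) != -1) {
            sb.append(chars, 0, n); // copy only the filled part, so no trim() is needed
        }
        return sb.toString();
    }

    /** Returns all character n-grams of the given size, in order of position. */
    static List<String> ngrams(String text, int size) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i++) {
            grams.add(text.substring(i, i + size));
        }
        return grams;
    }

    public static void main(String[] args) throws IOException {
        int expectedNumTokens = 2048; // deliberately larger than the 1024-char buffer
        int ngramLength = 2;
        StringBuilder sb = new StringBuilder();
        while (sb.length() < expectedNumTokens + ngramLength - 1) {
            sb.append('a');
        }
        // try-with-resources closes the reader, so the stream is not left open.
        try (Reader reader = new StringReader(sb.toString())) {
            List<String> grams = ngrams(readAll(reader), ngramLength);
            System.out.println(grams.size()); // prints 2048
        }
    }
}
```

Because the buffer is copied with the explicit length n, leading or trailing whitespace in the input survives, which String.trim() would silently remove.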