Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
===================================================================
--- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
+++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
@@ -209,7 +209,7 @@
         //make a backup in case we exceed the word count
         System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
       }
-      if (termBuffer.length < factory.maxTokenLength) {
+      if (termBufferLength < factory.maxTokenLength) {
         int wordCount = 0;

         int lastWordStart = 0;
@@ -226,8 +226,8 @@
         }

         // process the last word
-        if (lastWordStart < termBuffer.length) {
-          factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
+        if (lastWordStart < termBufferLength) {
+          factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
         }

         if (wordCount > factory.maxWordCount) {

On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir <rcm...@gmail.com> wrote:
> Mark, I looked at this and think it might be unrelated to tokenstreams.
>
> I think the length argument being provided to processWord(char[]
> buffer, int offset, int length, int wordCount) in that filter might be
> incorrectly calculated.
> This is the method that checks the keep list.
>
> (There is trailing trash on the end of tokens, even with the previous
> version of Lucene in Solr).
> It just so happens the tokens with trailing trash were ones that were
> keep words in the previous version, so the test didn't fail.
>
> Different tokens have trailing trash in the current version
> (specifically some of the "the" tokens), so it's failing now.
>
>
> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller <markrmil...@gmail.com> wrote:
>> I think there is an issue here, but I didn't follow the TokenStream
>> improvements very closely.
>>
>> In Solr, CapitalizationFilterFactory has a CharArraySet that it loads up
>> with keep words - it then checks (with the old TokenStream API) each token
>> (char array) to see if it should keep it. I think because of the cloning
>> going on in next, this breaks and you can't match anything in the keep set.
>> Does that make sense?
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
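
For anyone following along, here is a small standalone sketch of why the patch swaps termBuffer.length for termBufferLength. It is only an illustration under assumptions: the class name TrailingTrashDemo, the isKeepWord helper, and the plain HashSet are hypothetical stand-ins (the real filter checks a CharArraySet inside processWord). The point is that a reused term buffer can be larger than the current token, so using the array's capacity as the length drags stale characters into the keep-word lookup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TrailingTrashDemo {
    // Hypothetical stand-in for the keep-word set; the real filter uses a CharArraySet.
    static final Set<String> KEEP = new HashSet<>(Arrays.asList("the"));

    // Hypothetical helper: looks up buffer[offset .. offset+length) in the keep set.
    static boolean isKeepWord(char[] buffer, int offset, int length) {
        return KEEP.contains(new String(buffer, offset, length));
    }

    public static void main(String[] args) {
        // A reused buffer with capacity 8; only the first 3 chars belong to the
        // current token, the rest is left over from an earlier, longer token.
        char[] termBuffer = "thexyzzy".toCharArray();
        int termBufferLength = 3;
        int lastWordStart = 0;

        // Before the patch: length derived from the array capacity.
        boolean capacityBased =
            isKeepWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart);

        // After the patch: length derived from the valid token length.
        boolean validLengthBased =
            isKeepWord(termBuffer, lastWordStart, termBufferLength - lastWordStart);

        System.out.println("capacity-based lookup matches:     " + capacityBased);    // false
        System.out.println("valid-length-based lookup matches: " + validLengthBased); // true
    }
}
```

With the capacity-based length the lookup sees "thexyzzy" rather than "the", which is the kind of trailing trash described in the quoted message.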