Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
===================================================================
--- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(revision 778975)
+++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(working copy)
@@ -209,7 +209,7 @@
         //make a backup in case we exceed the word count
         System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
       }
-      if (termBuffer.length < factory.maxTokenLength) {
+      if (termBufferLength < factory.maxTokenLength) {
         int wordCount = 0;

         int lastWordStart = 0;
@@ -226,8 +226,8 @@
         }

         // process the last word
-        if (lastWordStart < termBuffer.length) {
-          factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
+        if (lastWordStart < termBufferLength) {
+          factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
         }

         if (wordCount > factory.maxWordCount) {


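The fix makes sense given how Lucene reuses the term buffer: the char[] is over-allocated and carried between tokens, so its capacity (termBuffer.length) can exceed the current token's logical length (termBufferLength). A minimal, self-contained sketch of the "trailing trash" effect (hypothetical demo code, not the actual Lucene source):

```java
// Demonstrates why termBuffer.length (array capacity) and termBufferLength
// (logical token length) diverge when a buffer is reused between tokens.
public class TermBufferDemo {
    // Extract the token text using an explicit length, the way
    // processWord(char[] buffer, int offset, int length, int wordCount)
    // is meant to receive it.
    static String tokenText(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] termBuffer = new char[16];           // reused, over-allocated buffer
        "elephant".getChars(0, 8, termBuffer, 0);   // a previous, longer token
        "the".getChars(0, 3, termBuffer, 0);        // current token overwrites the start
        int termBufferLength = 3;                   // logical length of the current token

        // Using the array capacity picks up stale characters from the prior token:
        System.out.println(tokenText(termBuffer, termBuffer.length).trim()); // thephant
        // Using the logical length recovers the real token:
        System.out.println(tokenText(termBuffer, termBufferLength));         // the
    }
}
```

With the wrong length, a keep-word lookup for "the" compares against "thephant..." and never matches, which is consistent with the test failure described below.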
On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rcm...@gmail.com> wrote:
> Mark, I looked at this and think it might be unrelated to tokenstreams.
>
> I think the length argument being provided to processWord(char[]
> buffer, int offset, int length, int wordCount) in that filter might be
> calculated incorrectly.
> This is the method that checks the keep list.
>
> (There is trailing trash on the end of tokens, even with the previous
> version of Lucene in Solr.)
> It just so happens that the tokens with trailing trash were ones that
> were keep words in the previous version, so the test didn't fail.
>
> Different tokens have trailing trash in the current version
> (specifically some of the "the" tokens), so it's failing now.
>
>
> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
>> I think there is an issue here, but I didn't follow the TokenStream
>> improvements very closely.
>>
>> In Solr, CapitalizationFilterFactory has a CharArraySet that it loads up
>> with keep words - it then checks (with the old TokenStream API) each token
>> (char array) to see if it should keep it. I think because of the cloning
>> going on in next(), this breaks and you can't match anything in the keep set.
>> Does that make sense?
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
