Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your
"slow" approach?  Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once per
token stream and then invoked once for each token. Every invocation carries
some overhead, and there is not much I can do about that (sigh...)

2. Can the stemmer return something other than a String?  Say a String array
of all the stemmed words?  Or maybe even some type of object that tells you
the original word and the stemmed word?

The stemmer can only return a String. But, I do know that the returned
string always has exactly the same number of words as the input string. So
logically, it would be possible to:
a) first calculate the position/start/end of each token in the input string
(usual tokenization by Whitespace), resulting in token list 1
b) then invoke the stemmer, and tokenize that result by Whitespace,
resulting in token list 2
c) 'merge' the token values of token list 2 into token list 1, which is
possible because each token's position is the same in both lists...
d) return that 'merged' token list 1 for further processing
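
In plain Java, steps a) to d) might look roughly like the sketch below. It
deliberately uses no Lucene or Solr classes: the Token class, the class and
method names, and the stand-in "stemmer" that just strips "xxx" are all
made up for illustration, and offsets are 0-based with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class StemMergeSketch {

    /** Hypothetical token: text plus character offsets into the ORIGINAL input. */
    static final class Token {
        final String text;
        final int start; // inclusive, 0-based
        final int end;   // exclusive

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Step a): split on whitespace, recording where each token starts and ends. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) {
                tokens.add(new Token(input.substring(start, i), start, i));
            }
        }
        return tokens;
    }

    /** Steps b) to d): stem the whole text once, then merge by token index. */
    static List<Token> stemAndMerge(String input, UnaryOperator<String> stemmer) {
        List<Token> original = tokenize(input);               // token list 1
        List<Token> stemmed = tokenize(stemmer.apply(input)); // token list 2
        if (original.size() != stemmed.size()) {
            // The whole approach relies on the word counts being equal.
            throw new IllegalStateException("stemmer changed the number of words");
        }
        List<Token> merged = new ArrayList<>(original.size());
        for (int k = 0; k < original.size(); k++) {
            Token o = original.get(k);
            // Stemmed text, but offsets of the original word.
            merged.add(new Token(stemmed.get(k).text, o.start, o.end));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for the real COM stemmer: simply removes "xxx".
        UnaryOperator<String> fakeStemmer = s -> s.replace("xxx", "");
        for (Token t : stemAndMerge("abcdxxx defgxxx", fakeStemmer)) {
            System.out.println(t.text + " start=" + t.start + " end=" + t.end);
        }
    }
}
```

With the stand-in stemmer this prints "abcd start=0 end=7" and
"defg start=8 end=15", i.e. the stemmed text with the offsets of the
original words. The stemmer is still invoked only once per input text.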

Would this work in Solr?

I can do some Java coding to achieve that from a logical point of view, but I
wouldn't know how to structure this flow into the MyTokenizerFactory, so
some hints to achieve that would be great!

Thanks for helping out!

Jaco.


2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

>
> On Sep 26, 2008, at 9:40 AM, Jaco wrote:
>
>  Hi,
>>
>> Here's some of the code of my Tokenizer:
>>
>> public class MyTokenizerFactory extends BaseTokenizerFactory
>> {
>>   public WhitespaceTokenizer create(Reader input)
>>   {
>>       String text, normalizedText;
>>
>>       try {
>>           text  = IOUtils.toString(input);
>>           normalizedText    = *invoke my stemmer(text)*;
>>
>>       }
>>       catch( IOException ex ) {
>>           throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
>>       }
>>
>>       StringReader    stringReader = new StringReader(normalizedText);
>>
>>       return new WhitespaceTokenizer(stringReader);
>>   }
>> }
>>
>> I see what's going on in the analysis tool now, and I think I understand the
>> problem. For instance, the text: abcdxxx defgxxx. Let's assume the stemmer
>> gets rid of xxx.
>>
>> I would then see this in the analysis tool after the tokenizer stage:
>> - abcd - term position 1; start: 1; end:  3
>> - defg - term position 2; start: 4; end: 7
>>
>> These positions are not in line with the initial search text - this must be
>> why the highlighting goes wrong. I guess my little trick to do this was a
>> bit too simple because it messes up the positions basically because
>> something different from the original source text is tokenized.
>>
>
> Yes, this is exactly the problem.  I don't know enough about com4J or your
> stemmer, but some things come to mind:
>
> 1. Are you having to restart/initialize the stemmer every time for your
> "slow" approach?  Does that really need to happen?
> 2. Can the stemmer return something other than a String?  Say a String
> array of all the stemmed words?  Or maybe even some type of object that
> tells you the original word and the stemmed word?
>
> -Grant
>
