Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your "slow" approach? Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once for a token stream, and then invoked once for each token. The invoke always has an overhead, not much to do about that (sigh...)

2. Can the stemmer return something other than a String? Say a String array of all the stemmed words? Or maybe even some type of object that tells you the original word and the stemmed word?

The stemmer can only return a String. But I do know that the returned string always has exactly the same number of words as the input string. So logically, it would be possible to:

a) first calculate the position/start/end of each token in the input string (usual tokenization by whitespace), resulting in token list 1
b) then invoke the stemmer, and tokenize that result by whitespace, resulting in token list 2
c) 'merge' the token values of token list 2 into token list 1, which is possible because each token's position is the same in both lists...
d) return that 'merged' token list 1 for further processing

Would this work in Solr? I can do some Java coding to achieve that from a logical point of view, but I wouldn't know how to structure this flow into the MyTokenizerFactory, so some hints to achieve that would be great!

Thanks for helping out!

Jaco.
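
For illustration, steps a) to d) could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: the class StemOffsetMerger and its Tok holder are hypothetical names, the stemmer call is left out (the stemmed string is simply passed in), and nothing here touches the Solr or Lucene API. It relies on the point made above, that the stemmer returns exactly as many whitespace-separated words as it was given.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
public class StemOffsetMerger {

    /** One token: term text plus offsets into the ORIGINAL, unstemmed text. */
    public static final class Tok {
        public final String term;
        public final int start;     // inclusive character offset
        public final int end;       // exclusive character offset
        public final int position;  // 1-based term position

        Tok(String term, int start, int end, int position) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    /** Steps a) and b): split on whitespace, remembering where each token sits. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip the whitespace between tokens.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume one token.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            out.add(new Tok(text.substring(start, i), start, i, out.size() + 1));
        }
        return out;
    }

    /** Steps c) and d): stemmed terms, but offsets and positions from the original text. */
    public static List<Tok> merge(String original, String stemmed) {
        List<Tok> originals = tokenize(original);
        List<Tok> stems = tokenize(stemmed);
        // Assumption from the thread: the stemmer keeps the word count unchanged.
        if (originals.size() != stems.size()) {
            throw new IllegalArgumentException("word counts differ: "
                    + originals.size() + " vs " + stems.size());
        }
        List<Tok> merged = new ArrayList<>(originals.size());
        for (int k = 0; k < originals.size(); k++) {
            Tok o = originals.get(k);
            merged.add(new Tok(stems.get(k).term, o.start, o.end, o.position));
        }
        return merged;
    }

    public static void main(String[] args) {
        // The stemmed string is hard-coded here in place of the real stemmer call.
        for (Tok t : merge("abcdxxx defgxxx", "abcd defg")) {
            System.out.println(t.term + " position=" + t.position
                    + " start=" + t.start + " end=" + t.end);
        }
    }
}
```

With the example text from the thread, the merged tokens carry the stemmed terms abcd and defg but keep the offsets of abcdxxx and defgxxx in the original text, which is what highlighting needs. Inside Solr the same values would have to be set on the tokens emitted by a custom tokenizer; how exactly depends on the Lucene version in use and is not shown here.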

2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

>
> On Sep 26, 2008, at 9:40 AM, Jaco wrote:
>
>> Hi,
>>
>> Here's some of the code of my Tokenizer:
>>
>> public class MyTokenizerFactory extends BaseTokenizerFactory
>> {
>>     public WhitespaceTokenizer create(Reader input)
>>     {
>>         String text, normalizedText;
>>
>>         try {
>>             text = IOUtils.toString(input);
>>             normalizedText = *invoke my stemmer(text)*;
>>
>>         }
>>         catch( IOException ex ) {
>>             throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
>>         }
>>
>>         StringReader stringReader = new StringReader(normalizedText);
>>
>>         return new WhitespaceTokenizer(stringReader);
>>     }
>> }
>>
>> I see what's going in the analysis tool now, and I think I understand the
>> problem. For instance, the text: abcdxxx defgxxx. Let's assume the stemmer
>> gets rid of xxx.
>>
>> I would then see this in the analysis tool after the tokenizer stage:
>> - abcd - term position 1; start: 1; end: 3
>> - defg - term position 2; start: 4; end: 7
>>
>> These positions are not in line with the initial search text - this must be
>> why the highlighting goes wrong. I guess my little trick to do this was a
>> bit too simple because it messes up the positions basically because
>> something different from the original source text is tokenized.
>>
>
> Yes, this is exactly the problem. I don't know enough about com4J or your
> stemmer, but some things come to mind:
>
> 1. Are you having to restart/initialize the stemmer every time for your
> "slow" approach? Does that really need to happen?
> 2. Can the stemmer return something other than a String? Say a String
> array of all the stemmed words? Or maybe even some type of object that
> tells you the original word and the stemmed word?
>
> -Grant
>