Hi,

Here's some of the code of my Tokenizer:
    public class MyTokenizerFactory extends BaseTokenizerFactory {
      public WhitespaceTokenizer create(Reader input) {
        String text, normalizedText;
        try {
          text = IOUtils.toString(input);
          normalizedText = invokeMyStemmer(text); // the external stemmer call
        } catch (IOException ex) {
          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
        }
        StringReader stringReader = new StringReader(normalizedText);
        return new WhitespaceTokenizer(stringReader);
      }
    }

I can see what's going on in the analysis tool now, and I think I
understand the problem. Take for instance the text: abcdxxx defgxxx.
Let's assume the stemmer gets rid of the xxx. After the tokenizer
stage, the analysis tool then shows:

- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7

These offsets refer to the stemmed text, not to the original search
text - this must be why the highlighting goes wrong. I guess my little
trick was a bit too simple: because something different from the
original source text is tokenized, the offsets no longer line up with
the input. Any suggestions would be very welcome...

Cheers,

Jaco.

2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

> How are you creating the tokens? What are you setting for the offsets
> and the positions?
>
> One thing that is helpful is Solr's built-in Analysis tool in the
> Admin interface (http://localhost:8983/solr/admin/). From there, you
> can turn on verbose mode and see what the positions and offsets are
> after every piece of your Analyzer.
>
> -Grant
>
> On Sep 26, 2008, at 3:10 AM, Jaco wrote:
>
>> Hello,
>>
>> I need to work with an external stemmer in Solr. This stemmer is
>> accessible as a COM object (I'm running Solr in Tomcat on a Windows
>> platform). I managed to integrate it using the com4j library. I
>> tested two scenarios:
>>
>> 1. Create a custom FilterFactory and Filter class. The external
>> stemmer is then invoked for every token.
>> 2. Create a custom TokenizerFactory (extending BaseTokenizerFactory)
>> that invokes the external stemmer on the entire search text, puts
>> the result into a StringReader, and finally returns new
>> WhitespaceTokenizer(stringReader), so the stemmed text gets
>> tokenized by the whitespace tokenizer.
>>
>> Looking at search results, both scenarios appear to work from a
>> functional point of view. The first scenario, however, is too slow
>> because of the overhead of calling the external COM object for each
>> token.
>>
>> The second scenario is much faster and also gives correct search
>> results. However, it causes problems with highlighting - sometimes
>> errors are reported (String index out of range), in other cases I
>> get incorrect highlight fragments. Without knowing all the details
>> of this stuff, that makes sense, because the text is changed before
>> it is tokenized. Maybe my second scenario does not make sense at
>> all..?
>>
>> Any ideas on how to overcome this, or any other suggestions on how
>> to realise this?
>>
>> Thanks, bye,
>>
>> Jaco.
>>
>> PS I posted this message twice before but it didn't come through
>> (spam filtering..??), so this is the 2nd try with the text changed
>> a bit.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
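[Editor's note] One way to keep the per-token approach of scenario 1 -
which leaves positions and offsets intact, so highlighting keeps
working - while cutting down the COM round-trips is to cache stemmer
results per distinct term, since terms repeat heavily across documents.
Below is a minimal sketch, assuming the Lucene 2.4-era Token API;
invokeExternalStemmer() is a hypothetical wrapper around the com4j
call, not an existing method:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /**
     * Stems each token via the external stemmer, caching results so
     * the COM object is invoked only once per distinct term. Positions
     * and offsets are left untouched, so highlighting stays correct.
     */
    public final class CachingStemFilter extends TokenFilter {

      private final Map<String, String> cache = new HashMap<String, String>();

      public CachingStemFilter(TokenStream input) {
        super(input);
      }

      public Token next(Token reusableToken) throws IOException {
        Token token = input.next(reusableToken);
        if (token == null) {
          return null;
        }
        String term = token.term();
        String stemmed = cache.get(term);
        if (stemmed == null) {
          stemmed = invokeExternalStemmer(term);
          cache.put(term, stemmed);
        }
        // Only the term text is replaced; start/end offsets and the
        // position increment are carried over from the original token.
        token.setTermBuffer(stemmed);
        return token;
      }

      private String invokeExternalStemmer(String term) {
        // Hypothetical placeholder - call the external COM stemmer
        // via com4j here.
        return term;
      }
    }

Wired up behind the whitespace tokenizer via a FilterFactory (as in
scenario 1), this makes the COM overhead proportional to the number of
distinct terms rather than the total token count. If memory is a
concern, the unbounded HashMap could be swapped for a size-limited
(LRU) map.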