How are you creating the tokens? What are you setting for the offsets and the positions?

One thing that is helpful is Solr's built-in Analysis tool in the Admin interface (http://localhost:8983/solr/admin/). From there, you can turn on verbose mode and see the positions and offsets produced by every piece of your Analyzer.
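
To show why the offsets matter, here is a minimal sketch in plain Java with no Lucene classes. The class name, the method names, and the sample stemmer output are all made up for illustration. It assumes the tokens are cut from the stemmed text and that the resulting offsets are later applied to the stored, unstemmed text, which is what a highlighter would do.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetMismatchDemo {

    /** Returns {start, end} character offsets of each whitespace-separated token in text. */
    public static List<int[]> whitespaceOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    /** Cuts a fragment out of the stored text; returns null if the offsets do not fit. */
    public static String fragment(String stored, int[] offset) {
        if (offset[1] > stored.length()) {
            return null; // an unguarded substring() would throw here
        }
        return stored.substring(offset[0], offset[1]);
    }

    public static void main(String[] args) {
        String original = "running quickly"; // what is stored in the field
        String stemmed = "run quick";        // hypothetical output of the external stemmer

        // Offsets are computed on the stemmed text but applied to the original text.
        for (int[] o : whitespaceOffsets(stemmed)) {
            System.out.println(o[0] + "-" + o[1] + " -> \"" + fragment(original, o) + "\"");
        }
        // The second token has offsets 4-9, which selects "ing q" in the original.
    }
}
```

If the stemmer output were ever longer than the original text, the same offsets could run past the end of the stored value, which would be consistent with an out-of-range error.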


On Sep 26, 2008, at 3:10 AM, Jaco wrote:


I need to work with an external stemmer in Solr. This stemmer is accessible as a COM object (I am running Solr in Tomcat on Windows). I managed to
integrate this using the com4j library. I tested two scenarios:
1. Create a custom FilterFactory and Filter class for this. The external
stemmer is then invoked for every token
2. Create a custom TokenizerFactory (extending BaseTokenizerFactory), that
invokes the external stemmer for the entire search text, then puts the
result of this into a StringReader, and finally returns new
WhitespaceTokenizer(stringReader), so the stemmed text gets tokenized by the
whitespace tokenizer.

Looking at search results, both scenarios appear to work from a functional
point of view. The first scenario however is too slow because of the
overhead of calling the external COM object for each token.

The second scenario is much faster and also gives correct search results.
However, it causes problems with highlighting: sometimes errors are
reported (String out of Range), and in other cases I get incorrect highlight
fragments. Without knowing all the details about this stuff, I suppose this
makes sense, because the text is changed before it is tokenized. Maybe my
second scenario does not make sense at all..?
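
One possible way to keep the speed of the second scenario without losing the offsets is sketched below in plain Java. It is only an illustration: `BatchStemWithOffsets`, `Term`, `analyze` and `toyStem` are hypothetical names, `toyStem` is a stand-in for the COM stemmer, and the whole idea assumes that the stemmer returns exactly one stem per input word, in the same order. The original text is tokenized first so that the offsets refer to the original, then all words go to the stemmer in a single call, and each stem is paired with the offsets of the word it came from. Wiring this into a Solr tokenizer is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchStemWithOffsets {

    /** A stemmed term plus the offsets of its source word in the original text. */
    public static final class Term {
        public final String text;
        public final int start;
        public final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    public static List<Term> analyze(String original, Function<String, String> batchStemmer) {
        // 1. Tokenize the ORIGINAL text on whitespace and remember each word's offsets.
        List<int[]> offsets = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        int i = 0;
        while (i < original.length()) {
            while (i < original.length() && Character.isWhitespace(original.charAt(i))) i++;
            int start = i;
            while (i < original.length() && !Character.isWhitespace(original.charAt(i))) i++;
            if (i > start) {
                if (joined.length() > 0) joined.append(' ');
                joined.append(original, start, i);
                offsets.add(new int[] {start, i});
            }
        }
        List<Term> terms = new ArrayList<>();
        if (offsets.isEmpty()) {
            return terms;
        }

        // 2. One call to the stemmer for the whole text instead of one call per token.
        String[] stems = batchStemmer.apply(joined.toString()).trim().split("\\s+");

        // 3. Pairing only works under the one-stem-per-word assumption, so check it.
        if (stems.length != offsets.size()) {
            throw new IllegalStateException("stemmer changed the number of tokens");
        }
        for (int k = 0; k < stems.length; k++) {
            terms.add(new Term(stems[k], offsets.get(k)[0], offsets.get(k)[1]));
        }
        return terms;
    }

    /** Toy stand-in for the external stemmer: strips a trailing "ning" or "ly". */
    public static String toyStem(String text) {
        return text.replaceAll("(ning|ly)\\b", "");
    }

    public static void main(String[] args) {
        for (Term t : analyze("running  quickly", BatchStemWithOffsets::toyStem)) {
            System.out.println(t.text + " " + t.start + "-" + t.end);
        }
    }
}
```

If the stemmer can merge or split words, the check in step 3 fails and this simple pairing is not enough.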

Any ideas on how to overcome this or any other suggestions on how to realise

Thanks, bye,


PS I posted this message twice before but it didn't come through (spam
filtering..??), so this is the 2nd try with text changed a bit

Grant Ingersoll

