Integrating external stemmer in Solr and pre-processing text

Jaco Fri, 26 Sep 2008 00:10:49 -0700

Hello,

I need to work with an external stemmer in Solr. This stemmer is accessible
as a COM object (Solr runs in Tomcat on Windows). I managed to integrate it
using the com4j library. I tested two scenarios:
1. Create a custom FilterFactory and Filter class for this. The external
stemmer is then invoked for every token.
2. Create a custom TokenizerFactory (extending BaseTokenizerFactory), that
invokes the external stemmer for the entire search text, then puts the
result of this into a StringReader, and finally returns new
WhitespaceTokenizer(stringReader), so the stemmed text gets tokenized by the
whitespace tokenizer.
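
To make the difference between the two scenarios concrete, here is a minimal
sketch in plain Java with no Solr or com4j dependency. ToyStemmer is a
hypothetical stand-in for the COM stemmer (its "drop a trailing s" rule is
invented for illustration), and the method names are assumptions, not Solr API.
It only shows how many times the stemmer gets called in each case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StemScenarios {
    /** Hypothetical stand-in for the COM stemmer; counts how often it is invoked. */
    static final class ToyStemmer {
        int calls = 0;

        String stem(String text) {
            calls++;
            StringBuilder out = new StringBuilder();
            for (String word : splitOnWhitespace(text)) {
                if (out.length() > 0) out.append(' ');
                // Toy rule: drop a trailing "s" from words longer than three characters.
                boolean plural = word.length() > 3 && word.endsWith("s");
                out.append(plural ? word.substring(0, word.length() - 1) : word);
            }
            return out.toString();
        }
    }

    /** Splits on runs of whitespace; returns an empty list for blank input. */
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return new ArrayList<>();
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Scenario 1: one stemmer call per token. */
    static List<String> stemPerToken(ToyStemmer stemmer, String text) {
        List<String> result = new ArrayList<>();
        for (String token : splitOnWhitespace(text)) {
            result.add(stemmer.stem(token));
        }
        return result;
    }

    /** Scenario 2: one stemmer call for the whole text, then whitespace tokenization. */
    static List<String> stemWholeText(ToyStemmer stemmer, String text) {
        return splitOnWhitespace(stemmer.stem(text));
    }
}
```

For the input "cats chase dogs quickly", both methods return the same four
tokens, but stemPerToken calls the stemmer four times and stemWholeText calls
it once.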


Looking at search results, both scenarios appear to work from a functional
point of view. The first scenario, however, is too slow because of the
overhead of calling the external COM object for each token.

The second scenario is much faster, and also gives correct search results.
However, it causes problems with highlighting: sometimes errors are
reported (String out of Range), and in other cases I get incorrect highlight
fragments. I don't know all the details, but this seems plausible: the text
is changed before it is tokenized, so the token positions presumably no
longer match the original text. Maybe my second scenario does not make sense
at all?
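
The suspected cause can be reproduced outside Solr. The following is a
minimal sketch in plain Java, under the assumption that the highlighter cuts
fragments out of the original stored text using the start/end offsets of the
tokens. The class and method names are hypothetical, and the stemmed strings
are made-up examples rather than output of the real stemmer.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDriftDemo {
    /** Whitespace-tokenizes text and records [start, end) offsets into that same text. */
    static List<int[]> whitespaceOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace, then consume one token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) spans.add(new int[] {start, i});
        }
        return spans;
    }

    /** Mimics a highlighter: cuts fragments out of the stored text using token offsets. */
    static List<String> fragments(String storedText, List<int[]> spans) {
        List<String> out = new ArrayList<>();
        for (int[] span : spans) {
            out.add(storedText.substring(span[0], span[1]));
        }
        return out;
    }
}
```

With offsets taken from the stemmed text "run dog bark" and applied to the
original "running dogs barked", fragments returns "run", "ing" and "dogs",
which are the wrong pieces. If the stemmed text is longer than the original
(for example "mouse run" for "mice ran"), substring throws a
StringIndexOutOfBoundsException, which would match the out-of-range errors.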

Any ideas on how to overcome this or any other suggestions on how to realise
this?

Thanks, bye,

Jaco.

PS I posted this message twice before but it didn't come through (spam
filtering?), so this is the 2nd try with the text changed a bit.
