Hello,

I need to work with an external stemmer in Solr. The stemmer is accessible as a COM object (I am running Solr in Tomcat on Windows), and I managed to integrate it using the com4j library. I tested two scenarios:

1. Create a custom FilterFactory and Filter class. The external stemmer is then invoked for every token.
2. Create a custom TokenizerFactory (extending BaseTokenizerFactory) that invokes the external stemmer for the entire search text, puts the result into a StringReader, and finally returns new WhitespaceTokenizer(stringReader), so the stemmed text gets tokenized by the whitespace tokenizer.
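
To make scenario 2 concrete, here is a minimal plain-Java sketch of what it effectively does. It uses no Solr, Lucene, or com4j classes: `stemText` is a hypothetical stand-in for the external stemmer (it only strips a trailing "ing" or "s"), and `whitespaceOffsets` merely mimics the start/end offsets a whitespace tokenizer would report. The thing to notice is that the offsets are computed against the stemmed text, not the original.

```java
import java.util.ArrayList;
import java.util.List;

public class StemOffsetSketch {

    // Hypothetical stand-in for the external COM stemmer (assumption, not the real one):
    // strips a trailing "ing" or "s" from each space-separated word.
    public static String stemText(String text) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (word.endsWith("ing") && word.length() > 5) {
                word = word.substring(0, word.length() - 3);
            } else if (word.endsWith("s") && word.length() > 3) {
                word = word.substring(0, word.length() - 1);
            }
            out.append(word);
        }
        return out.toString();
    }

    // Splits on spaces and returns {start, end} offsets of each token,
    // relative to whatever string is passed in.
    public static List<int[]> whitespaceOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') {
                i++;
            }
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String original = "running dogs bark";
        String stemmed = stemText(original); // the whole text is stemmed first

        // Offsets come from the stemmed text, but are then applied to the original.
        for (int[] o : whitespaceOffsets(stemmed)) {
            String token = stemmed.substring(o[0], o[1]);
            int end = Math.min(o[1], original.length());
            String inOriginal = original.substring(Math.min(o[0], end), end);
            System.out.println(token + " [" + o[0] + "," + o[1] + ") -> original at those offsets: \""
                    + inOriginal + "\"");
        }
    }
}
```

With this input the stemmed text is shorter than the original, so the offsets of the second and third tokens no longer line up with the corresponding words in the original text.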

Looking at the search results, both scenarios appear to work from a functional point of view. The first scenario, however, is too slow because of the overhead of calling the external COM object for each token.

The second scenario is much faster and also gives correct search results. However, it causes problems with highlighting: sometimes errors are reported (String out of Range), and in other cases I get incorrect highlight fragments. Without knowing all the details, this makes sense to me, because the text is changed before it is tokenized. Maybe my second scenario does not make sense at all?

Any ideas on how to overcome this, or any other suggestions on how to realise this?

Thanks, bye,
Jaco.

PS: I posted this message twice before but it didn't come through (spam filtering?), so this is the 2nd try with the text changed a bit.