How are you creating the tokens? What are you setting for the offsets
and the positions?
One thing that is helpful is Solr's built-in Analysis tool, available
via the Admin interface (http://localhost:8983/solr/admin/). From
there, you can turn on verbose output and see what the positions and
offsets are at every stage of your Analyzer.
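
As a rough illustration of the offsets question: the highlighter uses
each token's start/end offsets to slice the original stored text, so
the offsets need to refer to the original text even when the term
itself has been stemmed. The sketch below is plain Java, not the
Lucene/Solr API. The class and method names (BatchStemOffsets, Tok,
stemAll) are made up for illustration, the stemmer is a stand-in
function, and it assumes the external stemmer returns exactly one
whitespace-separated stem per input token. It records offsets from the
original text first, calls the stemmer once for all tokens, then pairs
each stem with the offsets of the token it came from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class BatchStemOffsets {

    /** A term plus the character span it came from in the ORIGINAL text. */
    public static final class Tok {
        public final String term;
        public final int start;    // inclusive offset into the original text
        public final int end;      // exclusive offset into the original text
        public final int position; // 0-based token position

        public Tok(String term, int start, int end, int position) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    /** Split on whitespace, recording offsets into the original text. */
    public static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                out.add(new Tok(text.substring(start, i), start, i, out.size()));
            }
        }
        return out;
    }

    /**
     * Stem every token with ONE call to the (hypothetical) batch stemmer,
     * keeping the offsets and positions of the original tokens.
     */
    public static List<Tok> stemAll(String text, UnaryOperator<String> batchStemmer) {
        List<Tok> tokens = tokenize(text);
        if (tokens.isEmpty()) return tokens;

        // Join the tokens with single spaces so the stemmer sees one string.
        StringBuilder joined = new StringBuilder();
        for (Tok t : tokens) {
            if (joined.length() > 0) joined.append(' ');
            joined.append(t.term);
        }

        String[] stems = batchStemmer.apply(joined.toString()).trim().split("\\s+");
        // Assumption: one stem per input token; fail loudly if that does not hold.
        if (stems.length != tokens.size()) {
            throw new IllegalStateException("stemmer changed the token count");
        }

        List<Tok> out = new ArrayList<>();
        for (int k = 0; k < stems.length; k++) {
            Tok t = tokens.get(k);
            out.add(new Tok(stems[k], t.start, t.end, t.position));
        }
        return out;
    }

    public static void main(String[] args) {
        String original = "mice  were running";
        // Stand-in for the external COM stemmer; the outputs are invented.
        UnaryOperator<String> fakeStemmer = s -> s.replace("mice", "mouse")
                .replace("were", "be")
                .replace("running", "run");
        for (Tok t : stemAll(original, fakeStemmer)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ") -> "
                    + original.substring(t.start, t.end));
        }
    }
}
```

In a real Tokenizer the same idea would mean setting each Token's start
and end offsets from the original input rather than from the stemmed
string. If the stemmer can merge or split words, the one-to-one pairing
above no longer holds and a different alignment would be needed.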
-Grant
On Sep 26, 2008, at 3:10 AM, Jaco wrote:
Hello,
I need to work with an external stemmer in Solr. This stemmer is
accessible as a COM object (running Solr in Tomcat on a Windows
platform). I managed to integrate this using the com4j library. I
tested two scenarios:
1. Create a custom FilterFactory and Filter class for this. The
   external stemmer is then invoked for every token.
2. Create a custom TokenizerFactory (extending BaseTokenizerFactory)
   that invokes the external stemmer for the entire search text, then
   puts the result of this into a StringReader, and finally returns
   new WhitespaceTokenizer(stringReader), so the stemmed text gets
   tokenized by the whitespace tokenizer.
Looking at search results, both scenarios appear to work from a
functional point of view. The first scenario, however, is too slow
because of the overhead of calling the external COM object for each
token. The second scenario is much faster, and also gives correct
search results.
However, the second scenario then gives problems with highlighting:
sometimes errors are reported (String out of Range), and in other
cases I get incorrect highlight fragments. Without knowing all the
details about this stuff, this makes sense, because the text to be
processed is changed before it is tokenized. Maybe my second scenario
does not make sense at all?
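
The mismatch described above can be reproduced outside Solr. The
sketch below is a minimal illustration in plain Java, not Solr's
highlighter: the class OffsetMismatchDemo and its methods are
hypothetical, and the "stemmed" string is an invented example of what
an external stemmer might return. It computes token offsets from the
stemmed text, as the whitespace tokenizer would in the second
scenario, and then uses them to slice the original text, which is
roughly what highlighting does with the stored field value.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetMismatchDemo {

    /** Returns {start, end} (end exclusive) for each whitespace-separated token. */
    public static int[][] whitespaceOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) spans.add(new int[] {start, i});
        }
        return spans.toArray(new int[0][]);
    }

    /** Cuts a fragment out of the stored text using a token's offsets. */
    public static String slice(String stored, int[] span) {
        return stored.substring(span[0], span[1]);
    }

    public static void main(String[] args) {
        String original = "mice ran";   // what is stored and shown to the user
        String stemmed = "mouse run";   // invented stemmer output, one char longer

        // Offsets are relative to the STEMMED text: [0,5) and [6,9).
        int[][] spans = whitespaceOffsets(stemmed);

        // First token: the slice runs past the end of "mice" into the space.
        System.out.println("[" + slice(original, spans[0]) + "]");

        // Second token: end offset 9 exceeds the original length of 8.
        try {
            slice(original, spans[1]);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```

The first slice yields a wrong fragment and the second throws, which
matches the two symptoms: incorrect highlight fragments in some cases
and an out-of-range error in others, depending on how the stemmed text
length differs from the original.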
Any ideas on how to overcome this, or any other suggestions on how to
realise this?
Thanks, bye,
Jaco.
PS: I posted this message twice before, but it didn't come through
(spam filtering?), so this is the 2nd try, with the text changed a bit.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ