Re: Integrating external stemmer in Solr and pre-processing text

Jaco Fri, 26 Sep 2008 06:41:26 -0700

Hi,

Here's some of the code of my Tokenizer:


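import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.apache.solr.common.SolrException;

public class MyTokenizerFactory extends BaseTokenizerFactory
{
    public WhitespaceTokenizer create(Reader input)
    {
        String text, normalizedText;

        try {
            // Read the whole field value and stem it in a single external call.
            text = IOUtils.toString(input);
            normalizedText = *invoke my stemmer(text)*;
        }
        catch( IOException ex ) {
            throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, ex );
        }

        // The tokenizer only ever sees the stemmed text, so its offsets
        // refer to normalizedText, not to the original input.
        StringReader stringReader = new StringReader(normalizedText);

        return new WhitespaceTokenizer(stringReader);
    }
}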

I see what's going on in the analysis tool now, and I think I understand the
problem. Take, for instance, the text: abcdxxx defgxxx. Let's assume the
stemmer gets rid of xxx.

I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7
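
To make the mismatch concrete, here is a small plain-Java sketch (no Lucene
classes involved) that splits a string on whitespace and records character
offsets the way a whitespace tokenizer conceptually would. The class and
method names (OffsetDemo, Tok, whitespaceTokens) are made up for this
illustration, and the offsets are zero-based with an exclusive end, which
may differ from how the analysis tool displays them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not part of Solr or Lucene.
final class OffsetDemo {

    /** A term plus its character offsets (end exclusive) in the text it was cut from. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records where each term sits in the given text. */
    static List<Tok> whitespaceTokens(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any whitespace before the next term.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the term itself.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(text.substring(start, i), start, i));
            }
        }
        return out;
    }
}
```

Calling whitespaceTokens("abcd defg") places defg at 5..9, while in the
original "abcdxxx defgxxx" the word defgxxx sits at 8..15. Applying 5..9 to
the original text would select "xx d", which is the kind of wrong fragment a
highlighter would produce.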

These start/end offsets are not in line with the original source text, which
must be why the highlighting goes wrong. I guess my little trick was a bit
too simple: it messes up the offsets because the text that gets tokenized is
different from the original source text.
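
One possible direction, sketched in plain Java under stated assumptions:
tokenize the original text first and remember each token's offsets, send all
terms to the stemmer in a single call, then pair the stemmed terms back with
the original offsets. This assumes the stemmer returns exactly one
whitespace-separated term per input term, in the same order, which may not
hold for a real stemmer. The names BatchStemSketch, StemmedToken and analyze
are hypothetical, and the stemmer is passed in as a plain function standing
in for the COM call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not Solr/Lucene API: one stemmer call, original offsets kept.
final class BatchStemSketch {

    /** A stemmed term carrying the offsets (end exclusive) of the ORIGINAL word. */
    static final class StemmedToken {
        final String term;
        final int start;
        final int end;

        StemmedToken(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    static List<StemmedToken> analyze(String original, UnaryOperator<String> stemmer) {
        List<int[]> spans = new ArrayList<>();     // {start, end} of each original word
        StringBuilder joined = new StringBuilder(); // words joined by single spaces
        int i = 0;
        int n = original.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(original.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(original.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new int[] {start, i});
                if (joined.length() > 0) {
                    joined.append(' ');
                }
                joined.append(original, start, i);
            }
        }

        List<StemmedToken> out = new ArrayList<>();
        if (spans.isEmpty()) {
            return out;
        }

        // Single call to the (assumed) external stemmer for the whole text.
        String[] stems = stemmer.apply(joined.toString()).trim().split("\\s+");

        // Assumption: one stem per input word; fail loudly if that does not hold.
        if (stems.length != spans.size()) {
            throw new IllegalStateException(
                "stemmer returned " + stems.length + " terms for " + spans.size() + " words");
        }
        for (int k = 0; k < stems.length; k++) {
            out.add(new StemmedToken(stems[k], spans.get(k)[0], spans.get(k)[1]));
        }
        return out;
    }
}
```

With a stand-in stemmer such as s -> s.replace("xxx", ""), the input
"abcdxxx defgxxx" yields the terms abcd and defg with offsets 0..7 and 8..15,
so the offsets still point into the original text. How to wire something like
this into an actual Tokenizer or TokenFilter is left open here.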

Any suggestions would be very welcome...

Cheers,

Jaco.


2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

> How are you creating the tokens?  What are you setting for the offsets and
> the positions?
>
> One thing that is helpful is Solr's built-in Analysis tool via the Admin
> interface (http://localhost:8983/solr/admin/). From there, you can plug in
> verbose mode and see what the positions and offsets are for every piece of
> your Analyzer.
>
> -Grant
>
>
> On Sep 26, 2008, at 3:10 AM, Jaco wrote:
>
>> Hello,
>>
>> I need to work with an external stemmer in Solr. This stemmer is accessible
>> as a COM object (running Solr in Tomcat on a Windows platform). I managed to
>> integrate this using the com4j library. I tested two scenarios:
>> 1. Create a custom FilterFactory and Filter class for this. The external
>> stemmer is then invoked for every token.
>> 2. Create a custom TokenizerFactory (extending BaseTokenizerFactory) that
>> invokes the external stemmer for the entire search text, then puts the
>> result into a StringReader, and finally returns new
>> WhitespaceTokenizer(stringReader), so the stemmed text gets tokenized by
>> the whitespace tokenizer.
>>
>> Looking at search results, both scenarios appear to work from a functional
>> point of view. The first scenario, however, is too slow because of the
>> overhead of calling the external COM object for each token.
>>
>> The second scenario is much faster, and also gives correct search results.
>> However, it causes problems with highlighting: sometimes errors are
>> reported (String out of Range), and in other cases I get incorrect highlight
>> fragments. Without knowing all the details, this makes sense because the
>> text is changed before it is tokenized. Maybe my second scenario does not
>> make sense at all..?
>>
>> Any ideas on how to overcome this, or any other suggestions on how to
>> realise this?
>>
>> Thanks, bye,
>>
>> Jaco.
>>
>> PS I posted this message twice before but it didn't come through (spam
>> filtering..??), so this is the 2nd try with text changed a bit
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
