Thanks for these suggestions, I will try them in the coming days and post my findings in this thread.
Bye, Jaco.

2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

>
> On Sep 26, 2008, at 12:05 PM, Jaco wrote:
>
>> Hi Grant,
>>
>> In reply to your questions:
>>
>> 1. Are you having to restart/initialize the stemmer every time for your
>> "slow" approach? Does that really need to happen?
>>
>> It is invoking a COM object in Windows. The object is instantiated once
>> for a token stream, and then invoked once for each token. Each invocation
>> has some overhead; not much to do about that (sigh...)
>>
>> 2. Can the stemmer return something other than a String? Say a String
>> array of all the stemmed words? Or maybe even some type of object that
>> tells you the original word and the stemmed word?
>>
>> The stemmer can only return a String. But I do know that the returned
>> string always has exactly the same number of words as the input string.
>> So logically, it would be possible to:
>> a) first calculate the position/start/end of each token in the input
>>    string (usual tokenization by whitespace), resulting in token list 1
>> b) then invoke the stemmer, and tokenize its result by whitespace,
>>    resulting in token list 2
>> c) 'merge' the token values of token list 2 into token list 1, which is
>>    possible because each token's position is the same in both lists
>> d) return that 'merged' token list for further processing
>>
>> Would this work in Solr?
>>
>
> I think so, assuming your stemmer tokenizes on whitespace as well.
>
>
>> I can do some Java coding to achieve that from a logical point of view,
>> but I wouldn't know how to structure this flow in MyTokenizerFactory, so
>> some hints on achieving that would be great!
>
>
> One thought:
> Don't create an all-in-one Tokenizer. Instead, keep the Whitespace
> Tokenizer as is. Then create a TokenFilter that buffers the whole document
> into memory (via the next() implementation) and also builds, using
> StringBuilder, a string containing the whole text.
> Once you've read it all in, send the string to your stemmer, parse the
> result back out, and associate it back to your token buffer. If you are
> guaranteed position, you could even keep a (linked) hash, such that it is
> really quick to look up tokens after stemming.
>
> Pseudocode looks something like:
>
>   while (token.next != null)
>     tokenMap.put(token.position, token)
>     stringBuilder.append(' ').append(token.text)
>
>   stemmedText = comObj.stem(stringBuilder.toString())
>   correlateStemmedText(stemmedText, tokenMap)
>
>   spit out the tokens one by one...
>
> I think this approach should be fast (though maybe not as fast as your
> all-in-one tokenizer) and will provide the correct positions and offsets.
> You do have to be careful with really big documents, as that map can get
> big. You also want to be careful about map reuse, token reuse, etc.
>
> I believe there are a couple of buffering TokenFilters in Solr that you
> could examine for inspiration. I think the RemoveDuplicatesTokenFilter
> (or whatever it's called) does buffering.
>
> -Grant
>
>
>> Thanks for helping out!
>>
>> Jaco.
>>
>>
>> 2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
>>
>>> On Sep 26, 2008, at 9:40 AM, Jaco wrote:
>>>
>>>> Hi,
>>>>
>>>> Here's some of the code of my Tokenizer:
>>>>
>>>>   public class MyTokenizerFactory extends BaseTokenizerFactory
>>>>   {
>>>>       public WhitespaceTokenizer create(Reader input)
>>>>       {
>>>>           String text, normalizedText;
>>>>
>>>>           try {
>>>>               text = IOUtils.toString(input);
>>>>               normalizedText = *invoke my stemmer(text)*;
>>>>           }
>>>>           catch (IOException ex) {
>>>>               throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
>>>>           }
>>>>
>>>>           StringReader stringReader = new StringReader(normalizedText);
>>>>
>>>>           return new WhitespaceTokenizer(stringReader);
>>>>       }
>>>>   }
>>>>
>>>> I see what's going on in the analysis tool now, and I think I
>>>> understand the problem. For instance, take the text: abcdxxx defgxxx.
>>>> Let's assume the stemmer gets rid of xxx.
>>>>
>>>> I would then see this in the analysis tool after the tokenizer stage:
>>>> - abcd - term position 1; start: 1; end: 3
>>>> - defg - term position 2; start: 4; end: 7
>>>>
>>>> These positions are not in line with the initial search text - this
>>>> must be why the highlighting goes wrong. I guess my little trick was a
>>>> bit too simple: it messes up the positions because something different
>>>> from the original source text is being tokenized.
>>>>
>>>
>>> Yes, this is exactly the problem. I don't know enough about com4j or
>>> your stemmer, but some things come to mind:
>>>
>>> 1. Are you having to restart/initialize the stemmer every time for your
>>> "slow" approach? Does that really need to happen?
>>> 2. Can the stemmer return something other than a String? Say a String
>>> array of all the stemmed words? Or maybe even some type of object that
>>> tells you the original word and the stemmed word?
>>>
>>> -Grant
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
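The offset mismatch Jaco observes in the analysis tool can be reproduced with a small standalone sketch: whitespace-tokenizing the *stemmed* text reports offsets into the normalized string, not the original one, which is what throws highlighting off. The `stem` helper below is a hypothetical stand-in for the COM stemmer that simply drops a trailing "xxx", mimicking the "abcdxxx defgxxx" example from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetDemo {

    // Hypothetical stand-in for the COM stemmer: strips a trailing "xxx"
    // from each word, as in the thread's example.
    static String stem(String text) {
        return text.replaceAll("xxx\\b", "");
    }

    // Reports "term:start-end" for each whitespace-separated token,
    // with offsets measured against the string that was tokenized.
    public static List<String> offsets(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            out.add(m.group() + ":" + m.start() + "-" + m.end());
        }
        return out;
    }
}
```

Tokenizing `stem("abcdxxx defgxxx")` yields offsets 0-4 and 5-9, while the original words sit at 0-7 and 8-15: the highlighter is handed offsets that point into a string the user never sees.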
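Jaco's merge idea (steps a-d above, which is essentially Grant's correlate pseudocode) can be sketched in plain Java, independent of Lucene/Solr. It leans on the stemmer's guarantee that input and output contain the same number of whitespace-separated words, keeps one stemmer invocation per document, and emits stemmed terms with the *original* text's offsets. The `stem` method is again a hypothetical stand-in for the real COM call.

```java
import java.util.ArrayList;
import java.util.List;

public class StemMerge {

    public static final class Token {
        public final String term;
        public final int start; // offset into the ORIGINAL text
        public final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical stand-in for the COM stemmer (drops a trailing "xxx").
    static String stem(String text) {
        return text.replaceAll("xxx\\b", "");
    }

    public static List<Token> analyze(String original) {
        // a) token list 1: start/end offsets computed against the original text
        List<int[]> offsets = new ArrayList<>();
        int i = 0, n = original.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(original.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(original.charAt(i))) i++;
            if (i > start) offsets.add(new int[] { start, i });
        }

        // b) one stemmer invocation for the whole string, then split on whitespace
        String[] stemmed = stem(original).trim().split("\\s+");
        if (stemmed.length != offsets.size()) {
            // the whole scheme rests on the word-count guarantee; fail loudly
            throw new IllegalStateException("stemmer changed the word count");
        }

        // c) + d) merge: stemmed term paired with the original token's offsets
        List<Token> out = new ArrayList<>();
        for (int k = 0; k < offsets.size(); k++) {
            out.add(new Token(stemmed[k], offsets.get(k)[0], offsets.get(k)[1]));
        }
        return out;
    }
}
```

For "abcdxxx defgxxx" this produces the terms "abcd" (offsets 0-7) and "defg" (offsets 8-15), so highlighting covers the original words. In a real TokenFilter the buffered tokens would play the role of the offsets list, as in Grant's `tokenMap` pseudocode, and you would reuse the buffers across documents rather than reallocating them.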