Thanks for these suggestions, I will try them in the coming days and post my findings in this thread.
Bye, Jaco.

2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>

>
> On Sep 26, 2008, at 12:05 PM, Jaco wrote:
>
>> Hi Grant,
>>
>> In reply to your questions:
>>
>> 1. Are you having to restart/initialize the stemmer every time for your
>> "slow" approach? Does that really need to happen?
>>
>> It is invoking a COM object in Windows. The object is instantiated once
>> for a token stream, and then invoked once for each token. Each invocation
>> has some overhead; not much to do about that (sigh...)
>>
>> 2. Can the stemmer return something other than a String? Say a String
>> array of all the stemmed words? Or maybe even some type of object that
>> tells you the original word and the stemmed word?
>>
>> The stemmer can only return a String. But I do know that the returned
>> string always has exactly the same number of words as the input string.
>> So logically, it would be possible to:
>> a) first calculate the position/start/end of each token in the input
>>    string (usual tokenization by whitespace), resulting in token list 1
>> b) then invoke the stemmer, and tokenize its result by whitespace,
>>    resulting in token list 2
>> c) 'merge' the token values of token list 2 into token list 1, which is
>>    possible because each token's position is the same in both lists
>> d) return that 'merged' token list for further processing
>>
>> Would this work in Solr?
>>
>
> I think so, assuming your stemmer tokenizes on whitespace as well.
>
>
>> I can do some Java coding to achieve that from a logical point of view,
>> but I wouldn't know how to structure this flow in MyTokenizerFactory, so
>> some hints on achieving that would be great!
>
>
> One thought:
> Don't create an all-in-one Tokenizer. Instead, keep the Whitespace
> Tokenizer as is. Then create a TokenFilter that buffers the whole document
> into memory (via the next() implementation) and also builds, using
> StringBuilder, a string containing the whole text.
> Once you've read it all in, send the string to your stemmer, parse the
> result back out, and associate it back to your token buffer. If you are
> guaranteed position, you could even keep a (linked) hash, such that it is
> really quick to look up tokens after stemming.
>
> Pseudocode looks something like:
>
>   while (token.next != null)
>     tokenMap.put(token.position, token)
>     stringBuilder.append(' ').append(token.text)
>
>   stemmedText = comObj.stem(stringBuilder.toString())
>   correlateStemmedText(stemmedText, tokenMap)
>
>   spit out the tokens one by one...
>
> I think this approach should be fast (though maybe not as fast as your
> all-in-one tokenizer) and will provide the correct positions and offsets.
> You do have to be careful with really big documents, as that map can get
> big. You also want to be careful about map reuse, token reuse, etc.
>
> I believe there are a couple of buffering TokenFilters in Solr that you
> could examine for inspiration. I think the RemoveDuplicatesTokenFilter
> (or whatever it's called) does buffering.
>
> -Grant
>
>
>> Thanks for helping out!
>>
>> Jaco.
>>
>>
>> 2008/9/26 Grant Ingersoll <[EMAIL PROTECTED]>
>>
>>> On Sep 26, 2008, at 9:40 AM, Jaco wrote:
>>>
>>>> Hi,
>>>>
>>>> Here's some of the code of my Tokenizer:
>>>>
>>>>   public class MyTokenizerFactory extends BaseTokenizerFactory
>>>>   {
>>>>       public WhitespaceTokenizer create(Reader input)
>>>>       {
>>>>           String text, normalizedText;
>>>>
>>>>           try {
>>>>               text = IOUtils.toString(input);
>>>>               normalizedText = *invoke my stemmer(text)*;
>>>>           }
>>>>           catch (IOException ex) {
>>>>               throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
>>>>           }
>>>>
>>>>           StringReader stringReader = new StringReader(normalizedText);
>>>>
>>>>           return new WhitespaceTokenizer(stringReader);
>>>>       }
>>>>   }
>>>>
>>>> I see what's going on in the analysis tool now, and I think I
>>>> understand the problem. For instance, take the text: abcdxxx defgxxx.
>>>> Let's assume the stemmer gets rid of xxx.
>>>>
>>>> I would then see this in the analysis tool after the tokenizer stage:
>>>> - abcd - term position 1; start: 1; end: 3
>>>> - defg - term position 2; start: 4; end: 7
>>>>
>>>> These positions are not in line with the initial search text - this
>>>> must be why the highlighting goes wrong. I guess my little trick was a
>>>> bit too simple: it messes up the positions because something different
>>>> from the original source text is being tokenized.
>>>>
>>>
>>> Yes, this is exactly the problem. I don't know enough about com4j or
>>> your stemmer, but some things come to mind:
>>>
>>> 1. Are you having to restart/initialize the stemmer every time for your
>>> "slow" approach? Does that really need to happen?
>>> 2. Can the stemmer return something other than a String? Say a String
>>> array of all the stemmed words? Or maybe even some type of object that
>>> tells you the original word and the stemmed word?
>>>
>>> -Grant
>>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
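The offset mismatch Jaco observes in the analysis tool can be reproduced with a small standalone sketch: whitespace-tokenizing the *stemmed* text reports offsets into the normalized string, not the original one, which is what throws highlighting off. The `stem` helper below is a hypothetical stand-in for the COM stemmer that simply drops a trailing "xxx", mimicking the "abcdxxx defgxxx" example from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetDemo {

    // Hypothetical stand-in for the COM stemmer: strips a trailing "xxx"
    // from each word, as in the thread's example.
    static String stem(String text) {
        return text.replaceAll("xxx\\b", "");
    }

    // Reports "term:start-end" for each whitespace-separated token,
    // with offsets measured against the string that was tokenized.
    public static List<String> offsets(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            out.add(m.group() + ":" + m.start() + "-" + m.end());
        }
        return out;
    }
}
```

Tokenizing `stem("abcdxxx defgxxx")` yields offsets 0-4 and 5-9, while the original words sit at 0-7 and 8-15: the highlighter is handed offsets that point into a string the user never sees.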
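Jaco's merge idea (steps a-d above, which is essentially Grant's correlate pseudocode) can be sketched in plain Java, independent of Lucene/Solr. It leans on the stemmer's guarantee that input and output contain the same number of whitespace-separated words, keeps one stemmer invocation per document, and emits stemmed terms with the *original* text's offsets. The `stem` method is again a hypothetical stand-in for the real COM call.

```java
import java.util.ArrayList;
import java.util.List;

public class StemMerge {

    public static final class Token {
        public final String term;
        public final int start; // offset into the ORIGINAL text
        public final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical stand-in for the COM stemmer (drops a trailing "xxx").
    static String stem(String text) {
        return text.replaceAll("xxx\\b", "");
    }

    public static List<Token> analyze(String original) {
        // a) token list 1: start/end offsets computed against the original text
        List<int[]> offsets = new ArrayList<>();
        int i = 0, n = original.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(original.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(original.charAt(i))) i++;
            if (i > start) offsets.add(new int[] { start, i });
        }

        // b) one stemmer invocation for the whole string, then split on whitespace
        String[] stemmed = stem(original).trim().split("\\s+");
        if (stemmed.length != offsets.size()) {
            // the whole scheme rests on the word-count guarantee; fail loudly
            throw new IllegalStateException("stemmer changed the word count");
        }

        // c) + d) merge: stemmed term paired with the original token's offsets
        List<Token> out = new ArrayList<>();
        for (int k = 0; k < offsets.size(); k++) {
            out.add(new Token(stemmed[k], offsets.get(k)[0], offsets.get(k)[1]));
        }
        return out;
    }
}
```

For "abcdxxx defgxxx" this produces the terms "abcd" (offsets 0-7) and "defg" (offsets 8-15), so highlighting covers the original words. In a real TokenFilter the buffered tokens would play the role of the offsets list, as in Grant's `tokenMap` pseudocode, and you would reuse the buffers across documents rather than reallocating them.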