Re: Integrating external stemmer in Solr and pre-processing text

2008-09-30 Thread Jaco
Hi,

The suggested approach with a TokenFilter extending the BufferedTokenStream
class works fine, and performance is OK - the external stemmer is now invoked
only once for the complete search text. Also, from a functional point of
view, the approach is useful because it allows other filtering (e.g. the
WordDelimiterFilter with its various useful options) to be done before
stemming takes place.

Code is roughly like this for the process() function of the custom Filter
class:

protected Token process(Token token) throws IOException {
    StringBuilder stringBuilder = new StringBuilder();
    Token nextToken;
    int tokenPos = 0;
    Map<Integer, Token> tokenMap = new LinkedHashMap<Integer, Token>();

    stringBuilder.append(token.term()).append(' ');
    tokenMap.put(tokenPos++, token);
    nextToken = read();

    while (nextToken != null)
    {
        stringBuilder.append(nextToken.term()).append(' ');
        tokenMap.put(tokenPos++, nextToken);

        nextToken = read();
    }

    String   inputText    = stringBuilder.toString();
    String   stemmedText  = stemText(inputText);
    String[] stemmedWords = stemmedText.split("\\s");

    for (Map.Entry<Integer, Token> entry : tokenMap.entrySet())
    {
        Integer pos = entry.getKey();
        Token   tok = entry.getValue();

        tok.setTermBuffer(stemmedWords[pos]);
        write(tok);
    }

    return null;
}

This will need some more work and additional error checking, and I'll probably
put a maximum on the number of tokens to be processed in one go to
make sure things don't get too big in memory.
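A cap like that could work by flushing the buffer in bounded batches, so the external stemmer is still called only once per batch rather than once per token. Here is a minimal, self-contained sketch of the batching idea; the method name and the stand-in stemmer are illustrative, not Solr API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class BatchedStemming {
    // Stems tokens in batches of at most maxBatch words per external call,
    // bounding memory while still amortizing the per-call overhead.
    static List<String> stemInBatches(List<String> tokens, int maxBatch,
                                      UnaryOperator<String> stemText) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i += maxBatch) {
            List<String> batch = tokens.subList(i, Math.min(i + maxBatch, tokens.size()));
            // One external call per batch; the stemmer is assumed to return
            // exactly one word per input word, as stated in this thread.
            String stemmed = stemText.apply(String.join(" ", batch));
            for (String word : stemmed.split("\\s+")) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in stemmer: strips a trailing "xxx", like the example in this thread.
        UnaryOperator<String> fake = s -> s.replaceAll("xxx\\b", "");
        List<String> result = stemInBatches(List.of("abcdxxx", "defgxxx", "hijk"), 2, fake);
        System.out.println(result); // [abcd, defg, hijk]
    }
}
```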

Thanks for helping out!

Bye,

Jaco.




Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Thanks for these suggestions, will try it in the coming days and post my
findings in this thread.

Bye,

Jaco.


Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll


On Sep 26, 2008, at 12:05 PM, Jaco wrote:


Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your
"slow" approach?  Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once for
a token stream, and then invoked once for each token. The invoke always has
an overhead, not much to do about that (sigh...)

2. Can the stemmer return something other than a String?  Say a String array
of all the stemmed words?  Or maybe even some type of object that tells you
the original word and the stemmed word?

The stemmer can only return a String. But, I do know that the returned
string always has exactly the same number of words as the input string. So
logically, it would be possible to:
a) first calculate the position/start/end of each token in the input string
(usual tokenization by Whitespace), resulting in token list 1
b) then invoke the stemmer, and tokenize that result by Whitespace,
resulting in token list 2
c) 'merge' the token values of token list 2 into token list 1, which is
possible because each token's position is the same in both lists...
d) return that 'merged' token list 2 for further processing

Would this work in Solr?

I think so, assuming your stemmer tokenizes on whitespace as well.

I can do some Java coding to achieve that from a logical point of view, but I
wouldn't know how to structure this flow into the MyTokenizerFactory, so
some hints to achieve that would be great!

One thought:
Don't create an all-in-one Tokenizer.  Instead, keep the Whitespace
Tokenizer as is.  Then, create a TokenFilter that buffers the whole document
into memory (via the next() implementation) and also creates, using
StringBuilder, a string containing the whole text.  Once you've read it all
in, then send the string to your stemmer, parse it back out and associate it
back to your token buffer.  If you are guaranteed position, you could even
keep a (linked) hash, such that it is really quick to look up tokens after
stemming.

Pseudocode looks something like:

while (token.next != null)
   tokenMap.put(token.position, token)
   stringBuilder.append(' ').append(token.text)

stemmedText = comObj.stem(stringBuilder.toString())
correlateStemmedText(stemmedText, tokenMap)

spit out the tokens one by one...

I think this approach should be fast (but maybe not as fast as your
all-in-one tokenizer) and will provide the correct position and offsets.
You do have to be careful w/ really big documents, as that map can be big.
You also want to be careful about map reuse, token reuse, etc.

I believe there are a couple of buffering TokenFilters in Solr that you
could examine for inspiration.  I think the RemoveDuplicatesTokenFilter (or
whatever it's called) does buffering.


-Grant








Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
The overhead is not in the instantiation, but in the actual call to the COM
object. The approach with one-time instantiation in the TokenFilterFactory,
and use of that object in the TokenFilter, is exactly what I tried. There
is a factor of 10 performance gain when doing a single call
instead of going token-by-token (I also tried this in another environment
(Perl), which gave the same result).

So I guess I'll need to do this with the other approach I suggested.

Bye,

Jaco.



Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Chris Hostetter

: It is invoking a COM object in Windows. The object is instantiated once for
: a token stream, and then invoked once for each token. The invoke always has
: an overhead, not much to do about that (sigh...)

I also know nothing about COM, but based on your comments it sounds like
instantiating your COM object is expensive ... so why do it for every
token?  Why not have a TokenFilter where the COM object is constructed
when the TokenFilter is constructed, so that the same object will be
invoked for each token in the stream for a given field value.

Or better still: if your COM object is threadsafe, construct one in the 
init method for your TokenFilterFactory and reuse it in every TokenFilter 
instance.
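The construct-once pattern suggested here can be illustrated with plain-Java stand-ins. The Stemmer and Filter classes below are simplified mock-ups, not the real Solr TokenFilterFactory API; only the lifecycle (one expensive object built in init(), shared by every filter instance) mirrors the suggestion:

```java
import java.util.Map;

public class SharedStemmerFactory {
    // Stand-in for the COM-backed stemmer; imagine an expensive
    // instantiation happening in this constructor.
    static class Stemmer {
        String stem(String text) { return text.replaceAll("xxx\\b", ""); }
    }

    // Stand-in for a TokenFilter that delegates to the shared stemmer.
    static class Filter {
        final Stemmer stemmer;
        Filter(Stemmer stemmer) { this.stemmer = stemmer; }
        String process(String text) { return stemmer.stem(text); }
    }

    private Stemmer stemmer;

    // Mirrors TokenFilterFactory.init(Map): construct the stemmer once.
    void init(Map<String, String> args) { stemmer = new Stemmer(); }

    // Mirrors create(TokenStream): every filter reuses the same stemmer.
    Filter create() { return new Filter(stemmer); }

    public static void main(String[] args) {
        SharedStemmerFactory factory = new SharedStemmerFactory();
        factory.init(Map.of());
        Filter a = factory.create();
        Filter b = factory.create();
        System.out.println(a.process("abcdxxx") + " " + b.process("defgxxx")); // abcd defg
        System.out.println(a.stemmer == b.stemmer); // true: one shared instance
    }
}
```

As Hoss notes, this sharing is only safe if the underlying COM object is threadsafe; otherwise the stemmer must be built per TokenFilter instead.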



-Hoss



Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Hi Grant,

In reply to your questions:

1. Are you having to restart/initialize the stemmer every time for your
"slow" approach?  Does that really need to happen?

It is invoking a COM object in Windows. The object is instantiated once for
a token stream, and then invoked once for each token. The invoke always has
an overhead, not much to do about that (sigh...)

2. Can the stemmer return something other than a String?  Say a String array
of all the stemmed words?  Or maybe even some type of object that tells you
the original word and the stemmed word?

The stemmer can only return a String. But, I do know that the returned
string always has exactly the same number of words as the input string. So
logically, it would be possible to:
a) first calculate the position/start/end of each token in the input string
(usual tokenization by Whitespace), resulting in token list 1
b) then invoke the stemmer, and tokenize that result by Whitespace,
resulting in token list 2
c) 'merge' the token values of token list 2 into token list 1, which is
possible because each token's position is the same in both lists...
d) return that 'merged' token list 2 for further processing

Would this work in Solr?
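The a) through d) steps above could be sketched like this. It is self-contained, with a hand-rolled whitespace tokenizer standing in for Lucene's, and Tok/merge are illustrative names, not Solr API; the merge relies on the stemmer's guaranteed one-word-in, one-word-out behaviour:

```java
import java.util.ArrayList;
import java.util.List;

public class StemMerge {
    // A token carries its term plus offsets into the ORIGINAL text.
    record Tok(String term, int start, int end) {}

    // Step a): whitespace-tokenize the original text, recording
    // each token's start/end offsets (token list 1).
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) toks.add(new Tok(text.substring(start, i), start, i));
        }
        return toks;
    }

    // Steps b)-d): split the stemmed text (token list 2) and merge its
    // terms into list 1's offsets, pairing words by position.
    static List<Tok> merge(List<Tok> originals, String stemmedText) {
        String[] stemmedWords = stemmedText.split("\\s+");
        if (stemmedWords.length != originals.size()) {
            throw new IllegalStateException("stemmer changed the word count");
        }
        List<Tok> merged = new ArrayList<>();
        for (int i = 0; i < originals.size(); i++) {
            Tok o = originals.get(i);
            merged.add(new Tok(stemmedWords[i], o.start(), o.end()));
        }
        return merged;
    }

    public static void main(String[] args) {
        // "abcdxxx defgxxx" stems to "abcd defg"; offsets stay original.
        List<Tok> merged = merge(tokenize("abcdxxx defgxxx"), "abcd defg");
        merged.forEach(t -> System.out.println(t.term() + " " + t.start() + "-" + t.end()));
        // abcd 0-7
        // defg 8-15
    }
}
```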

I can do some Java coding to achieve that from a logical point of view, but I
wouldn't know how to structure this flow into the MyTokenizerFactory, so
some hints to achieve that would be great!

Thanks for helping out!

Jaco.




Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll


On Sep 26, 2008, at 9:40 AM, Jaco wrote:


Hi,

Here's some of the code of my Tokenizer:

public class MyTokenizerFactory extends BaseTokenizerFactory
{
    public WhitespaceTokenizer create(Reader input)
    {
        String text, normalizedText;

        try {
            text           = IOUtils.toString(input);
            normalizedText = *invoke my stemmer(text)*;
        }
        catch (IOException ex) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
        }

        StringReader stringReader = new StringReader(normalizedText);

        return new WhitespaceTokenizer(stringReader);
    }
}

I see what's going on in the analysis tool now, and I think I understand the
problem. For instance, take the text: abcdxxx defgxxx. Let's assume the
stemmer gets rid of xxx.

I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end: 3
- defg - term position 2; start: 4; end: 7

These positions are not in line with the initial search text - this must be
why the highlighting goes wrong. I guess my little trick was a bit too
simple: it messes up the positions because something different from the
original source text is tokenized.

Yes, this is exactly the problem.  I don't know enough about com4J or your
stemmer, but some things come to mind:

1. Are you having to restart/initialize the stemmer every time for your
"slow" approach?  Does that really need to happen?
2. Can the stemmer return something other than a String?  Say a String
array of all the stemmed words?  Or maybe even some type of object that
tells you the original word and the stemmed word?


-Grant


Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Jaco
Hi,

Here's some of the code of my Tokenizer:

public class MyTokenizerFactory extends BaseTokenizerFactory
{
    public WhitespaceTokenizer create(Reader input)
    {
        String text, normalizedText;

        try {
            text           = IOUtils.toString(input);
            normalizedText = *invoke my stemmer(text)*;
        }
        catch (IOException ex) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, ex);
        }

        StringReader stringReader = new StringReader(normalizedText);

        return new WhitespaceTokenizer(stringReader);
    }
}

I see what's going on in the analysis tool now, and I think I understand the
problem. For instance, take the text: abcdxxx defgxxx. Let's assume the
stemmer gets rid of xxx.

I would then see this in the analysis tool after the tokenizer stage:
- abcd - term position 1; start: 1; end:  3
- defg - term position 2; start: 4; end: 7

These positions are not in line with the initial search text - this must be
why the highlighting goes wrong. I guess my little trick was a bit too
simple: it messes up the positions because something different from the
original source text is tokenized.
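The offset drift can be shown concretely. This small stand-alone snippet (a hypothetical helper, not Solr code) computes a token's offsets in the stemmed string versus the original text; the highlighter applies the former to the latter, which is exactly where the fragments go wrong:

```java
public class OffsetDrift {
    // Returns {start, end} offsets of the n-th space-separated token.
    static int[] offsetsOf(String text, int n) {
        int i = 0, seen = -1;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;
            if (i > start && ++seen == n) return new int[] {start, i};
        }
        throw new IllegalArgumentException("no token " + n);
    }

    public static void main(String[] args) {
        String original = "abcdxxx defgxxx";
        String stemmed  = "abcd defg";           // the stemmer dropped "xxx"
        // Tokenizing the STEMMED text yields offsets into the stemmed string...
        int[] got      = offsetsOf(stemmed, 1);
        // ...but highlighting applies them to the ORIGINAL, where defgxxx sits elsewhere.
        int[] expected = offsetsOf(original, 1);
        System.out.println(got[0] + "," + got[1] + " vs " + expected[0] + "," + expected[1]);
        // prints 5,9 vs 8,15
    }
}
```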

Any suggestions would be very welcome...

Cheers,

Jaco.




Re: Integrating external stemmer in Solr and pre-processing text

2008-09-26 Thread Grant Ingersoll
How are you creating the tokens?  What are you setting for the offsets
and the positions?

One thing that is helpful is Solr's built-in Analysis tool via the Admin
interface (http://localhost:8983/solr/admin/).  From there, you can plug in
verbose mode, and see what the position and offsets are for every piece of
your Analyzer.


-Grant

On Sep 26, 2008, at 3:10 AM, Jaco wrote:


Hello,

I need to work with an external stemmer in Solr. This stemmer is accessible
as a COM object (running Solr in Tomcat on a Windows platform). I managed to
integrate this using the com4j library. I tested two scenarios:
1. Create a custom FilterFactory and Filter class for this. The external
stemmer is then invoked for every token.
2. Create a custom TokenizerFactory (extending BaseTokenizerFactory) that
invokes the external stemmer for the entire search text, then puts the
result of this into a StringReader, and finally returns new
WhitespaceTokenizer(stringReader), so the stemmed text gets tokenized by the
whitespace tokenizer.

Looking at search results, both scenarios appear to work from a functional
point of view. The first scenario however is too slow because of the
overhead of calling the external COM object for each token.

The second scenario is much faster, and also gives correct search results.
However, it then gives problems with highlighting - sometimes errors are
reported (String out of Range), in other cases I get incorrect highlight
fragments. Without knowing all the details about this stuff, that makes
sense because of the change done to the text before it's tokenized.  Maybe
my second scenario does not make sense at all..?

Any ideas on how to overcome this, or any other suggestions on how to
realise this?

Thanks, bye,

Jaco.

PS I posted this message twice before but it didn't come through (spam
filtering..??), so this is the 2nd try with the text changed a bit


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ