Chris Hostetter wrote:
> : > If a given Tokenizer does not need to do any character normalization
> : > (I would think most wouldn't) is there any added cost during
> : > tokenization with this change?
> :
> : Thank you for your reply, Mike!
> : There is no added cost if the Tokenizer doesn't need to call correctOffset().
>
> But every tokenizer *should* call correctOffset on the start/end offset of
> every token it produces correct?

Yes.

> My understanding is that the way we would make a change like this is...
>
> 1) change the Tokenizer class to look something like this...

(snip)

> 2) change all of the Tokenizers shipped with Lucene to use correctOffset
> when setting all start/end offsets on any Tokens.
>
> ...once those two things are done, anyone using out-of-the-box tokenizers
> can use a CharStream and get correct offsets -- anyone with an existing
> custom Tokenizer should continue to get the same behavior as before, but
> if they want to start using a CharStream they need to tweak their code.

Looks great!
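To make the two-step plan concrete, here is a minimal, self-contained sketch (illustrative names like SimpleWhitespaceTokenizer, not the actual Lucene/Solr API): a CharStream base class exposing correctOffset(), the identity CharReader wrapper, and a tokenizer that runs every start/end offset it reports through correctOffset():

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hedged sketch: simplified stand-ins, not the real Lucene classes.
// A CharStream can map an offset in the (possibly filtered) character
// stream back to an offset in the original input.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Identity mapping: wraps a plain Reader, offsets pass through unchanged.
final class CharReader extends CharStream {
    private final Reader input;
    CharReader(Reader in) { input = in; }
    @Override public int correctOffset(int currentOff) { return currentOff; }
    @Override public int read(char[] buf, int off, int len) throws IOException {
        return input.read(buf, off, len);
    }
    @Override public void close() throws IOException { input.close(); }
}

// Illustrative tokenizer: every start/end offset it produces goes
// through correctOffset(), as the thread recommends.
class SimpleWhitespaceTokenizer {
    private final CharStream in;
    private final String text;   // read eagerly to keep the sketch short
    private int pos = 0;

    SimpleWhitespaceTokenizer(CharStream in) throws IOException {
        this.in = in;
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
        text = sb.toString();
    }

    // Returns "term:start:end" with corrected offsets, or null at end of input.
    String next() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return null;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos) + ":"
            + in.correctOffset(start) + ":" + in.correctOffset(pos);
    }
}
```

With the identity CharReader the corrected offsets equal the raw ones, so existing custom tokenizers keep their old behavior; only a non-trivial CharStream changes the numbers.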

> The only potential downside i can think of is the performance cost of the
> added method calls -- but if we make NoOpCharStream.correctOffset final
> the JVM should be able to optimize away the "identity" function
> correct?

I hadn't considered JVM optimization; however, we already have the
final class "CharReader" in Solr 1.4:

public final class CharReader extends CharStream {
  protected Reader input;

  public CharReader( Reader in ){
    input = in;
  }

  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }
  :
}

and CharReader is instantiated in TokenizerChain.
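For contrast with the identity CharReader above, here is a hedged sketch of why correctOffset() matters at all: a CharStream that deletes a character from the input and remembers, for each offset in the filtered text, the corresponding offset in the original. StripCharStream is an illustrative name, not a Solr class:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch, not the Solr API: simplified CharStream base class.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Deletes every occurrence of one character and records, per output
// offset, the matching offset in the original input, so that token
// offsets (e.g. for highlighting) can be mapped back correctly.
final class StripCharStream extends CharStream {
    private final String filtered;
    private final List<Integer> origOffsets = new ArrayList<>();
    private int pos = 0;

    StripCharStream(Reader in, char strip) throws IOException {
        StringBuilder orig = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) orig.append(buf, 0, n);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < orig.length(); i++) {
            if (orig.charAt(i) != strip) {
                out.append(orig.charAt(i));
                origOffsets.add(i);        // output offset -> original offset
            }
        }
        origOffsets.add(orig.length());    // end offset maps past the input
        filtered = out.toString();
    }

    @Override
    public int correctOffset(int currentOff) {
        return origOffsets.get(Math.min(currentOff, origOffsets.size() - 1));
    }

    @Override
    public int read(char[] buf, int off, int len) {
        if (pos >= filtered.length()) return -1;
        int n = Math.min(len, filtered.length() - pos);
        filtered.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override public void close() {}
}
```

For input "ca-t" with '-' stripped, the tokenizer sees "cat" at offsets 0..3, but correctOffset() maps that end offset back to 4 in the original text.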

Koji

