Chris Hostetter wrote:
> : > If a given Tokenizer does not need to do any character
> : > normalization (I would think most wouldn't) is there any added
> : > cost during tokenization with this change?
> :
> : Thank you for your reply, Mike!
> : There is no added cost if the Tokenizer doesn't need to call
> : correctOffset().
>
> But every tokenizer *should* call correctOffset on the start/end
> offset of every token it produces, correct?
Yes.
> My understanding is that the way we would make a change like this is...
>
> 1) change the Tokenizer class to look something like this...
(snip)
> 2) change all of the Tokenizers shipped with Lucene to use correctOffset
> when setting all start/end offsets on any Tokens.
>
> ...once those two things are done, anyone using out-of-the-box
> tokenizers can use a CharStream and get correct offsets -- anyone with
> an existing custom Tokenizer should continue to get the same behavior
> as before, but if they want to start using a CharStream they need to
> tweak their code.
Looks great!
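The Tokenizer change itself was snipped from the quote above; as a rough, hypothetical sketch of how the delegation could work (the class shapes and signatures here are my assumption for illustration, not the actual patch):

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch, not the actual Lucene patch: a CharStream is a
// Reader that can map an offset in the filtered text back to an offset
// in the original text.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

abstract class Tokenizer {
    protected Reader input;

    protected Tokenizer(Reader input) {
        this.input = input;
    }

    // Every shipped tokenizer would route token start/end offsets here.
    // If the input is a plain Reader, this is the identity function, so
    // existing custom tokenizers keep their old behavior.
    protected final int correctOffset(int currentOff) {
        return (input instanceof CharStream)
                ? ((CharStream) input).correctOffset(currentOff)
                : currentOff;
    }
}

public class TokenizerSketch {
    public static void main(String[] args) {
        // With a plain Reader, offsets pass through unchanged.
        Tokenizer t = new Tokenizer(new StringReader("hello")) {};
        System.out.println(t.correctOffset(3)); // prints 3
    }
}
```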
> The only potential downside I can think of is the performance cost of
> the added method calls -- but if we make NoOpCharStream.correctOffset
> final, the JVM should be able to optimize away the "identity" function,
> correct?
I hadn't considered JVM optimization, but we already have the final
class "CharReader" in Solr 1.4:
public final class CharReader extends CharStream {

  protected Reader input;

  public CharReader( Reader in ){
    input = in;
  }

  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }
  :
}
and CharReader is instantiated in TokenizerChain.
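For illustration, here is a self-contained version of that identity wrapper (CharStream is stubbed in so the snippet compiles on its own, and the Reader delegation elided above with ":" is filled in as a plain pass-through; this is a sketch, not the Solr source):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stub for illustration; the real class lives in Solr/Lucene.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Identity wrapper: no characters are added or removed, so every
// offset maps to itself and tokenizers see the old behavior exactly.
final class CharReader extends CharStream {
    private final Reader input;

    CharReader(Reader in) {
        input = in;
    }

    @Override
    public int correctOffset(int currentOff) {
        return currentOff; // identity mapping
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return input.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        input.close();
    }
}

public class CharReaderDemo {
    public static void main(String[] args) {
        CharStream cs = new CharReader(new StringReader("hello world"));
        // A tokenizer would pass each token's start/end through here.
        System.out.println(cs.correctOffset(6)); // prints 6
    }
}
```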
Koji
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]