: > If a given Tokenizer does not need to do any character normalization (I
: would think most wouldn't) is there any added cost during tokenization with
: this change?
:
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().
But every tokenizer *should* call correctOffset on the start/end offset of
every token it produces, correct?

My understanding is that we would make a change like this:

1) change the Tokenizer class to look something like this...

  public abstract class Tokenizer extends TokenStream {
    protected CharStream input;

    protected Tokenizer() {}

    protected Tokenizer(Reader input) {
      this(new NoOpCharStream(input));
    }

    protected Tokenizer(CharStream input) {
      this.input = input;
    }

    public void close() throws IOException {
      input.close();
    }

    public void reset(Reader input) throws IOException {
      if (input instanceof CharStream) {
        this.input = (CharStream) input;
      } else {
        this.input = new NoOpCharStream(input);
      }
    }
  }

2) change all of the Tokenizers shipped with Lucene to use correctOffset
when setting all start/end offsets on any Tokens.

...once those two things are done, anyone using out-of-the-box tokenizers
can use a CharStream and get correct offsets -- anyone with an existing
custom Tokenizer should continue to get the same behavior as before, but if
they want to start using a CharStream they need to tweak their code.

The only potential downside I can think of is the performance cost of the
added method calls -- but if we make NoOpCharStream.correctOffset final,
the JVM should be able to optimize away the "identity" function, correct?

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
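[Editor's note: a minimal sketch of the NoOpCharStream idea discussed above.
CharStream here is a simplified stand-in interface defined locally for
illustration -- the real Lucene class has more methods, and the names beyond
NoOpCharStream and correctOffset are assumptions, not the actual API.]

  import java.io.IOException;
  import java.io.Reader;
  import java.io.StringReader;

  // Simplified stand-in for Lucene's CharStream: a Reader whose offsets
  // may differ from the original input's after character normalization.
  abstract class CharStream extends Reader {
      // Maps an offset in the (possibly normalized) stream back to an
      // offset in the original, un-normalized input.
      public abstract int correctOffset(int currentOff);
  }

  // Wrapper used when no normalization happens: correctOffset is the
  // identity. The class is final, so the JIT can inline the call away.
  final class NoOpCharStream extends CharStream {
      private final Reader input;

      NoOpCharStream(Reader input) { this.input = input; }

      public int correctOffset(int currentOff) { return currentOff; }

      public int read(char[] cbuf, int off, int len) throws IOException {
          return input.read(cbuf, off, len);
      }

      public void close() throws IOException { input.close(); }
  }

  public class NoOpCharStreamDemo {
      public static void main(String[] args) throws IOException {
          NoOpCharStream cs = new NoOpCharStream(new StringReader("hello"));
          char[] buf = new char[5];
          cs.read(buf, 0, 5);
          // Identity mapping: token offsets pass through unchanged.
          System.out.println(cs.correctOffset(3));
          cs.close();
      }
  }

A tokenizer written against CharStream would wrap its start/end offsets in
correctOffset(...) unconditionally; with NoOpCharStream the extra call
should cost nothing once inlined, which is the performance point made above.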