: > If a given Tokenizer does not need to do any character normalization (I
: would think most wouldn't) is there any added cost during tokenization with
: this change?
: 
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().

But every tokenizer *should* call correctOffset on the start/end offset of 
every token it produces, correct?

My understanding is that the change we would make is...

1) change the Tokenizer class to look something like this...

import java.io.IOException;
import java.io.Reader;

public abstract class Tokenizer extends TokenStream {
  protected CharStream input;

  protected Tokenizer() {}

  // Wrap a plain Reader so correctOffset is available (as an identity).
  protected Tokenizer(Reader input) {
    this(new NoOpCharStream(input));
  }

  protected Tokenizer(CharStream input) {
    this.input = input;
  }

  public void close() throws IOException {
    input.close();
  }

  public void reset(Reader input) throws IOException {
    if (input instanceof CharStream) {
      this.input = (CharStream) input;
    } else {
      this.input = new NoOpCharStream(input);
    }
  }
}
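For concreteness, NoOpCharStream would just be an identity wrapper. A minimal sketch (both the NoOpCharStream name and the CharStream base class here are assumptions taken from the snippet above, not existing Lucene classes):

```java
import java.io.IOException;
import java.io.Reader;

// Assumed base class: a Reader that can map offsets in the (possibly
// normalized) character stream back to offsets in the original input.
abstract class CharStream extends Reader {
  public abstract int correctOffset(int currentOff);
}

// Identity wrapper for plain Readers: no normalization happens,
// so offsets pass through unchanged.
final class NoOpCharStream extends CharStream {
  private final Reader in;

  NoOpCharStream(Reader in) {
    this.in = in;
  }

  public int correctOffset(int currentOff) {
    return currentOff; // identity -- nothing to correct
  }

  public int read(char[] cbuf, int off, int len) throws IOException {
    return in.read(cbuf, off, len);
  }

  public void close() throws IOException {
    in.close();
  }
}
```

With that in place, a custom Tokenizer that never calls correctOffset behaves exactly as it does today.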

2) change all of the Tokenizers shipped with Lucene to use correctOffset 
when setting all start/end offsets on any Tokens.
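By way of illustration, a toy whitespace tokenizer (not an actual Lucene class; the CharStream/NoOpCharStream stand-ins below are the same hypothetical ones from the sketch above) would run its raw offsets through correctOffset before exposing them:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Minimal stand-ins for the hypothetical classes in the sketch above.
abstract class CharStream extends Reader {
  public abstract int correctOffset(int currentOff);
}

final class NoOpCharStream extends CharStream {
  private final Reader in;
  NoOpCharStream(Reader in) { this.in = in; }
  public int correctOffset(int currentOff) { return currentOff; }
  public int read(char[] cbuf, int off, int len) throws IOException {
    return in.read(cbuf, off, len);
  }
  public void close() throws IOException { in.close(); }
}

// Toy whitespace tokenizer: tracks raw offsets into the characters it
// read, then maps them through correctOffset before handing them out.
final class ToyWhitespaceTokenizer {
  private final CharStream input;
  private final String text;
  private int pos = 0;

  ToyWhitespaceTokenizer(CharStream input) {
    this.input = input;
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[256];
    try {
      int n;
      while ((n = input.read(buf, 0, buf.length)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    this.text = sb.toString();
  }

  // Returns { correctedStart, correctedEnd } for the next token, or null.
  int[] next() {
    while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    if (pos >= text.length()) return null;
    int start = pos;
    while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
    return new int[] { input.correctOffset(start), input.correctOffset(pos) };
  }
}
```

The key point is just the last line of next(): the tokenizer computes offsets as usual and only corrects them at the moment it sets them on a token.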


...once those two things are done, anyone using out-of-the-box tokenizers 
can use a CharStream and get correct offsets -- anyone with an existing 
custom Tokenizer should continue to get the same behavior as before, but 
if they want to start using a CharStream they need to tweak their code.

The only potential downside I can think of is the performance cost of the 
added method calls -- but if we make NoOpCharStream.correctOffset final, 
the JVM should be able to optimize away the "identity" function, correct?



-Hoss

