Chris Hostetter wrote:
> : > If a given Tokenizer does not need to do any character normalization
> : > (I would think most wouldn't) is there any added cost during
> : > tokenization with this change?
> :
> : Thank you for your reply, Mike!
> : There is no added cost if the Tokenizer doesn't need to call correctOffset().
>
> But every tokenizer *should* call correctOffset on the start/end offset of
> every token it produces correct?

Yes.

> My understanding is that the way we would make a change like this is...
>
> 1) change the Tokenizer class to look something like this...

(snip)

> 2) change all of the Tokenizers shipped with Lucene to use correctOffset
> when setting all start/end offsets on any Tokens.
>
> ...once those two things are done, anyone using out-of-the-box tokenizers
> can use a CharStream and get correct offsets -- anyone with an existing
> custom Tokenizer should continue to get the same behavior as before, but
> if they want to start using a CharStream they need to tweak their code.

Looks great!
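To make the two-step plan concrete, here is a minimal, self-contained sketch (illustrative names like SimpleWhitespaceTokenizer, not the actual Lucene/Solr API): a CharStream base class exposing correctOffset(), the identity CharReader wrapper, and a tokenizer that runs every start/end offset it reports through correctOffset():

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hedged sketch: simplified stand-ins, not the real Lucene classes.
// A CharStream can map an offset in the (possibly filtered) character
// stream back to an offset in the original input.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Identity mapping: wraps a plain Reader, offsets pass through unchanged.
final class CharReader extends CharStream {
    private final Reader input;
    CharReader(Reader in) { input = in; }
    @Override public int correctOffset(int currentOff) { return currentOff; }
    @Override public int read(char[] buf, int off, int len) throws IOException {
        return input.read(buf, off, len);
    }
    @Override public void close() throws IOException { input.close(); }
}

// Illustrative tokenizer: every start/end offset it produces goes
// through correctOffset(), as the thread recommends.
class SimpleWhitespaceTokenizer {
    private final CharStream in;
    private final String text;   // read eagerly to keep the sketch short
    private int pos = 0;

    SimpleWhitespaceTokenizer(CharStream in) throws IOException {
        this.in = in;
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) sb.append(buf, 0, n);
        text = sb.toString();
    }

    // Returns "term:start:end" with corrected offsets, or null at end of input.
    String next() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return null;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos) + ":"
            + in.correctOffset(start) + ":" + in.correctOffset(pos);
    }
}
```

With the identity CharReader the corrected offsets equal the raw ones, so existing custom tokenizers keep their old behavior; only a non-trivial CharStream changes the numbers.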

> The only potential downside i can think of is the performance cost of the
> added method calls -- but if we make NoOpCharStream.correctOffset final
> the JVM should be able to optimize away the "identity" function
> correct?

I hadn't considered JVM optimization; however, we already have the
final class "CharReader" in Solr 1.4:

public final class CharReader extends CharStream {
  protected Reader input;

  public CharReader( Reader in ){
    input = in;
  }

  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }
  :
}

and CharReader is instantiated in TokenizerChain.
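For contrast with the identity CharReader above, here is a hedged sketch of why correctOffset() matters at all: a CharStream that deletes a character from the input and remembers, for each offset in the filtered text, the corresponding offset in the original. StripCharStream is an illustrative name, not a Solr class:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch, not the Solr API: simplified CharStream base class.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

// Deletes every occurrence of one character and records, per output
// offset, the matching offset in the original input, so that token
// offsets (e.g. for highlighting) can be mapped back correctly.
final class StripCharStream extends CharStream {
    private final String filtered;
    private final List<Integer> origOffsets = new ArrayList<>();
    private int pos = 0;

    StripCharStream(Reader in, char strip) throws IOException {
        StringBuilder orig = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) orig.append(buf, 0, n);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < orig.length(); i++) {
            if (orig.charAt(i) != strip) {
                out.append(orig.charAt(i));
                origOffsets.add(i);        // output offset -> original offset
            }
        }
        origOffsets.add(orig.length());    // end offset maps past the input
        filtered = out.toString();
    }

    @Override
    public int correctOffset(int currentOff) {
        return origOffsets.get(Math.min(currentOff, origOffsets.size() - 1));
    }

    @Override
    public int read(char[] buf, int off, int len) {
        if (pos >= filtered.length()) return -1;
        int n = Math.min(len, filtered.length() - pos);
        filtered.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override public void close() {}
}
```

For input "ca-t" with '-' stripped, the tokenizer sees "cat" at offsets 0..3, but correctOffset() maps that end offset back to 4 in the original text.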

Koji

