"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > OK, I ran some benchmarks here.
> >
> > The performance gains are sizable: 12.8% speedup using Sun's JDK 5 and
> > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all
> > Wikipedia content using LowerCaseTokenizer + StopFilter +
> > PorterStemFilter. I think it's worth pursuing!
>
> Did you try it w/o token reuse (reuse tokens only when mutating, not
> when creating new tokens from the tokenizer)?

I haven't tried this variant yet. I guess for long filter chains the
GC cost of the tokenizer making the initial token should go down as
a share of the overall time. Though I think we should still re-use
the initial token since it should (?) only help.

> It would be interesting to see what's attributable to Token reuse only
> (after core filters have been optimized to use the char[] setters,
> etc).

Good question; it could be the gains are mostly from switching to
char[] termBuffer and less so from Token instance re-use. Too many
tests to try :)

> We've had issues in the past regarding errors with filters dealing
> with token properties:
> 1) filters creating a new token from an old token, but forgetting
> about setting positionIncrement
> 2) legacy filters losing "new" information such as payloads when
> creating new tokens, because they didn't exist when the filter was
> written.
>
> #1 is solved by token mutation because there are setters for the value
> (before, the filter author was forced to create a new token, unless
> they could access the package-private String).

Ahhh, good!

> #2 can now be solved by clone() (another relatively new addition)
>
> So what new problems might crop up with token reuse?
> - a filter reusing a token, but not zeroing out something new like
> payloads because they didn't exist when the filter was authored (the
> opposite problem from before)
>
> Would a Token.clear() be needed for use by (primarily) tokenizers?

Hmm, good point; I like the clear() idea. I will add that.

Mike
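
To make the stale-attribute concern concrete, here is a minimal sketch of
the reuse pattern and of what a clear() would buy us. This is not Lucene's
actual Token class: ReusableToken, TokenReuseSketch and the next(...)
helper are hypothetical names made up for illustration, and the field
layout is an assumption. It only shows a tokenizer that predates payloads
refilling a reused token, with and without calling clear() first.

```java
/** Hypothetical stand-in for a reusable token; NOT Lucene's actual Token class. */
final class ReusableToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;
    private int positionIncrement = 1;
    private byte[] payload = null; // a "newer" attribute an old tokenizer may not know about

    /** Copies the term text into the internal buffer, growing it only when needed. */
    void setTerm(char[] src, int offset, int length) {
        if (length > termBuffer.length) {
            termBuffer = new char[Math.max(length, termBuffer.length * 2)];
        }
        System.arraycopy(src, offset, termBuffer, 0, length);
        termLength = length;
    }

    String term() { return new String(termBuffer, 0, termLength); }
    void setPositionIncrement(int inc) { positionIncrement = inc; }
    int getPositionIncrement() { return positionIncrement; }
    void setPayload(byte[] p) { payload = p; }
    byte[] getPayload() { return payload; }

    /** Resets every per-token attribute to its default so nothing leaks across reuse. */
    void clear() {
        termLength = 0;
        positionIncrement = 1;
        payload = null;
    }
}

final class TokenReuseSketch {
    /** Simulates a tokenizer written before payloads existed: it sets only the term. */
    static ReusableToken next(ReusableToken reusable, String text, boolean callClear) {
        if (callClear) {
            reusable.clear();
        }
        char[] chars = text.toCharArray();
        reusable.setTerm(chars, 0, chars.length);
        return reusable;
    }

    public static void main(String[] args) {
        ReusableToken t = new ReusableToken();
        next(t, "first", true);
        t.setPayload(new byte[] {1}); // a downstream filter attaches a payload
        t.setPositionIncrement(3);

        // Reuse without clear(): the payload and increment from "first" survive.
        next(t, "second", false);
        System.out.println(t.term() + " stale payload? " + (t.getPayload() != null));

        // Reuse with clear(): every attribute is back at its default.
        next(t, "third", true);
        System.out.println(t.term() + " stale payload? " + (t.getPayload() != null));
    }
}
```

The point of putting the reset in one method on the token is that a
tokenizer written today keeps working when a new attribute is added later:
only clear() has to learn about it.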
