"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On Nov 18, 2007 6:07 AM, Michael McCandless <[EMAIL PROTECTED]> > wrote: > > a quick test tokenizing all of Wikipedia w/ > > SimpleAnalyzer showed 6-8% overall slowdown if I call token.clear() in > > ReadTokensTask.java. > > We could slim down clear() a little by only resetting certain things... > startOffset and endOffset need to be set each time if anyone cares > about offsets, so they don't really need to be reset. The only > tokenizer to use "type" sets it every time AFAIK, so would could argue > for skipping that as well. Not sure if the small performance gain > would be worth it though.
Honestly, I was surprised by how sizable the performance difference was
when clearing each token.  I don't understand why.  I wonder if more
frequently setting pointers to null somehow causes GC to kick in more
often or something?  (I was using Sun's JDK 1.5.0_08 on Linux.)  If so,
it could be that setting payloadLength=0 (once payload is inlined) would
be faster than setting payloadBytes=null.  And, maybe, we should in fact
have a local payload byte[] instead of holding it by reference, so we
don't keep changing that pointer with every token.

Anyway, I do think it's worth paring clear() back to what absolutely
must be reset.  We could even reset the fields directly from
DocumentsWriter.

I've found that keeping good performance requires being absurdly
vigilant: if we slip a bit here and a bit there, then suddenly we'll
find that we've become slow.

Mike
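PS: to make "inlining the payload" concrete, here's roughly what I have
in mind (sketch only; the names are made up, not what's in Token today):

    // Hypothetical: Token owns its payload storage, so clearing is an
    // int store instead of a pointer write, and the byte[] is re-used
    // across tokens rather than re-allocated.
    class Token {
      private byte[] payloadBytes = new byte[0]; // grown on demand, never nulled
      private int payloadLength;                 // 0 means "no payload"

      void clearPayload() {
        payloadLength = 0; // no pointer churn, nothing new for GC to track
      }

      void setPayload(byte[] data, int offset, int length) {
        if (payloadBytes.length < length) {
          payloadBytes = new byte[length * 2]; // simple over-allocation for re-use
        }
        System.arraycopy(data, offset, payloadBytes, 0, length);
        payloadLength = length;
      }
    }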