"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On Nov 18, 2007 6:07 AM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
> > a quick test tokenizing all of Wikipedia w/
> > SimpleAnalyzer showed 6-8% overall slowdown if I call token.clear() in
> > ReadTokensTask.java.
> 
> We could slim down clear() a little by only resetting certain things...
> startOffset and endOffset need to be set each time if anyone cares
> about offsets, so they don't really need to be reset.  The only
> tokenizer to use "type" sets it every time AFAIK, so one could argue
> for skipping that as well.  Not sure if the small performance gain
> would be worth it though.
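
Something like this, maybe (a rough sketch; the field names only
approximate Token's internals, and which fields are safe to skip is
exactly the open question):

  // Sketch of a slimmed-down Token.clear(): reset only the fields a
  // tokenizer is not guaranteed to overwrite for the next token.
  public void clear() {
    payload = null;         // must reset, else a stale payload leaks into the next token
    positionIncrement = 1;  // must reset; the default is 1
    termLength = 0;         // keep the termBuffer char[] itself for re-use
    // startOffset/endOffset: skipped -- anyone who cares about offsets
    //   sets them for every token anyway
    // type: skipped -- the only tokenizer that uses it sets it each time
  }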

Honestly, I was surprised by how sizable the performance difference
was when clearing each token, and I don't fully understand why.  I
wonder if more frequently setting pointers to null somehow causes GC
to kick in more often, perhaps via the write barrier that accompanies
every reference store?  (I was using Sun's JDK 1.5.0_08 on Linux.)
If so, setting payloadLength=0 (once the payload is inlined into
Token) could well be faster than setting payloadBytes=null.
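
That is, with the payload fields inlined into Token, clear() would do
an int store instead of a reference store -- something like this
(hypothetical field names, since the payload isn't inlined yet):

  private byte[] payloadBytes;  // inlined payload data, re-used across tokens
  private int payloadLength;    // 0 means "no payload"

  // In clear(), instead of payloadBytes = null:
  payloadLength = 0;  // plain int store; no reference write for the GC to track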

And maybe we should in fact keep a local payload byte[] inside the
Token, copied into rather than held by reference, so we don't keep
changing that pointer with every token.
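
Roughly like this (again hypothetical, building on the inlined fields
above):

  // Copy payload bytes into the token-owned buffer so the
  // payloadBytes pointer changes only when the buffer must grow.
  public void setPayloadBytes(byte[] data, int offset, int length) {
    if (payloadBytes == null || payloadBytes.length < length)
      payloadBytes = new byte[length];  // grow only when needed
    System.arraycopy(data, offset, payloadBytes, 0, length);
    payloadLength = length;
  }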

Anyway, I do think it's worth paring clear() back to only what
absolutely must be cleared.  We could even reset the fields directly
from DocumentsWriter.  I've found that keeping good performance
requires being absurdly vigilant: if we slip a bit here and a bit
there, then suddenly we'll find that we've become slow.
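
By resetting directly I mean something along these lines in the
indexing loop (just a sketch, not DocumentsWriter's actual code):

  final Token reusableToken = new Token();
  for (Token token = stream.next(reusableToken); token != null;
       token = stream.next(reusableToken)) {
    // ... consume term text, offsets, payload ...
    token.setPositionIncrement(1);  // reset only the fields the next
    token.setPayload(null);         //   tokenizer won't overwrite
  }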

Mike
