"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > To further improve "out of the box" performance I would really also > > like to fix the core analyzers, when possible, to re-use a single > > Token instance for each Token they produce. This would then mean no > > objects are created as you step through Tokens in the TokenStream > > ... so this would give the best performance. > > How much better I wonder? Small object allocation & reclaiming is > supposed to be very good in current JVMs. > You got good performance out of reusing Field objects, i think, > partially because of the convoluted logic-tree in the constructor, > including interning the field name. > > Token is a much lighter weight class, with no unpredictable branches, > less state (except we increased it with the termBuffer stuff ;-) and > no string interning (which presumably includes synchronization).
Good question ... I will run some benchmarks to see if it's worthwhile.

> > The problem is, this would be a sneaky API change and would for
> > example break anyone who gathers the Tokens into an array and
> > processes them later (eg maybe highlighter does?).
>
> I think there is probably quite a bit of code out there that needs to
> look at tokens in context (and thus does some sort of buffering).

Agreed, so we can't change the API.

So the next/nextDirect proposal should work well: it doesn't change the
API, yet it would allow consumers that don't require a "full private
copy" of every Token, like DocumentsWriter, to have good performance.

> > Maybe one way to migrate to this would be to add a new method "Token
> > nextDirect()" to TokenStream which is allowed to return a Token that
> > may be recycled.  This means callers of nextDirect() must make their
> > own copy of the Token if they intend to keep it for later usage.  It
> > would default to "return next()", and I would then default "next()"
> > to call nextDirect() but make a full copy of what's returned.
> > DocumentsWriter would use this to step through the tokens.
> >
> > Analyzers would then implement either next() or nextDirect().  We
> > would fix all core analyzers to implement nextDirect(), and then all
> > other analyzers would remain backwards compatible.
> >
> > From a caller's standpoint the only difference between nextDirect()
> > and next() is that next() guarantees to give you a "full private
> > copy" of the Token but nextDirect() does not.
> >
> > We could also punt on this entirely since it is always possible for
> > people to make their own analyzers that re-use Tokens, but I think
> > having decent "out of the box" performance with Lucene is important.
> > Ie, Lucene's defaults should be set up so that you get decent
> > performance without changing anything; you should not have to work
> > hard to get good performance, and unfortunately today you still do....
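The mutual-defaulting trick in the proposal (nextDirect() falls back to next(), next() falls back to copying nextDirect()'s result, and a stream overrides exactly one of them) could look something like this. Class and method names here are illustrative, not the real Lucene signatures:

```java
// Tiny stand-in for Token, just enough to show the copy semantics.
final class Tok {
    String term;
    Tok(String term) { this.term = term; }
    Tok copy() { return new Tok(term); }
}

abstract class StreamSketch {
    /** May return a recycled Tok; callers must copy it to keep it.
     *  Defaults to next(), so legacy streams that only implement
     *  next() keep working unchanged. */
    public Tok nextDirect() { return next(); }

    /** Guarantees a full private copy; defaults to copying whatever
     *  nextDirect() hands back. */
    public Tok next() {
        Tok t = nextDirect();
        return t == null ? null : t.copy();
    }
}

// A "fixed" stream: implements only nextDirect(), reusing one instance.
final class DirectStream extends StreamSketch {
    private final Tok shared = new Tok(null);   // recycled across calls
    private final String[] terms = {"a", "b"};
    private int i;

    @Override public Tok nextDirect() {
        if (i >= terms.length) return null;
        shared.term = terms[i++];
        return shared;
    }
}
```

A stream must override at least one of the two methods; if it overrides neither, the two defaults recurse into each other, which is the one sharp edge of this pattern.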
> I'd be surprised if we can gain meaningful performance with token
> reuse in a typical full-text analysis chain (stemming, stop-words,
> etc).  Have you done any measurements, or is it a hunch at this
> point?

Will test.

> > Hmmm ... is it too late to make all of Token's package-protected
> > attrs private, so we require use of getters?
>
> They are package protected, which means that they are safe to change
> because we own all the code in that package.

OK, I was unsure whether this is considered an API change, since users
can make their own classes in this package too.  So I will make all of
them (including Payload) private.

> Another thing that would slow down slightly with termBuffer use is
> hash lookups (because Strings cache their hashCode).  So an analyzer
> with a synonym filter and a stop filter would end up calculating that
> twice (or constructing a String).  I don't know if it would be worth
> maintaining that in the Token or not...

Good point; I will benchmark before/after to assess...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
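If it did turn out to be worth maintaining the hash in the Token, one hypothetical approach (not anything Lucene does, just a sketch) would be to compute a String-compatible hash directly over the char[] term buffer and cache it until the buffer is next mutated, so a synonym filter and a stop filter downstream would share one computation:

```java
// Hypothetical token that caches a String-compatible hash of its
// char[] term buffer, invalidated whenever the term changes.
final class HashedToken {
    private char[] buf = new char[0];
    private int len;
    private int hash;
    private boolean hashValid;        // cleared on every mutation

    void setTerm(char[] b, int l) {
        buf = b;
        len = l;
        hashValid = false;
    }

    /** Same formula as String.hashCode(): h = 31*h + c for each char. */
    int termHash() {
        if (!hashValid) {
            int h = 0;
            for (int i = 0; i < len; i++) {
                h = 31 * h + buf[i];
            }
            hash = h;
            hashValid = true;
        }
        return hash;
    }
}
```

Whether the extra int and flag per Token pay for themselves would depend on how many hash-based filters sit in the chain, which is exactly what the before/after benchmark should show.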