"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 7/21/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > To further improve "out of the box" performance I would really also > > like to fix the core analyzers, when possible, to re-use a single > > Token instance for each Token they produce. This would then mean no > > objects are created as you step through Tokens in the TokenStream > > ... so this would give the best performance. > > How much better I wonder? Small object allocation & reclaiming is > supposed to be very good in current JVMs. > You got good performance out of reusing Field objects, i think, > partially because of the convoluted logic-tree in the constructor, > including interning the field name. > > Token is a much lighter weight class, with no unpredictable branches, > less state (except we increased it with the termBuffer stuff ;-) and > no string interning (which presumably includes synchronization).
Good question ... I will run some benchmarks to see if it's worthwhile.

> > The problem is, this would be a sneaky API change and would for
> > example break anyone who gathers the Tokens into an array and
> > processes them later (eg maybe highlighter does?).
>
> I think there is probably quite a bit of code out there that needs to
> look at tokens in context (and thus does some sort of buffering).

Agreed, so we can't change the API.

So the next/nextDirect proposal should work well: it doesn't change the
API, yet it would allow consumers that don't require a "full private
copy" of every Token, like DocumentsWriter, to have good performance.

> > Maybe one way to migrate to this would be to add a new method "Token
> > nextDirect()" to TokenStream which is allowed to return a Token that
> > may be recycled.  This means callers of nextDirect() must make their
> > own copy of the Token if they intend to keep it for later usage.  It
> > would default to "return next()", and I would then default "next()"
> > to call nextDirect() but make a full copy of what's returned.
> > DocumentsWriter would use this to step through the tokens.
> >
> > Analyzers would then implement either next() or nextDirect().  We
> > would fix all core analyzers to implement nextDirect(), and then all
> > other analyzers would remain backwards compatible.
> >
> > From a caller's standpoint the only difference between nextDirect()
> > and next() is that next() guarantees to give you a "full private
> > copy" of the Token but nextDirect() does not.
> >
> > We could also punt on this entirely since it is always possible for
> > people to make their own analyzers that re-use Tokens, but I think
> > having decent "out of the box" performance with Lucene is important.
> > Ie, Lucene's defaults should be set up so that you get decent
> > performance without changing anything; you should not have to work
> > hard to get good performance, and unfortunately today you still do....
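The mutual-defaulting trick in the proposal (nextDirect() falls back to next(), next() falls back to copying nextDirect()'s result, and a stream overrides exactly one of them) could look something like this. Class and method names here are illustrative, not the real Lucene signatures:

```java
// Tiny stand-in for Token, just enough to show the copy semantics.
final class Tok {
    String term;
    Tok(String term) { this.term = term; }
    Tok copy() { return new Tok(term); }
}

abstract class StreamSketch {
    /** May return a recycled Tok; callers must copy it to keep it.
     *  Defaults to next(), so legacy streams that only implement
     *  next() keep working unchanged. */
    public Tok nextDirect() { return next(); }

    /** Guarantees a full private copy; defaults to copying whatever
     *  nextDirect() hands back. */
    public Tok next() {
        Tok t = nextDirect();
        return t == null ? null : t.copy();
    }
}

// A "fixed" stream: implements only nextDirect(), reusing one instance.
final class DirectStream extends StreamSketch {
    private final Tok shared = new Tok(null);   // recycled across calls
    private final String[] terms = {"a", "b"};
    private int i;

    @Override public Tok nextDirect() {
        if (i >= terms.length) return null;
        shared.term = terms[i++];
        return shared;
    }
}
```

A stream must override at least one of the two methods; if it overrides neither, the two defaults recurse into each other, which is the one sharp edge of this pattern.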
> I'd be surprised if we can gain meaningful performance with token
> reuse in a typical full-text analysis chain (stemming, stop-words,
> etc).  Have you done any measurements, or is it a hunch at this
> point?

Will test.

> > Hmmm ... is it too late to make all of Token's package-protected
> > attrs private, so we require use of getters?
>
> They are package protected, which means that they are safe to change
> because we own all the code in that package.

OK, I was unsure whether this is considered an API change, since users
can make their own classes in this package too.  So I will make all of
them (including Payload) private.

> Another thing that would slow down slightly with termBuffer use is
> hash lookups (because Strings cache their hashCode).  So an analyzer
> with a synonym filter and a stop filter would end up calculating that
> twice (or constructing a String).  I don't know if it would be worth
> maintaining that in the Token or not...

Good point; I will benchmark before/after to assess...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
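If it did turn out to be worth maintaining the hash in the Token, one hypothetical approach (not anything Lucene does, just a sketch) would be to compute a String-compatible hash directly over the char[] term buffer and cache it until the buffer is next mutated, so a synonym filter and a stop filter downstream would share one computation:

```java
// Hypothetical token that caches a String-compatible hash of its
// char[] term buffer, invalidated whenever the term changes.
final class HashedToken {
    private char[] buf = new char[0];
    private int len;
    private int hash;
    private boolean hashValid;        // cleared on every mutation

    void setTerm(char[] b, int l) {
        buf = b;
        len = l;
        hashValid = false;
    }

    /** Same formula as String.hashCode(): h = 31*h + c for each char. */
    int termHash() {
        if (!hashValid) {
            int h = 0;
            for (int i = 0; i < len; i++) {
                h = 31 * h + buf[i];
            }
            hash = h;
            hashValid = true;
        }
        return hash;
    }
}
```

Whether the extra int and flag per Token pay for themselves would depend on how many hash-based filters sit in the chain, which is exactly what the before/after benchmark should show.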