"Chris Hostetter" <[EMAIL PROTECTED]> wrote: > > : > Currently the char[] wins, but good point: seems like each setter > : > should null out the other one? > : > : Certainly the String setter should null the char[] (that's the only > : way to keep back compatibility), and probably vice-versa. > > i haven't really thought baout this before today (i missed seeing the > char[] stuff get added to Token as well) but if we're confident the > char[] > stuff is hte direction we want to go, then i believe the cleanest forward > migration plan is... > > 1) deprecate Token.termText, Token.getTermText(), Token.setTermText > 2) make Token.setTermBuffer() null out Token.termText (document) > 3) make Token.setTermText() null out Token.termBuffer > 4) refactor all of the the "if (null == termBuffer)" logic in > DocumentsWriter into a the Token class, ala... > public final char[] termBuffer() { > initTermBuffer(); > return termBuffer; > } > public final int termBufferOffset() { > initTermBuffer(); > return termBufferOffset; > } > public final int termBufferLength() { > initTermBuffer(); > return termBufferLength; > } > private void initTermBuffer() { > if (null != termBuffer) return; > termBufferOffset=0; > termBufferLength = termText.length(); > termBuffer = char[termBufferLength]; > termText.getChars(0, termBufferLength, termBuffer, 0) > } > ...such that DocumentsWRiter never uses termText just termBuffer > 5) at some point down the road, modify all of the "core" TokenStreams to > use termBuffer instead of termText > 6) at some point way down the road, delete the depreacated > methods/variables and the Token.initTermBuffer method. > > Unless I've missed something, the end result should be... > > a) existing TokenStreams that use termText exclusively and don't know > anything about termBuffer will have the exact same performance > characteristics that they currently do (a char[] will be created on > demand > the first time termBuffer is used -- by DocumentsWriter) > > b) TokenStreams which wind up being a mix of old and new code using both > termText and termBuffer should work correctly in any combination. > > c) new TokenStreams that use termBuffer exclusively should work fine, and > have decent performance even with the overhead of the initTermBuffer() > call (which will get better once the deprecated legacy termText usage can > be removed.
I like this approach Hoss! I will open an issue and work on it; I'd like to finish up through your step 5 below. This way "out of the box" performance of Lucene is improved, for people who use the core analyzers and filters. To further improve "out of the box" performance I would really also like to fix the core analyzers, when possible, to re-use a single Token instance for each Token they produce. This would then mean no objects are created as you step through Tokens in the TokenStream ... so this would give the best performance. The problem is, this would be a sneaky API change and would for example break anyone who gathers the Tokens into an array and processes them later (eg maybe highlighter does?). Maybe one way to migrate to this would be to add a new method "Token nextDirect()" to TokenStream which is allowed to return a Token that may be recycled. This means callers of nextDirect() must make their own copy of the Token if they intend to keep it for later usage. It would default to "return next()" and I would then default "next()" to call nextDirect() but make a full copy of what's returned. DocumentsWriter would use this to step through the tokens. Analyzers would then implement either next() or nextDirect(). We would fix all core analyzers to implemente nextDirect(), and then all other analyzers would remain backwards compatible. >From a caller's standpoint the only difference between nextDirect() and next() is that next() guarantees to give you a "full private copy" of the Token but nextDirect() does not. We could also punt on this entirely since it is always possible for people to make their own analyzers that re-use Tokens, but I think having decent "out of the box" performance with Lucene is important. Ie, Lucene's defaults should be set up so that you get decent performance without changing anything; you should not have to work hard to get good performance and unfortunately today you still do.... > Side note: Token.toString() is current jacked in cases where termBuffer is > used) Woops -- sorry -- I will add test case & fix it. > Note that there are many existing filters that directly access and > manipulate the package protected String termText. These will need to > be changed. Hmmm ... is it too late to make all of Token's package protected attrs private, so we require use of getters? Or, maybe just make the new ones (termBuffer*) private and then leave termText package protected but deprecate it and add a comment saying that core analyzers and filters have switched to using termBuffer and so you may get an NPE if you access termText directly? Plus fix all core analyzers to not directly access termText and use termBuffer instead... Mike --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]