Re: Token implementation

Michael McCandless Tue, 20 May 2008 02:02:28 -0700


DM Smith wrote:

On May 19, 2008, at 4:33 PM, Michael McCandless wrote:
DM Smith <[EMAIL PROTECTED]> wrote:
Michael McCandless wrote:
I agree the situation is not ideal, and it's confusing.
My problem as a user is that I have to read the code to figureout how to
optimally use the class. The JavaDoc is a bit wanting.
Yeah we should fix the javadocs.
This comes back to LUCENE-969.

At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.
The idea was to allow Token to keep both text or char[] andsometimes
both (if they are storing the same characters, as happens if
termBuffer() is called when it's a String being stored)
When termBuffer() is called termText is always null on return.This is theinvariant of initTermBuffer() which is called directly orindirectly by
every method that touches termBuffer.
Sorry I meant termText().
It is only after calling termText() that one could have both. Theonlyadvantage I see here is that calling it twice without anyintervening call
to a termBuffer method would it return the same physical string.
Right.
After calling setTermText(newText), termBuffer is set to null.
I presume the purpose of a filter is to get the token content,and ifmodified, set the new content. If so, the result of the setXXXwill be that
either termText or termBuffer will be null.
Right, though, if there's no change, and the next filter in the chain
calls termText(), we save constructing a new String by caching it.
Then, in 3.0, we would make the change you are proposing (to only
store char[] internally).  That was the plan, anyway.  Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...
I wonder whether it is all that significant an issue. Today, someof theLucene analyzers have been migrated to using char[], whileothers, notably
contrib, continue to use text.
IMHO: Prior to char[], the text was replaceable, but notmodifiable. From apractical perspective, Token reuse minimized the cost ofconstruction, butnot much else. The performance of a Token was predictable, butthe filterwas potentially sub-optimal. With char[] and supporting methods,the text
became modifiable, too.
When a filter calls setTermText or setTermBuffer, it does notknow how the
consumer of the token will work. It could be that it stores it with
setTermText and the next filter calls termBuffer().
I may not understand this correctly, but it seems to me that thefollowingis plausible given a filter chain of Lucene provided filters(including contrib)If we have a token filter chain of A -> B -> C, which uses next()in anypart of the chain, the flow of a reusable Token is stopped. Agiven filter
may cache a Token and reuse it. So consider the following scenario:
A overrides next(Token) and reuses the token via char[]
B overrides next() and has a cached Token and updates text.
C overrides next(Token) and reuses the token via char[].

First run:
After A is done, the termText in the supplied Token is null andtermBuffer
has a value.

B provides it's own token so it is not influenced by A.
C is given the token that B returns, because the caller iseffectively using"token = input.next(token)", but because the token has text, aperformancehit is taken to put it into char[]. Both text and char[] startout the same,
but because char[] is changed, termText is set to null.

Second run:
A starts with a Token with a char[] because it is reusing thetoken from thelast run or because it is using a localToken from the caller. Ifit is alocalToken, then the scenario is as above. But if it is the endresult ofthe first run, then A is re-using the token that is cached in B.Since C
last modified it, it is char[].
B uses its cached Token, but it was modified by A to be char[]with null
text. Now B takes a performance hit as it creates a new String.

C is as in the first run.

Another scenario:
A, B and C are all legacy. This would only be filter chains thatare notprovided by core Lucene as the core filter chains have beenmigrated. This
would be a performance hit.
Why is this a performance hit?  If they are all legacy they would all
implemented the non-reuse next() API, and would use setTermText, and
no conversion to char[] would happen (except inside DocumentsWriter)?
I was thinking faster than I was typing. I meant to say "not be aperformance hit".


Ahh, ok.

A last scenario:
A, B and C are all char[]. This would not take a performance hit.

It seems to me that in a mixed chain, that there will always be a
performance hit.
Right, though since we cache the String we should save on multiple
calls to termText().  And in a non-mixed chain (all new or all old) I
think there wouldn't be a hit.
But my guess is that maintaining termBuffer once used would
be a good thing.
Interesting...
So, a modified suggestion to maintain the performance but improvethe first
scenario. Do you see any problem with the following?
If termBuffer is used in a token, then it is maintained and neverset to null.Note also that resizeTermBuffer(size) maintains the termBuffer.That is, itcopies the text when the array is grown. When it is known that itis goingto be slammed this is unnecessary. One can implement a helperfunction that
merely grows the array. There are a couple of places this

Thus setTokenText would be something like:
public void setTermText(String text) {
 termText = text;
 if (termBuffer != null) {
   growTermBuffer(termText.length());
   termText.getChars(0, termText.length(), termBuffer, 0);
 }
}

A possible implementation to grow the array would be:
private void growTermBuffer(int newSize)
{
// determine the best size
if (newSize < MIN_BUFFER_SIZE) {
 newSize = MIN_BUFFER_SIZE;
}
// if the buffer exists and is too small, then determine a bettersize.
// this is the current doubling algorithm. it could be better.
int tbLength = termBuffer == null ? 0 : termBuffer.length;
if (tbLength > 0 && newSize > tbLength) {
 int size = tbLength;
 while (size < newSize) {
    size *= 2;
 }
  newSize = size;
}

// Check to see if the buffer needs to be resized
if (newSize > tbLength)
{
 termBuffer = new char[newSize];
}
}
That looks good.  Though, if Token is 100% re-used (full chain is
"new") then I would expect growing the char[] to be very low cost
(happens very few times).
Without peeking at your Python sample below ;) the doublingalgorithm is costly as it over allocates too much. But it has thenice behavior that reallocations are fewer.
But, now peeking ahead, a less aggressive allocation policy wouldcause more allocations and the array copy becomes more important.Though not significantly. But in a mixed environment with somesharing, it could be magnified.
I didn't dig into it, but in the core, are reusable Tokens perdocument or per DocumentWriter. If it is the former then the copybecomes more significant. If it is the latter, then it is almostmoot, as you point out.


It's actually per DocumentsWriter per thread per field.

More below....
More responses below:

DM Smith <[EMAIL PROTECTED]> wrote:
I think the Token implementation as it stands can use someimprovement andI'd be willing to do it. I'd like some input, though.Especially because it
is core to Lucene.
I've been working on eliminating deprecations from my user codeand I ran
across Token.getText() as being deprecated.

This is not about my code, but the code in Token.
In Token, it allows one of two representations of the actualtoken, eitheran immutable String, or a mutable char[]. One can flip back andforth
between these all to easily.
termText() is deprecated so termBuffer() is suggested as areplacement.
Calling termBuffer() will potentially copy the text out of theString
termText and into the newly created buffer and return it.
Calling setTermText(str), which is not deprecated, will dropthe buffer and
save the str in termText.

It appears that the invariant that is trying to be established is
either termText or termBuffer holds the content, but not both.
However, termBuffer() potentially violates this by loadingtermText
with the termBuffer, but not nulling out the buffer.
Actually, both are allowed to be set, as long as they are the same.
So termBuffer() is allowed to leave both non-null.
I'm not sure of the advantage.
Covered above (mixed chains).
Also, in my code, I am not manipulating char[] so getting thebuffer, I needto immediately convert it to a string to process it. And thenwhen I'm done,I have a String possibly of some other length. To stuff it backinto the
termBuffer, requires a call to:
setTermBuffer(s.toCharArray(), o, s.length())
It would be better to call Token.resizeTermBuffer(...), then
s.getChars into the Token's term buffer (saves a buffer copy).
Many thanks. I saw the comment, but missed getChars. This isbetter, but
even better would be growTermBuffer (above) since resizeTermBuffer
potentially has an unnecessary copy.
True, but remember growing should be very rare in 100% reuse (new)case.
I'm presuming then that the DocumentWriter is the holder of the onereusable Token.
I was looking at this in light of TokenFilter's next(Token)method and howit was being used. In looking at the contrib filters, they havenot beenmodified. Further, most of them, if they work with the contentanalysis andgeneration, do their work in strings. Some of these appear tobe goodcandidates for using char[] rather than strings, such as theNGram filter.But others look like they'd just as well remain with Stringmanipulation.
It would be great to upgrade all contrib filters to use the re-use APIs.
I'll be happy to work toward that end. I think it affects myperformance
today.
That would be great :)
I'd like to suggest that internally, that Token be changed toonly use
char[] termBuffer and eliminate termText.
The question is what performance cost we are incurring eg on the
contrib (& other) sources/filters? Every time setTermText iscalled,we copy out the chars (instead of holding a reference to theString).
Every time getText() is called we create a new String(...) from the
char[]. I think it's potentially a high cost, and so maybe weshould
wait until 3.0 when we drop the deprecated APIs?
See above. I'll concede that. But I think that once termBuffer isused
because of a mixed chain, it should be retained.
This may cause problems if a core Tokenizer is used to produce the
tokens, but then a big chain of old (non-reuse) filters is used after
that?
If the reusable token is per DocumentWriter and the chain is perDocument, then it is an advantage to retain it. he buffer is therefor the next Document, likely big enough that it does not need tobe allocated again.

Well, remember that a single non-reuse filter in the chain "breaks"the sharing, going backwards. IE the new token source will see adifferent Token with every next() call. So the buffer doesn't getreused. So I think in this case we would spend alot of extra timeconverting the Strings in the non-reuse filters back to char[].

And also, that termText be restored as not deprecated.
It made me nervous keeping this method because it looks like itshould
be cheap to call, and in the past it was very cheap to call.  But,
maybe we could keep it, if we mark very very clearly in thejavadocs
the performance cost you incur by using this method (it makes a new
String() every time)?
But, in TokenFilter, next() should be deprecated, IMHO.
I think this is a good idea. After all if people don't want tobother
using the passed in Token, they are still allowed to return a new
one.
Should the constructors in Token that take a String bedeprecated, too? The
comments seem to suggest that.

Actually I think they should. We should strongly encourage the re-use

APIs.

I have also used a better algorithm than doubling for resizing an
array. I'd have to hunt for it.


That would be great!


I'm looking, but still haven't found it. :)

I'm looking into digging into this one.


Python has an interesting approach, for its list type.  Here's a
snippet from listobject.c from Python's sources:

        /* This over-allocates proportional to the list size, making room
         * for additional growth.  The over-allocation is mild, but is
         * enough to give linear-time amortized behavior over a long
         * sequence of appends() in the presence of a poorly-performing
         * system realloc().
         * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
         */
        new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6) + newsize;

This is a good one. It is computationally simple and it is not tooaggressive.



I'm sure Perl has something interesting too :)  But I think this is
probably overkill for us ...

Based on the outcome of this discussion, would this be one Jiraissue?
Reopening LUCENE-969? A separate issue for the contrib changes?


I would say open a new issue?

OK. I'm on vacation for the next week and I don't know what kind ofconnectivity I'll have. So it might wait a bit.

Since this would not be an API change, would there need to bechanges to
test cases? If so, what would you suggest?
I think the existing test cases are likely sufficient though I always
love adding new ones ;)


Me too ;)

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Token implementation

Reply via email to