Hiroaki Kawai wrote:
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
I agree the situation is not ideal, and it's confusing.
This comes back to LUCENE-969.
At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.
The idea was to allow Token to keep either text or char[], and sometimes both (if they store the same characters, as happens when termBuffer() is called while a String is being stored).
Then, in 3.0, we would make the change you are proposing (to only
store char[] internally). That was the plan, anyway. Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...
I'd like to suggest another implementation which uses
StringBuilder or CharBuffer instead of char[].
StringBuilder has to wait until we are on Java 1.5.
That way we don't need to maintain the length separately from the
character sequence itself.
If we use char[], we have to handle the char[], the offset, and the
sequence length separately, so the methods we implement become complex.
I think those should be packed into one object.
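A minimal sketch of what Kawai is suggesting: java.nio.CharBuffer already packs the backing array, offset, and length into one object, so callers would no longer pass (char[], offset, length) triples around. This is an illustration only, not Lucene code:

```java
import java.nio.CharBuffer;

public class CharBufferTokenSketch {
    // With CharBuffer, a "term" is one object; the buffer itself
    // tracks position and limit, so no separate offset/length args.
    public static String term(CharBuffer buf) {
        return buf.toString();
    }

    public static void main(String[] args) {
        char[] backing = "foo bar".toCharArray();
        // wrap(array, offset, length) exposes just the "bar" slice
        CharBuffer token = CharBuffer.wrap(backing, 4, 3);
        System.out.println(term(token)); // bar
    }
}
```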
I have not tested whether using StringBuilder or CharBuffer hurts
performance, but I suspect the impact would not be too bad.
I'm somewhat less optimistic here. These classes are targeting use
cases with much larger sequences of characters than a typical Token
in a Document. We should test the performance impact to see.
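For what it's worth, a rough micro-benchmark along these lines could be a starting point. This is a hand-rolled sketch with no warmup or JIT control, so treat any numbers it prints with suspicion; StringBuffer is used instead of StringBuilder since the thread assumes pre-1.5 Java:

```java
// Hypothetical micro-benchmark sketch: per-token copy into a reused
// char[] vs. appending into a reused StringBuffer, for short,
// token-sized strings. Not a rigorous benchmark.
public class TokenCopyBench {
    public static long benchCharArray(String[] terms, int rounds) {
        char[] buf = new char[64];
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < terms.length; i++) {
                terms[i].getChars(0, terms[i].length(), buf, 0);
            }
        }
        return System.nanoTime() - start;
    }

    public static long benchStringBuffer(String[] terms, int rounds) {
        StringBuffer buf = new StringBuffer(64);
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < terms.length; i++) {
                buf.setLength(0);
                buf.append(terms[i]);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String[] terms = {"the", "quick", "analyzer", "tokenization"};
        System.out.println("char[]:       " + benchCharArray(terms, 1000000) + " ns");
        System.out.println("StringBuffer: " + benchStringBuffer(terms, 1000000) + " ns");
    }
}
```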
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
-snip-
I was looking at this in light of TokenFilter's next(Token) method and how it was being used. In looking at the contrib filters, they have not been modified. Further, most of them, if they work with the content analysis and generation, do their work in strings. Some of these appear to be good candidates for using char[] rather than strings, such as the NGram filter. But others look like they'd just as well remain with String manipulation.
It would be great to upgrade all contrib filters to use the re-use APIs.
I'll contribute, too. :-)
Fantastic!
I'd like to suggest that, internally, Token be changed to use only the
char[] termBuffer and eliminate termText.
The question is what performance cost we would incur, e.g., on the
contrib (and other) sources/filters. Every time setTermText is called,
we copy out the chars (instead of holding a reference to the String).
Every time getText() is called, we create a new String(...) from the
char[]. I think it's potentially a high cost, so maybe we should
wait until 3.0, when we drop the deprecated APIs?
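To make that cost concrete, here is a hypothetical, stripped-down sketch of a char[]-only Token (the real Lucene Token also carries offsets, type, and position increment). The copy in setTermText and the allocation in termText() would happen on every call:

```java
// Hypothetical, simplified sketch; not the real Lucene Token class.
public class CharOnlyToken {
    private char[] termBuffer = new char[16];
    private int termLength;

    // Every setTermText now copies the String's chars into the buffer
    // (instead of just holding a reference to the String)...
    public void setTermText(String text) {
        if (termBuffer.length < text.length()) {
            termBuffer = new char[text.length()];  // grow as needed
        }
        text.getChars(0, text.length(), termBuffer, 0);
        termLength = text.length();
    }

    // ...and every termText() allocates a fresh String from the buffer.
    public String termText() {
        return new String(termBuffer, 0, termLength);
    }
}
```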
And also, that termText be restored as not deprecated.
It made me nervous keeping this method because it looks like it should
be cheap to call, and in the past it was very cheap to call. But maybe
we could keep it, if we mark very, very clearly in the javadocs the
performance cost you incur by using this method (it makes a new
String() every time)?
I'd like to suggest changing the method definition to:
public void setTermText(CharSequence text)
This seems like a good idea.
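A sketch of how that signature might look; since String, StringBuffer, and CharBuffer all implement CharSequence (as of Java 1.4), callers could pass any of them without conversion. This is illustrative only, not the actual Lucene implementation:

```java
// Hypothetical sketch of the proposed CharSequence-taking setter.
public class TokenSketch {
    private char[] termBuffer = new char[16];
    private int termLength;

    // Accepting CharSequence means a String, StringBuffer, or
    // CharBuffer can all be copied into the buffer with one method.
    public void setTermText(CharSequence text) {
        int len = text.length();
        if (termBuffer.length < len) {
            termBuffer = new char[len];
        }
        for (int i = 0; i < len; i++) {
            termBuffer[i] = text.charAt(i);
        }
        termLength = len;
    }

    public String termText() {
        return new String(termBuffer, 0, termLength);
    }
}
```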
But, in TokenFilter, next() should be deprecated, IMHO.
I think this is a good idea. After all, if people don't want to bother
using the passed-in Token, they are still allowed to return a new one.
I could not see what you meant. Can I ask you to let me know why it
should be deprecated?
Deprecated in favor of the next(Token result) API, i.e., token
sources/filters should migrate to this re-use API. It's a
straightforward migration because the method next(Token result) is
allowed to ignore result (and return its own Token) if it wants to.
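A self-contained sketch of the re-use pattern, with simplified stand-ins for Lucene's Token and TokenStream (these are not the real classes): the filter fills in the caller-supplied Token in place, so no new Token or String is allocated per term:

```java
import java.io.IOException;

// Simplified stand-ins illustrating next(Token result) re-use.
public class ReuseSketch {
    public static class Token {
        public char[] buffer = new char[16];
        public int length;
        public void setTerm(String s) {
            if (buffer.length < s.length()) buffer = new char[s.length()];
            s.getChars(0, s.length(), buffer, 0);
            length = s.length();
        }
        public String term() { return new String(buffer, 0, length); }
    }

    public static abstract class TokenStream {
        // The re-use API: implementations MAY fill in and return
        // 'result', or ignore it and return a brand-new Token.
        public abstract Token next(Token result) throws IOException;
    }

    // A filter that lower-cases terms in place, reusing the buffer.
    public static class LowerCaseishFilter extends TokenStream {
        private final TokenStream input;
        public LowerCaseishFilter(TokenStream input) { this.input = input; }
        public Token next(Token result) throws IOException {
            Token t = input.next(result);
            if (t == null) return null;
            for (int i = 0; i < t.length; i++) {
                t.buffer[i] = Character.toLowerCase(t.buffer[i]);
            }
            return t;  // no new Token, no new String per call
        }
    }

    public static void main(String[] args) throws IOException {
        TokenStream source = new TokenStream() {
            private final String[] terms = {"Hello", "WORLD"};
            private int i = 0;
            public Token next(Token result) {
                if (i == terms.length) return null;
                result.setTerm(terms[i++]);
                return result;
            }
        };
        TokenStream filtered = new LowerCaseishFilter(source);
        Token reusable = new Token();
        for (Token t = filtered.next(reusable); t != null; t = filtered.next(reusable)) {
            System.out.println(t.term()); // hello, then world
        }
    }
}
```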
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]