Hiroaki Kawai wrote:
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
I agree the situation is not ideal, and it's confusing.
This comes back to LUCENE-969.
At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens
exclusively.
The idea was to allow Token to keep either text or char[], and sometimes both (if they store the same characters, as happens when termBuffer() is called while a String is being stored).
Then, in 3.0, we would make the change you are proposing (to only
store char[] internally). That was the plan, anyway. Accelerating
this plan (to store only char[] today) is compelling, but I worry
about the performance hit to legacy analyzer chains...
I'd like to suggest another implementation which uses
StringBuilder or CharBuffer instead of char[].
StringBuilder has to wait until we are on Java 1.5.
That way we don't need to maintain the length separately from the
character sequence itself.
If we use char[], we have to handle the char[], the offset, and the
sequence length separately, so the methods we implement become complex.
I think those should be packed into one object.
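A minimal sketch of what Kawai is suggesting: java.nio.CharBuffer already packs the backing array, offset, and length into one object, so callers would no longer pass (char[], offset, length) triples around. This is an illustration only, not Lucene code:

```java
import java.nio.CharBuffer;

public class CharBufferTokenSketch {
    // With CharBuffer, a "term" is one object; the buffer itself
    // tracks position and limit, so no separate offset/length args.
    public static String term(CharBuffer buf) {
        return buf.toString();
    }

    public static void main(String[] args) {
        char[] backing = "foo bar".toCharArray();
        // wrap(array, offset, length) exposes just the "bar" slice
        CharBuffer token = CharBuffer.wrap(backing, 4, 3);
        System.out.println(term(token)); // bar
    }
}
```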
I have not tested whether using StringBuilder or CharBuffer hurts
performance, but I suspect the impact would not be too bad.
I'm somewhat less optimistic here. These classes are targeting use
cases with much larger sequences of characters than a typical Token
in a Document. We should test the performance impact to see.
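For what it's worth, a rough micro-benchmark along these lines could be a starting point. This is a hand-rolled sketch with no warmup or JIT control, so treat any numbers it prints with suspicion; StringBuffer is used instead of StringBuilder since the thread assumes pre-1.5 Java:

```java
// Hypothetical micro-benchmark sketch: per-token copy into a reused
// char[] vs. appending into a reused StringBuffer, for short,
// token-sized strings. Not a rigorous benchmark.
public class TokenCopyBench {
    public static long benchCharArray(String[] terms, int rounds) {
        char[] buf = new char[64];
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < terms.length; i++) {
                terms[i].getChars(0, terms[i].length(), buf, 0);
            }
        }
        return System.nanoTime() - start;
    }

    public static long benchStringBuffer(String[] terms, int rounds) {
        StringBuffer buf = new StringBuffer(64);
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < terms.length; i++) {
                buf.setLength(0);
                buf.append(terms[i]);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String[] terms = {"the", "quick", "analyzer", "tokenization"};
        System.out.println("char[]:       " + benchCharArray(terms, 1000000) + " ns");
        System.out.println("StringBuffer: " + benchStringBuffer(terms, 1000000) + " ns");
    }
}
```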
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
-snip-
I was looking at this in light of TokenFilter's next(Token) method and how it was being used. In looking at the contrib filters, they have not been modified. Further, most of them, if they work with the content analysis and generation, do their work in strings. Some of these appear to be good candidates for using char[] rather than strings, such as the NGram filter. But others look like they'd just as well remain with String manipulation.
It would be great to upgrade all contrib filters to use the re-use APIs.
I'll contribute, too. :-)
Fantastic!
I'd like to suggest that, internally, Token be changed to use only the
char[] termBuffer and eliminate termText.
The question is what performance cost we would incur, e.g., on the
contrib (and other) sources/filters. Every time setTermText is called,
we copy out the chars (instead of holding a reference to the String).
Every time getText() is called, we create a new String(...) from the
char[]. I think it's potentially a high cost, so maybe we should
wait until 3.0, when we drop the deprecated APIs?
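To make that cost concrete, here is a hypothetical, stripped-down sketch of a char[]-only Token (the real Lucene Token also carries offsets, type, and position increment). The copy in setTermText and the allocation in termText() would happen on every call:

```java
// Hypothetical, simplified sketch; not the real Lucene Token class.
public class CharOnlyToken {
    private char[] termBuffer = new char[16];
    private int termLength;

    // Every setTermText now copies the String's chars into the buffer
    // (instead of just holding a reference to the String)...
    public void setTermText(String text) {
        if (termBuffer.length < text.length()) {
            termBuffer = new char[text.length()];  // grow as needed
        }
        text.getChars(0, text.length(), termBuffer, 0);
        termLength = text.length();
    }

    // ...and every termText() allocates a fresh String from the buffer.
    public String termText() {
        return new String(termBuffer, 0, termLength);
    }
}
```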
And also, that termText be restored as not deprecated.
It made me nervous keeping this method because it looks like it should
be cheap to call, and in the past it was very cheap to call. But maybe
we could keep it, if we mark very, very clearly in the javadocs the
performance cost you incur by using this method (it makes a new
String() every time)?
I'd like to suggest changing the method definition to:
public void setTermText(CharSequence text)
This seems like a good idea.
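A sketch of how that signature might look; since String, StringBuffer, and CharBuffer all implement CharSequence (as of Java 1.4), callers could pass any of them without conversion. This is illustrative only, not the actual Lucene implementation:

```java
// Hypothetical sketch of the proposed CharSequence-taking setter.
public class TokenSketch {
    private char[] termBuffer = new char[16];
    private int termLength;

    // Accepting CharSequence means a String, StringBuffer, or
    // CharBuffer can all be copied into the buffer with one method.
    public void setTermText(CharSequence text) {
        int len = text.length();
        if (termBuffer.length < len) {
            termBuffer = new char[len];
        }
        for (int i = 0; i < len; i++) {
            termBuffer[i] = text.charAt(i);
        }
        termLength = len;
    }

    public String termText() {
        return new String(termBuffer, 0, termLength);
    }
}
```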
But, in TokenFilter, next() should be deprecated, IMHO.
I think this is a good idea. After all, if people don't want to bother
using the passed-in Token, they are still allowed to return a new one.
I could not see what you meant. Can I ask you to let me know why it
should be deprecated?
Deprecated in favor of the next(Token result) API, i.e., token
sources/filters should migrate to this re-use API. It's a
straightforward migration because the method next(Token result) is
allowed to ignore result (and return its own Token) if it wants to.
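A self-contained sketch of the re-use pattern, with simplified stand-ins for Lucene's Token and TokenStream (these are not the real classes): the filter fills in the caller-supplied Token in place, so no new Token or String is allocated per term:

```java
import java.io.IOException;

// Simplified stand-ins illustrating next(Token result) re-use.
public class ReuseSketch {
    public static class Token {
        public char[] buffer = new char[16];
        public int length;
        public void setTerm(String s) {
            if (buffer.length < s.length()) buffer = new char[s.length()];
            s.getChars(0, s.length(), buffer, 0);
            length = s.length();
        }
        public String term() { return new String(buffer, 0, length); }
    }

    public static abstract class TokenStream {
        // The re-use API: implementations MAY fill in and return
        // 'result', or ignore it and return a brand-new Token.
        public abstract Token next(Token result) throws IOException;
    }

    // A filter that lower-cases terms in place, reusing the buffer.
    public static class LowerCaseishFilter extends TokenStream {
        private final TokenStream input;
        public LowerCaseishFilter(TokenStream input) { this.input = input; }
        public Token next(Token result) throws IOException {
            Token t = input.next(result);
            if (t == null) return null;
            for (int i = 0; i < t.length; i++) {
                t.buffer[i] = Character.toLowerCase(t.buffer[i]);
            }
            return t;  // no new Token, no new String per call
        }
    }

    public static void main(String[] args) throws IOException {
        TokenStream source = new TokenStream() {
            private final String[] terms = {"Hello", "WORLD"};
            private int i = 0;
            public Token next(Token result) {
                if (i == terms.length) return null;
                result.setTerm(terms[i++]);
                return result;
            }
        };
        TokenStream filtered = new LowerCaseishFilter(source);
        Token reusable = new Token();
        for (Token t = filtered.next(reusable); t != null; t = filtered.next(reusable)) {
            System.out.println(t.term()); // hello, then world
        }
    }
}
```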
Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]