Re: new Token API

Michael McCandless Sun, 18 Nov 2007 03:07:33 -0800

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

> 1) If we are deprecating some methods like String termText(), how
> about at the same time deprecating "String type"?  If we want
> lightweight per-token metadata for communication between filters, an
> int or a long used as a bitvector (32 or 64 independent boolean vars
> per token) would be much more useful than a single String.


You mean replace String type with a set of booleans stored as
bit-flags (leaving room for custom, application-defined flags)?  This
sounds nice, but is there a compelling reason to do it now?  Eg, I
don't think "String type" costs us much performance today.
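
For concreteness, here is a rough sketch of what an int bitvector on a
token could look like. Everything in it is hypothetical: FlaggedToken,
the FLAG_* names and the split between built-in and application bits
are illustrative assumptions, not the actual Token class or a
proposed API.

```java
// Hypothetical sketch only; this is not Lucene's Token class.
final class FlaggedToken {
    // Illustrative flags that filters in a chain might agree on.
    static final int FLAG_KEYWORD = 1 << 0;
    static final int FLAG_ACRONYM = 1 << 1;
    // Assumption: the upper 16 bits are left for application-defined flags.
    static final int FLAG_APP_BASE = 1 << 16;

    private int flags; // 32 independent booleans packed into one int

    void setFlag(int mask) { flags |= mask; }

    void clearFlag(int mask) { flags &= ~mask; }

    // True only if every bit in mask is set.
    boolean hasFlag(int mask) { return (flags & mask) == mask; }

    int getFlags() { return flags; }

    // Resetting all per-token metadata is a single store.
    void clearFlags() { flags = 0; }
}
```

A filter would call setFlag(FlaggedToken.FLAG_KEYWORD) on a token and
a later filter would test hasFlag(FlaggedToken.FLAG_KEYWORD), with no
String comparison involved.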

> 2) I think we need to clarify who needs to "clean up" a token's
> state when it's being reused (or if it needs to be cleaned
> up)... for example, in the CharTokenizer, the token type, token
> payload, and positionIncrement is not reset, so they will default to
> the last token's value.... is this a) a bug b) guaranteed behavior
> one can depend on or c) undefined?  Since this includes
> positionIncrement, I'm inclined to say that this is a bug.  There is
> a Token.clear()....  should it be used by either the caller or the
> Tokenizer?

How about: if you are re-using your token, then whoever sets the
payload, positionIncrement, etc. on one token must also clear/reset
it on the next token?  Ie, your 'next' method must always set a value
for X (X = payload, positionIncrement, etc.) when you are re-using.
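
To illustrate that rule, here is a minimal sketch under assumed
names. SimpleToken and StopSkippingSplitter are made-up stand-ins,
not the real Token or any existing Tokenizer. The point is only that
the class which touches positionIncrement assigns it on every call to
next, so a re-used token never carries over the previous value.

```java
// Hypothetical stand-in for a reusable token; not Lucene's Token.
final class SimpleToken {
    String text;
    int positionIncrement = 1;
}

// Hypothetical producer that skips stop words and records the gap.
final class StopSkippingSplitter {
    private final String[] words;
    private final java.util.Set<String> stopWords;
    private int index;

    StopSkippingSplitter(String input, java.util.Set<String> stopWords) {
        String trimmed = input.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        this.stopWords = stopWords;
    }

    // Fills in the caller's token and returns it, or returns null at end of input.
    SimpleToken next(SimpleToken reusable) {
        int skipped = 0;
        while (index < words.length && stopWords.contains(words[index])) {
            index++;
            skipped++;
        }
        if (index >= words.length) {
            return null;
        }
        reusable.text = words[index++];
        // This class owns positionIncrement, so it sets it unconditionally,
        // including the common case of 1, rather than relying on clear().
        reusable.positionIncrement = 1 + skipped;
        return reusable;
    }
}
```

With input "the quick brown" and "the" as the only stop word, the
re-used token comes back with positionIncrement 2 for "quick" and is
reset to 1 for "brown", without the consumer ever calling clear().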

Inserting clear() into DocumentsWriter actually causes a non-trivial
performance hit -- a quick test tokenizing all of Wikipedia w/
SimpleAnalyzer showed 6-8% overall slowdown if I call token.clear() in
ReadTokensTask.java.

Mike
