"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > 1) If we are deprecating some methods like String termText(), how > about at the same time deprecating "String type"? If we want > lightweight per-token metadata for communication between filters, an > int or a long used as a bitvector (32 or 64 independent boolean vars > per token) would be much more useful than a single String.

You mean replace String type with a series of booleans stored as
bit flags (while also leaving room for custom, application-defined
bit flags)? This sounds nice, but is there a compelling reason to do
this now? E.g., I don't think "String type" costs us much performance
today.

> 2) I think we need to clarify who needs to "clean up" a token's
> state when it's being reused (or if it needs to be cleaned
> up)... for example, in the CharTokenizer, the token type, token
> payload, and positionIncrement is not reset, so they will default to
> the last token's value.... is this a) a bug b) guaranteed behavior
> one can depend on or c) undefined? Since this includes
> positionIncrement, I'm inclined to say that this is a bug. There is
> a Token.clear().... should it be used by either the caller or the
> Tokenizer?

How about: if you are re-using your token, then whoever set the
payload, positionIncrement, etc., should always clear/reset it on the
next token? I.e., your 'next' method must always set a value for X
(X = payload, positionIncrement, etc.) when you are re-using?

Inserting clear() into DocumentsWriter actually causes a non-trivial
performance hit: a quick test tokenizing all of Wikipedia w/
SimpleAnalyzer showed a 6-8% overall slowdown if I call token.clear()
in ReadTokensTask.java.

Mike
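
P.S. To illustrate the "whoever sets X must set X on every next()" convention suggested above, here is a minimal sketch. It is not Lucene code: SketchToken, ReusingTokenizer, and PayloadMarkingFilter are hypothetical stand-ins, and the whitespace splitting and the "capitalized word gets a payload" rule are assumptions made purely for the example.

```java
// Hypothetical stand-in for a reusable token; not Lucene's Token class.
final class SketchToken {
    String text;
    int positionIncrement = 1;
    byte[] payload; // null means "no payload"
}

// Produces whitespace-separated words, re-using one token instance.
final class ReusingTokenizer {
    private final String[] words;
    private final SketchToken reusable = new SketchToken();
    private int index = 0;

    ReusingTokenizer(String input) {
        String trimmed = input.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the shared token, or null at end of input.
    SketchToken next() {
        if (index >= words.length) {
            return null;
        }
        // The tokenizer owns text and positionIncrement, so it assigns
        // both on every call rather than relying on a blanket clear().
        reusable.text = words[index++];
        reusable.positionIncrement = 1;
        return reusable;
    }
}

// Marks capitalized words with a payload (an arbitrary rule for the demo).
final class PayloadMarkingFilter {
    private static final byte[] MARK = {1};
    private final ReusingTokenizer input;

    PayloadMarkingFilter(ReusingTokenizer input) {
        this.input = input;
    }

    SketchToken next() {
        SketchToken t = input.next();
        if (t == null) {
            return null;
        }
        // The filter is the component that sets payload, so it assigns a
        // value (possibly null) on every token. Otherwise the previous
        // token's payload would leak through the re-used instance.
        t.payload = Character.isUpperCase(t.text.charAt(0)) ? MARK : null;
        return t;
    }
}

final class ReuseDemo {
    public static void main(String[] args) {
        PayloadMarkingFilter f =
            new PayloadMarkingFilter(new ReusingTokenizer("Foo bar"));
        for (SketchToken t = f.next(); t != null; t = f.next()) {
            System.out.println(t.text + " payload=" + (t.payload != null));
        }
    }
}
```

Under this convention each component touches only the fields it owns, so the consumer never needs to call a general clear() between tokens.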