"Michael McCandless" <[EMAIL PROTECTED]> wrote: > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > > > > OK, I ran some benchmarks here. > > > > > > > > > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK > > > > > > > 5 and > > > > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter + > > > > > > > PorterStemFilter. I think it's worth pursuing! > > > > > > > > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not > > > > > > when creating new tokens from the tokenizer)? > > > > > > > > > > I haven't tried this variant yet. I guess for long filter chains the > > > > > GC cost of the tokenizer making the initial token should go down as > > > > > overall part of the time. Though I think we should still re-use the > > > > > initial token since it should (?) only help. > > > > > > > > If it weren't any slower, that would be great... but I worry about > > > > filters that need buffering (either on the input side or the output > > > > side) and how that interacts with filters that try and reuse. > > > > > > OK I will tease out this effect & measure performance impact. > > > > > > This would mean that the tokenizer must not only produce new Token > > > instance for each term but also cannot re-use the underlying char[] > > > buffer in that token, right? > > > > If the tokenizer can actually change the contents of the char[], then > > yes, it seems like when next() is called rather than next(Token), a > > new char[] would need to be allocated. > > Right. So I'm now testing "reuse all" vs "tokenizer makes a full copy > but filters get to re-use it".
OK, I tested this case where CharTokenizer makes a new Token (and a new
char[] array) for every token instead of re-using each one. This way is 6%
slower than fully re-using the Token (585 sec -> 618 sec), using the same
test as described in https://issues.apache.org/jira/browse/LUCENE-969.

> > > EG with mods for CharTokenizer I re-use its "char[] buffer" with
> > > every Token, but I'll change that to be a new buffer for each Token
> > > for this test.
> >
> > It's not just for a test, right? If next() is called, it can't reuse
> > the char[]. And there is no getting around the fact that some
> > tokenizers will need to call next() because of buffering.
>
> Correct -- the way I'm doing this now is in TokenStream.java: I have a
> default "Token next()" which calls "next(Token result)" but makes a
> complete copy before returning it. This keeps full backwards
> compatibility even in the case where a consumer wants a private copy
> (calls next()) but the provider only provides the "re-use" API
> (next(Token result)).
>
> Mike
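In case the shape of that default next() isn't obvious, here is a rough
sketch -- not the actual TokenStream.java from the patch. In particular,
the no-argument Token constructor and the copy via the existing
String-based accessors are assumptions of mine; the real code presumably
copies the char[] term buffer and any other new fields too:

import java.io.IOException;
import org.apache.lucene.analysis.Token;

// Sketch only (hence the name): in the real class next(Token result)
// cannot be abstract, since existing streams implement only the old next().
public abstract class SketchTokenStream {

  /** Re-use API: the producer may overwrite 'result' and return it,
   *  avoiding a new Token allocation per term. */
  public abstract Token next(Token result) throws IOException;

  /** Old API: the default delegates to the re-use API but hands back a
   *  complete private copy, so a consumer that calls next() still gets a
   *  Token the stream will never touch again. */
  public Token next() throws IOException {
    Token shared = next(new Token());            // assumed no-arg constructor
    if (shared == null) {
      return null;                               // end of stream
    }
    Token copy = new Token(shared.termText(), shared.startOffset(),
                           shared.endOffset(), shared.type());
    copy.setPositionIncrement(shared.getPositionIncrement());
    return copy;
  }
}

The reverse case (a consumer calls next(Token result) but the provider only
implements the old next()) needs a similar fallback on the next(Token
result) side, which is why it can't literally stay abstract as in the
sketch.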