"Michael McCandless" <[EMAIL PROTECTED]> wrote: > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > On 7/25/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > > > > > On 7/24/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > > > > > OK, I ran some benchmarks here. > > > > > > > > > > > > > > The performance gains are sizable: 12.8% speedup using Sun's JDK > > > > > > > 5 and > > > > > > > 17.2% speedup using Sun's JDK 6, on Linux. This is indexing all > > > > > > > Wikipedia content using LowerCaseTokenizer + StopFilter + > > > > > > > PorterStemFilter. I think it's worth pursuing! > > > > > > > > > > > > Did you try it w/o token reuse (reuse tokens only when mutating, not > > > > > > when creating new tokens from the tokenizer)? > > > > > > > > > > I haven't tried this variant yet. I guess for long filter chains the > > > > > GC cost of the tokenizer making the initial token should go down as > > > > > overall part of the time. Though I think we should still re-use the > > > > > initial token since it should (?) only help. > > > > > > > > If it weren't any slower, that would be great... but I worry about > > > > filters that need buffering (either on the input side or the output > > > > side) and how that interacts with filters that try and reuse. > > > > > > OK I will tease out this effect & measure performance impact. > > > > > > This would mean that the tokenizer must not only produce new Token > > > instance for each term but also cannot re-use the underlying char[] > > > buffer in that token, right? > > > > If the tokenizer can actually change the contents of the char[], then > > yes, it seems like when next() is called rather than next(Token), a > > new char[] would need to be allocated. > > Right. So I'm now testing "reuse all" vs "tokenizer makes a full copy > but filters get to re-use it".
OK, I tested this case where CharTokenizer makes a new Token (and a new
char[] array) for every token instead of re-using each one. This way is 6%
slower than fully re-using the Token (585 sec -> 618 sec), using the same
test as described in https://issues.apache.org/jira/browse/LUCENE-969.

> > > EG with mods for CharTokenizer I re-use its "char[] buffer" with
> > > every Token, but I'll change that to be a new buffer for each Token
> > > for this test.
> >
> > It's not just for a test, right? If next() is called, it can't reuse
> > the char[]. And there is no getting around the fact that some
> > tokenizers will need to call next() because of buffering.
>
> Correct -- the way I'm doing this now is in TokenStream.java: I have a
> default "Token next()" which calls "next(Token result)" but makes a
> complete copy before returning it. This keeps full backwards
> compatibility even in the case where a consumer wants a private copy
> (calls next()) but the provider only provides the "re-use" API
> (next(Token result)).
>
> Mike
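In case the shape of that default next() isn't obvious, here is a rough
sketch -- not the actual TokenStream.java from the patch. In particular,
the no-argument Token constructor and the copy via the existing
String-based accessors are assumptions of mine; the real code presumably
copies the char[] term buffer and any other new fields too:

import java.io.IOException;
import org.apache.lucene.analysis.Token;

// Sketch only (hence the name): in the real class next(Token result)
// cannot be abstract, since existing streams implement only the old next().
public abstract class SketchTokenStream {

  /** Re-use API: the producer may overwrite 'result' and return it,
   *  avoiding a new Token allocation per term. */
  public abstract Token next(Token result) throws IOException;

  /** Old API: the default delegates to the re-use API but hands back a
   *  complete private copy, so a consumer that calls next() still gets a
   *  Token the stream will never touch again. */
  public Token next() throws IOException {
    Token shared = next(new Token());            // assumed no-arg constructor
    if (shared == null) {
      return null;                               // end of stream
    }
    Token copy = new Token(shared.termText(), shared.startOffset(),
                           shared.endOffset(), shared.type());
    copy.setPositionIncrement(shared.getPositionIncrement());
    return copy;
  }
}

The reverse case (a consumer calls next(Token result) but the provider only
implements the old next()) needs a similar fallback on the next(Token
result) side, which is why it can't literally stay abstract as in the
sketch.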