I will try tweaking the RAM buffer size and look into autoCommit=false.
Multi-threading through the index writer is on the future agenda. The
indexing time I quoted includes document creation time, which would
definitely improve with multi-threading.
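
For reference, a minimal sketch of those two writer-level tweaks,
assuming the Lucene 2.3 API (the index path and analyzer are just
placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class WriterSetup {
        // Sketch only: autoCommit=false plus a larger RAM buffer, so
        // segments get flushed by memory usage rather than doc count.
        static IndexWriter openWriter() throws Exception {
            FSDirectory dir = FSDirectory.getDirectory("/path/to/index");
            IndexWriter writer =
                new IndexWriter(dir, false /* autoCommit */, new StandardAnalyzer());
            writer.setRAMBufferSizeMB(64.0); // 2.3 default is 16 MB
            return writer;
        }
    }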

I'm doing batch updates of up to 1000 documents a pop, closing and
re-opening the IndexWriter between batches.
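
Roughly, the batch pattern looks like the sketch below (hypothetical
"id" field and Document list; not the actual code):

    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    public class BatchIndexer {
        // Upsert one batch of up to ~1000 docs; updateDocument is a
        // delete-by-term followed by an add, so a re-indexed record
        // replaces the previous copy instead of duplicating it.
        static void indexBatch(Directory dir, Analyzer analyzer,
                               List<Document> batch) throws Exception {
            IndexWriter writer = new IndexWriter(dir, false, analyzer);
            for (Document doc : batch) {
                writer.updateDocument(new Term("id", doc.get("id")), doc);
            }
            // Closed and re-opened between batches, as described above;
            // keeping one writer open across batches would avoid the
            // re-open cost.
            writer.close();
        }
    }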

Not using term vectors on any fields, and not using compression either.
I did change the tokenizers to be re-used instead of instantiating a new
one each time.
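
The reuse is roughly along the lines of the Lucene 2.3
reusableTokenStream hook (sketch only, with WhitespaceTokenizer standing
in for our custom tokenizers):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class ReusingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new WhitespaceTokenizer(reader);
        }

        // Reuse one cached tokenizer per thread instead of constructing
        // a new one for every field/document.
        public TokenStream reusableTokenStream(String fieldName, Reader reader)
                throws IOException {
            WhitespaceTokenizer tokenizer =
                (WhitespaceTokenizer) getPreviousTokenStream();
            if (tokenizer == null) {
                tokenizer = new WhitespaceTokenizer(reader);
                setPreviousTokenStream(tokenizer);
            } else {
                tokenizer.reset(reader);
            }
            return tokenizer;
        }
    }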

About half of the fields are stored in the index. Most are small, but
unfortunately the largest one is stored, which uses a lot of memory and
probably takes additional time to write. The only reason it's stored is
so a snippet can be returned from it. Eventually I'd like to get rid of
that and return snippets as tokens from the token stream. I'm guessing
it might be OK to return analyzed data as a snippet, given it would save
a lot on index size, which in turn would speed up copy time when
swapping between the search and update indexes.
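
To illustrate the direction (hypothetical field names, sketch only):
the small fields stay stored, and the big one gets indexed without
being stored once snippets no longer have to be read back from it.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocBuilder {
        static Document build(String id, String title, String body) {
            Document doc = new Document();
            doc.add(new Field("id", id,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("title", title,
                    Field.Store.YES, Field.Index.TOKENIZED));
            // Large field: searchable but no longer stored.
            doc.add(new Field("body", body,
                    Field.Store.NO, Field.Index.TOKENIZED));
            // Field.Store.COMPRESS would be a middle ground if the body
            // had to stay stored.
            return doc;
        }
    }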

-----Original Message-----
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 09, 2008 10:38 AM
To: java-user@lucene.apache.org
Subject: Re: performance feedback


This is great to hear!

If you tweak things a bit (increase RAM buffer size, use  
autoCommit=false, use threads, etc) you should be able to eke out some  
more gains...

Are you storing fields & using term vectors on any of your fields?

Mike

Beard, Brian wrote:

>
> I just did an update from lucene 2.2.0 to 2.3.2 and thought I'd give
> some kudos for the indexing performance enhancements.
>
> The Lucene indexing portion is about 6-8 times faster. Previously we
> were doing ~60-120 documents per second; now we're between 400-1000,
> depending on the type of document, its size, and how many fields there
> are. I haven't done a formal side-by-side comparison, but it's
> certainly substantially faster.
>
> The gain would have been equal to the 8-10x in the readme, but using
> custom tokenizers slows things down a little vs. the standard one. At
> first I didn't realize I was using reusableTokenFilter, which bypassed
> the custom tokenizers and gave the 8-10x improvement. Maybe there's
> some more gain to be had that I can pursue.
>
> We index from 5 to 20 fields per document (serially, through
> IndexWriter). Most documents are 3-5K in total size, but that can vary
> quite a bit. Total index size (eventually) will be ~15G.
>
> Thanks,
>
> Brian Beard
>

