Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Dan Quaroni
Hey there.  What's the fastest way to do a batch index with Lucene 1.3-rc1
on a dual or quad-processor box?  The files I'm indexing are very easy to
divide among multiple threads.

Here's what I've done at this point:

Each thread has its own IndexWriter writing to its own RAMDirectory.  After
a fixed number of documents, I merge the thread's index into the main disk
index.

The thread writers have a mergeFactor of 50.
The disk IndexWriter has a mergeFactor of 30.
I call optimize only on the main disk index, and only once at the very end.
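
As a rough illustration of the setup described above, here is a minimal sketch in plain Java. It is not Lucene code: ordinary maps stand in for the per-thread RAMDirectory and for the main disk index, and the names BatchIndexSketch, indexSlice, merge and flushEvery are hypothetical. It only shows the shape of the pattern: each worker fills a private buffer and merges it into the shared index after a fixed number of documents.

```java
import java.util.*;
import java.util.concurrent.*;

// Stand-in for the pattern only: plain maps play the role of the per-thread
// RAMDirectory and of the main disk index. None of this is Lucene API.
class BatchIndexSketch {
    // term -> ids of the documents containing it (the shared "main" index)
    private final Map<String, List<Integer>> main = new HashMap<>();

    // Merge one worker's buffer into the main index; only one merge runs at a time.
    synchronized void merge(Map<String, List<Integer>> buffer) {
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
            main.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
        }
    }

    // Return the sorted document ids for a term (empty list if the term is unknown).
    synchronized List<Integer> lookup(String term) {
        List<Integer> ids = new ArrayList<>(main.getOrDefault(term, Collections.emptyList()));
        Collections.sort(ids);
        return ids;
    }

    // Index docs[from..to) into a private buffer, merging every flushEvery documents.
    void indexSlice(String[] docs, int from, int to, int flushEvery) {
        Map<String, List<Integer>> buffer = new HashMap<>();
        int buffered = 0;
        for (int id = from; id < to; id++) {
            Set<String> terms = new HashSet<>(Arrays.asList(docs[id].toLowerCase().split("\\s+")));
            for (String term : terms) {
                buffer.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
            if (++buffered == flushEvery) {
                merge(buffer);
                buffer = new HashMap<>();
                buffered = 0;
            }
        }
        if (buffered > 0) merge(buffer); // flush whatever is left at the end
    }

    // Split the documents into one contiguous slice per thread and index them concurrently.
    static BatchIndexSketch build(String[] docs, int threads, int flushEvery) {
        BatchIndexSketch index = new BatchIndexSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int slice = (docs.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            int from = Math.min(docs.length, t * slice);
            int to = Math.min(docs.length, from + slice);
            pool.execute(() -> index.indexSlice(docs, from, to, flushEvery));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return index;
    }
}
```

In this sketch the synchronized merge is the point where all workers queue up, which is the step the questions below are about.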

Just doing this has shown great improvements for me, but I want to squeeze
out every bit of performance I can.  What's the fastest way to merge indexes?
Should I use a low mergeFactor when working with RAMDirectory instances?
Should I optimize the thread index before I merge it into the main one?

Thanks!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Dan Quaroni
Looks like I spoke too soon... As the index gets larger, the time to merge
becomes prohibitively high.  It appears to increase linearly.

Oh well.  I guess I'll just have to go with about 3ms/doc.




Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Doug Cutting
As the index grows, disk i/o becomes the bottleneck.  The default
indexing parameters do a pretty good job of optimizing this.  But if you
have lots of CPUs and lots of disks, you might try building several
indexes in parallel, each containing a subset of the documents, then
optimizing each index and finally merging them all into a single index at
the end.  But you need lots of i/o capacity for this to pay off.
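
The approach described above can be sketched roughly as follows. This is plain Java rather than Lucene: sorted maps stand in for the separate on-disk indexes, and ParallelSubIndexSketch, buildPart and mergeAll are hypothetical names. The point it illustrates is that the workers share nothing while building, and the only merge happens once, after every sub-index is complete.

```java
import java.util.*;
import java.util.concurrent.*;

// Stand-in for the pattern only: each sorted map plays the role of one
// independently built index. None of this is Lucene API.
class ParallelSubIndexSketch {
    // Build one sub-index over docs[from..to); no state is shared with other workers.
    static SortedMap<String, List<Integer>> buildPart(String[] docs, int from, int to) {
        SortedMap<String, List<Integer>> part = new TreeMap<>();
        for (int id = from; id < to; id++) {
            Set<String> terms = new TreeSet<>(Arrays.asList(docs[id].toLowerCase().split("\\s+")));
            for (String term : terms) {
                part.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
            }
        }
        return part;
    }

    // Single merge pass at the very end, in slice order so document ids stay ascending.
    static SortedMap<String, List<Integer>> mergeAll(List<SortedMap<String, List<Integer>>> parts) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, List<Integer>> part : parts) {
            part.forEach((term, ids) ->
                merged.computeIfAbsent(term, k -> new ArrayList<>()).addAll(ids));
        }
        return merged;
    }

    // Build one sub-index per worker in parallel, then merge them all once.
    static SortedMap<String, List<Integer>> build(String[] docs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<SortedMap<String, List<Integer>>>> futures = new ArrayList<>();
            int slice = (docs.length + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                int from = Math.min(docs.length, w * slice);
                int to = Math.min(docs.length, from + slice);
                Callable<SortedMap<String, List<Integer>>> task = () -> buildPart(docs, from, to);
                futures.add(pool.submit(task));
            }
            List<SortedMap<String, List<Integer>>> parts = new ArrayList<>();
            for (Future<SortedMap<String, List<Integer>>> f : futures) {
                parts.add(f.get()); // wait for every sub-index before merging
            }
            return mergeAll(parts);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike a real index build, this toy version does no disk i/o, so it says nothing about whether the approach pays off on a given machine.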

Doug



Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Leo Galambos
Wouldn't it be better for Dan to skip the optimization phase before merging?
I am not sure, but he could save some time that way (if he has enough file
handles for it, of course). What strategy do you use in Nutch?

THX

-g-
