Fastest batch indexing with 1.3-rc1
Hey there. What's the fastest way to do a batch index with Lucene 1.3-rc1 on a dual- or quad-processor box? The files I'm indexing are very easy to divide among multiple threads.

Here's what I've done so far:

Each thread has its own IndexWriter writing to its own RAMDirectory. Every so many documents, I merge the thread's index into the main disk index. The thread writers have a mergeFactor of 50. The disk IndexWriter has a mergeFactor of 30. I call optimize only on the main disk index, and only once at the very end.

Just doing this has shown great improvements for me, but I want to squeeze out every bit of performance I can. What's the fastest way to merge indexes? Should I use a low mergeFactor when working with RAMDirectory instances? Should I optimize the thread index before I merge it into the main one?

Thanks!

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Fastest batch indexing with 1.3-rc1
Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
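One way to reason about the observation above is a small cost model. The sketch below is plain arithmetic in Java, not Lucene code. It assumes, hypothetically, that every merge into the main index rewrites everything already on disk plus the new batch, so the cost of a single merge is proportional to the index size at that point. The class and method names are made up for illustration.

```java
final class MergeCostSketch {

    /**
     * Total number of documents rewritten when totalDocs documents are merged
     * into a main index in batches of batchSize, ASSUMING each merge rewrites
     * the whole main index including the new batch.
     */
    static long docsRewritten(long totalDocs, long batchSize) {
        long rewritten = 0;
        long onDisk = 0;
        while (onDisk < totalDocs) {
            long batch = Math.min(batchSize, totalDocs - onDisk);
            onDisk += batch;
            // Under the assumption, this merge writes every document now on disk.
            rewritten += onDisk;
        }
        return rewritten;
    }

    public static void main(String[] args) {
        long total = 1_000_000;
        for (long batch : new long[] {1_000, 10_000, 100_000, total}) {
            System.out.printf("batch=%d rewritten=%d%n", batch, docsRewritten(total, batch));
        }
    }
}
```

Under that assumption, the cost of each merge grows linearly with the index, and the total work grows roughly with the square of the number of merges. Merging less often, in larger batches, reduces the total. Whether the assumption holds for a given setup depends on the merge settings, so treat the numbers as a way to think about the trend rather than a prediction.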
Re: Fastest batch indexing with 1.3-rc1
As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.

Doug

Dan Quaroni wrote:
> Looks like I spoke too soon... As the index gets larger, the time to merge becomes prohibitively high. It appears to increase linearly. Oh well. I guess I'll just have to go with about 3 ms/doc.
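The shape of that suggestion (partition the documents, build one index per partition in parallel, merge once at the end) can be sketched with nothing but the Java standard library. This is a stand-in, not Lucene code: a TreeMap from term to document ids plays the role of an index, and every class and method name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelBuildSketch {

    /** Builds a tiny inverted index (term -> ascending doc ids) for docs[from..to). */
    static Map<String, List<Integer>> buildPartial(String[] docs, int from, int to) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int id = from; id < to; id++) {
            for (String term : docs[id].toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document once per term, even if the term repeats.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                    postings.add(id);
                }
            }
        }
        return index;
    }

    /** Builds one partial index per worker (workers >= 1), then merges them once. */
    static Map<String, List<Integer>> buildInParallel(String[] docs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, List<Integer>>>> parts = new ArrayList<>();
            int chunk = (docs.length + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(docs.length, w * chunk);
                final int to = Math.min(docs.length, from + chunk);
                Callable<Map<String, List<Integer>>> task = () -> buildPartial(docs, from, to);
                parts.add(pool.submit(task));
            }
            // Single merge at the end. Partitions are visited in doc-id order,
            // so each merged postings list stays sorted.
            Map<String, List<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, List<Integer>>> part : parts) {
                for (Map.Entry<String, List<Integer>> e : part.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel build failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] docs = {"fast batch index", "batch merge", "merge index index", "disk"};
        System.out.println(buildInParallel(docs, 2));
    }
}
```

In Lucene terms, each worker would roughly correspond to a separate IndexWriter on its own directory, and the final loop to a single merge into the target index. The sketch leaves out disk I/O entirely, which is the bottleneck named above, so it illustrates the structure only and says nothing about the actual speedup.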
Re: Fastest batch indexing with 1.3-rc1
Isn't it better for Dan to skip the optimization phase before merging? I am not sure, but he could save some time on this (if he has enough file handles for that, of course). What strategy do you use in Nutch?

THX
-g-

Doug Cutting wrote:
> As the index grows, disk I/O becomes the bottleneck. The default indexing parameters do a pretty good job of optimizing this. But if you have lots of CPUs and lots of disks, you might try building several indexes in parallel, each containing a subset of the documents, then optimize each index and finally merge them all into a single index at the end. But you need lots of I/O capacity for this to pay off.