Re: commercial websites powered by Lucene?

Ulrich Mayring Tue, 24 Jun 2003 03:43:31 -0700

Chris Miller wrote:


> The main thing I'm interested in is how you handle updates to Lucene's
> index. I'd imagine you have a fairly high turnover of CVs and jobs, so index
> updates must place a reasonable load on the CPU/disk. Do you keep CVs and
> jobs in the same index or two different ones? And what is the process you
> use to update the index(es) - do you batch-process updates or do you handle
> them in real-time as changes are made?

The way we do it: we periodically re-index everything into a temporary directory and then rename that directory so it takes the place of the live index. That way the index remains accessible at all times, and how current it is depends simply on how often I run the re-indexing.
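A minimal sketch of that rebuild-and-rename cycle, in Python rather than Java so it stays independent of any Lucene version. The names `rebuild_and_swap`, `build_index` and the `.old` suffix are hypothetical, and `build_index` stands in for whatever actually writes the index files. It assumes a POSIX-style filesystem where a directory can be renamed while readers still have files open; searchers that were opened before the swap keep reading the old files until they are reopened, and there is a very short gap between the two renames.

```python
import os
import shutil
import tempfile


def rebuild_and_swap(live_dir, build_index):
    """Build a fresh index beside live_dir, then move it into place by renaming.

    build_index is a hypothetical callback that writes a complete index
    into the directory path it is given.
    """
    live_dir = os.path.abspath(live_dir)
    parent = os.path.dirname(live_dir)

    # Build in a sibling directory so the final rename stays on one filesystem.
    tmp_dir = tempfile.mkdtemp(prefix="index-tmp-", dir=parent)
    try:
        build_index(tmp_dir)
    except Exception:
        # A crash during indexing never touches the live index.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise

    old_dir = live_dir + ".old"
    if os.path.exists(old_dir):
        shutil.rmtree(old_dir)

    # Move the current index aside, then put the new one under the live name.
    if os.path.exists(live_dir):
        os.rename(live_dir, old_dir)
    os.rename(tmp_dir, live_dir)

    # Drop the previous generation once the new one is in place.
    shutil.rmtree(old_dir, ignore_errors=True)
```

Run from cron or a timer, the interval between calls is what sets how stale the live index can get.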

> We need to be able to handle indexing about 60,000 documents/day,
> while allowing (many) searches to continue operating alongside.

On an entry-level Sun I can index about 23 documents per second, and these are real-life HTML pages. At that rate a complete index run over your 60,000 documents would finish in less than one hour, and you would save yourself all kinds of trouble with crashes during indexing etc.

On my 2 GHz Linux workstation it's even faster: more than 2000 documents per minute, so you'd be done in half an hour.
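For anyone who wants to check the arithmetic behind those two estimates, here is a small sketch. The function name `full_run_minutes` is made up for illustration; the inputs are the 60,000 documents from the question and the two rates quoted above, with the Linux figure taken at exactly 2000 documents per minute as a lower bound on speed.

```python
def full_run_minutes(total_docs, docs_per_minute):
    """Minutes needed to index total_docs at a steady docs_per_minute rate."""
    return total_docs / docs_per_minute


TOTAL_DOCS = 60_000

# Entry-level Sun: about 23 documents per second, i.e. 23 * 60 per minute.
sun_minutes = full_run_minutes(TOTAL_DOCS, 23 * 60)

# 2 GHz Linux workstation: at least 2000 documents per minute.
linux_minutes = full_run_minutes(TOTAL_DOCS, 2000)

print(f"Sun:   {sun_minutes:.1f} minutes")    # 43.5
print(f"Linux: {linux_minutes:.1f} minutes")  # 30.0
```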

BTW, we're not using the supplied JavaCC-based HTML parser; instead we use the one from htmlparser.sourceforge.net, which is a joy to use and pretty fast.

Ulrich

