Thanks for your comments Ulrich. I just posted a message asking if anyone had attempted this approach! Sounds like you have, and it works :-) Thanks for the information, this sounds pretty close to what my preferred approach would be.
You say you get 2000 docs/minute. I've done some benchmarking and managed to get our data indexing at ~1000/minute on an Athlon 1800+ (and most of that speed was achieved by bumping IndexWriter.mergeFactor up to 100 or so). Our data comes from a database table; each record contains about 40 fields, and I'm indexing 8 of them (an ID, 4 number fields, and 3 text fields, one of which holds ~2k of text). Does this sound reasonable to you, or do you have any tips that might improve that performance?

"Ulrich Mayring" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Chris Miller wrote:
> >
> > The main thing I'm interested in is how you handle updates to Lucene's
> > index. I'd imagine you have a fairly high turnover of CVs and jobs, so
> > index updates must place a reasonable load on the CPU/disk. Do you keep
> > CVs and jobs in the same index or two different ones? And what is the
> > process you use to update the index(es) - do you batch-process updates
> > or do you handle them in real-time as changes are made?
>
> The way we do it: we re-index everything periodically in a temporary
> directory and then rename the temporary directory. That way the index
> remains accessible at all times and its currency is simply determined by
> the interval I run the re-indexing in.
>
> > We need to be able to handle indexing about 60,000 documents/day,
> > while allowing (many) searches to continue operating alongside.
>
> On an entry-level Sun I can index about 23 documents per second, and
> these are real-life HTML pages. Thus in less than an hour you would be
> finished with a complete index run and save yourself all kinds of
> trouble with crashes during indexing etc.
>
> On my 2 GHz Linux workstation it's even faster: more than 2000 documents
> per minute, so you'd be done in half an hour.
>
> BTW, we're not using the supplied JavaCC-based HTML parser; instead we
> got htmlparser.sourceforge.net, which is a joy to use and pretty fast.
>
> Ulrich
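P.S. For anyone else following along, the "re-index into a temp directory, then rename" trick Ulrich describes boils down to two directory renames. Here's a minimal sketch in plain Java NIO (class and path names are my own invention, not from Ulrich's setup; note Files.move is only cheap and atomic when both paths are on the same filesystem, and the retired `.old` directory needs cleaning up between runs):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexSwapper {

    /**
     * Promote a freshly built index directory to the live location.
     * Searchers keep reading the current index right up to the final
     * rename, and the previous index is kept aside as a fallback.
     * Callers should delete the ".old" directory before the next run.
     */
    public static void promote(Path freshIndex, Path liveIndex) throws IOException {
        Path retired = liveIndex.resolveSibling(liveIndex.getFileName() + ".old");
        if (Files.exists(liveIndex)) {
            Files.move(liveIndex, retired);  // retire the current index
        }
        Files.move(freshIndex, liveIndex);   // promote the new one
    }
}
```

A nice property of this scheme, as Ulrich notes, is that a crash mid-build only loses the temp directory; the live index is never touched until the build has fully succeeded.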