Re: Lucene Performance Issues

Doug Cutting Tue, 28 Mar 2006 08:38:36 -0800

thomasg wrote:

Hi, we are currently intending to implement a document storage / search tool
using Jackrabbit and Lucene. We have been approached by a commercial search
and indexing organisation called ISYS who are suggesting the following
problems with using Lucene. We do have a requirement to store and search
large documents and the total document store will be large too. Any comments
on the following would be greatly appreciated.

I would tend to be sceptical of claims from someone who is trying tosell you something: they are inherently biased. The best way to answerperformance questions is to benchmark. That way you'll also getexperience with each alternative.

Ease of development, availability of support, ease of debugging, cost,and other aspects besides raw performance may also prove more importantlong-term.

1) By default, Lucene only indexes the first 10,000 words from each
document. When increasing this default out-of-memory errors can occur. This
implies that documents, or large sections thereof, are loaded into memory.
ISYS has a very small memory footprint which is not affected by document
size nor number of documents.

Lucene does store a few bytes per word indexed while indexing. But I'venever seen folks complain about this as a significant consumer ofmemory, even when they crank up the limit. How large are your largedocuments? (Is it really useful to get hits on such large documentsanyway, or might it be more useful to get hits on sections of these?)

2) Lucene appears to be slow at indexing, at least by ISYS' standards.
Published performance benchmarks seem to vary between almost acceptable,
down to very poor. ISYS' file readers are already optimized for the fastest
text extraction possible.

How fast do you need to index? How fast does your collection change?In most systems I've worked on, Lucene's indexing speed has not been abottleneck. For example, when crawling with Nutch, downloading pagesand html parsing are typically much slower than indexing. Lucene doesnot include text extractors, so it's hard to know what ISIS is comparingto there. Also, when benchmarking indexing, one should look at the ratewhen the index is very large, not just when the index is small.

3) The Lucene documentation suggests it can be slow at searching and can get
slower and slower the larger your indexes get. The tipping point is where
the index size exceeds the amount of free memory in your machine. This also
implies that whole indexes, or large portions of them, are loaded into
memory.

Lucene does not normally store entire indexes in memory. Indexes muchlarger than memory can be searched quickly. For very high traffic (tensto hundreds of searches per second) indexes smaller than memory arerequired, since otherwise the disk becomes a bottleneck. My rule ofthumb is that a 10M document index of 10k documents (100G of text)easily gives sub-second average response time for typical queries.

The bigger the index, the more powerful the machine required. ISYS'
search speed is always proportional to the size of the result set.

That sound suspicious to me, unless by "size of result set" they meanthe total number of documents that include any query term, in which casethat is also true of Lucene. But if you're using an "AND" query, thesize of the result set is usually much smaller than that total.

Do they claim they can search billions of documents with a boolean querythat only matches a few documents in time proportional to the size ofthose few matching documents? And all that on a 486 with 1MB of RAM?That sounds like magic!

Index
size does not materially affect search speed and the index is never loaded
into memory.

Index size does not affect Lucene's search speed per-se: what matters isthe frequency of the search terms. And terms tend to have largerfrequencies in larger indexes.

It also appears that Lucene requires hands-on tuning to keep
its search speed acceptable.

I don't recommend tuning Lucene, but rather just using the out-of-thebox defaults. Folks tend to cause problems for themselves trying tomake things 5% faster by setting parameters to extreme values withoutunderstanding the consequences of those parameters. This results inlots of tuning-related messages on the mailing lists, but I am notconvinced many of these are required.

ISYS' indexes are self-managing and do not
require any maintenance to keep them searchable at full speed.


Lucene could make the same claim.

I suspect you can achieve adequate performance with either product. Themore salient difference will be that with Lucene you'd have the fullsource code and mailing lists like this to help you. With ISYS you'dhave a black box, a detailed manual, and a phone number. So it dependson your style of development: do you want to be able to peek inside thesearch-engine, or do you want to be able to pick up the phone?


Cheers,

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene Performance Issues

Reply via email to