Nadav Har'El wrote:

Recently an index I've been building passed the 2 GB mark, and after I
optimize()d it into one segment over 2 GB, it stopped working.

Nadav, which platform did you hit this on? I think I've created a > 2 GB index on 32-bit WinXP just fine. How many platforms are really affected by this?

Apparently, this is a known problem (on 32-bit JVMs), and it is mentioned in the
FAQ (http://wiki.apache.org/lucene-java/LuceneFAQ), under the question "Is there
a way to limit the size of an index".

My first problem is that this FAQ entry looks to me like it is giving
outdated advice. My second problem is that we are documenting a bug instead
of fixing it.

The first thing the FAQ does is recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: first, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

And a third problem: that limit applies to the input segments of a merge, not to the output segment. So the example given in the FAQ of setting maxMergeDocs to 7M is very likely too high, because if you merge 10 segments, each with < 7M docs, you can easily end up with a resulting segment > 2 GB.
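For reference, here is a rough sketch of what that first workaround looks like, assuming the 2.3-era IndexWriter API; the index path, analyzer and the 7M figure (taken from the FAQ's example) are placeholders, not a recommendation:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class MaxMergeDocsSketch {
      public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder; 'true' creates a new index.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);

        // Caps the doc count of segments fed *into* a merge, not the size
        // of the merged output: merging mergeFactor segments that are each
        // under the cap can still yield a segment well over 2 GB, and
        // optimize() ignores the cap entirely.
        writer.setMaxMergeDocs(7000000);

        // ... add documents here ...
        writer.close();
      }
    }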

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and, in my opinion, should be unnecessary (since we have the concept of segments,
why do we need separate indices in that case?).

The third option, labeled the "optimal solution", is to write a new
FSDirectory implementation that represents files over 2 GB as several
files, split at the 2 GB mark. But has anyone ever implemented this?

I agree these two workarounds sound quite challenging to do in practice...

Does anyone have any experience with the 2 GB problem? Is one of these
recommendations *really* the recommended solution? What about the new
LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be better to use that? Does anybody know if optimize() also obeys this setting? If not,
shouldn't it?

optimize() doesn't obey it, and the same problem (input vs output) applies to maxMergeMB as well.

To make optimize() obey these limits, you would have to write your own MergePolicy.
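If you just want the normal (non-optimize) merges to respect a byte-size cap, something like this would do it. This is only a sketch against the 2.3-era API (no-arg LogByteSizeMergePolicy constructor, limit expressed in MB); the path and analyzer are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;

    public class MaxMergeMBSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);

        // Limit the segments *considered* for a normal merge to ~1 GB each.
        // This still bounds the inputs, not the merged output, and
        // optimize() does not honor it; making optimize() respect a size
        // limit means writing your own MergePolicy.
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMaxMergeMB(1024.0);
        writer.setMergePolicy(mergePolicy);

        writer.close();
      }
    }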

In short, I'd like to understand the "best practices" of solving the 2 GB
problem, and improve the FAQ in this regard.

Moreover, I wonder: instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then segments larger than 1 GB would never be merged with anything else. Some users (with multi-gigabyte indices on a 64-bit CPU) may not like this default, but they can change it - at least with
this default, Lucene's behavior would be correct on all CPUs and JVMs.

I think we should understand how widespread this really is in our user base. If it's a minority being affected by it, I think the current defaults are correct (and it's this minority that should change Lucene's settings so it doesn't produce too large a segment).

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up deleted documents from segments. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never get cleaned up from it? This can be a serious problem for huge dynamic indices (e.g., imagine a crawl of the Web
or some large intranet).

Right, the deletes will not be cleaned up. But you can use expungeDeletes()? Or, make a MergePolicy that favors merges that would clean up deletes.
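A sketch of the expungeDeletes() route, against the 2.4-era API where that call exists; the index path, field name and value are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class ExpungeDeletesSketch {
      public static void main(String[] args) throws Exception {
        // Open an existing index ('false' = don't create a new one).
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), false);

        // Delete some documents; "id"/"42" are placeholder field/value.
        writer.deleteDocuments(new Term("id", "42"));

        // Merges the segments that contain deletions so the deleted docs
        // are reclaimed, without forcing a full optimize() of the index.
        writer.expungeDeletes();
        writer.close();
      }
    }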

Mike
