Hi,

Recently an index I've been building passed the 2 GB mark, and after I ran optimize() and it became a single segment over 2 GB, it stopped working.

Apparently this is a known problem (on 32 bit JVMs), and it is mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ, under the question "Is there a way to limit the size of an index". My first problem is that this FAQ entry appears to give outdated advice. My second problem is that we document a bug instead of fixing it. (To keep the discussion concrete, I've put rough code sketches of the options below in a P.S. at the end of this mail.)

The first thing the FAQ recommends is IndexWriter.setMaxMergeDocs(). This solution has two serious problems: first, one normally doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and, in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?).

The third option, labeled the "optimal solution", is to write a new FSDirectory implementation that represents files over 2 GB as several files, split at the 2 GB mark. But has anyone ever implemented this?

Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be better to use that? Does anybody know if optimize() also obeys this limit? If not, shouldn't it? In short, I'd like to understand the "best practices" for solving the 2 GB problem, and improve the FAQ in this regard.

Moreover, I wonder: instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then segments larger than 1 GB would never be merged with anything else. Some users (with multi-gigabyte indices on 64 bit CPUs) may not like this default, but they can change it - at least with this default, Lucene's behavior would be correct on all CPUs and JVMs.

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up deleted documents from segments. If some segment is bigger than the maximum size and is never merged again, does this also mean deleted documents will never ever get cleaned out of it? This could be a serious problem for huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet).

Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs are still quite common, so I think this is a problem we should solve properly.

Thanks,
Nadav.

--
Nadav Har'El                        | Wednesday, Jun 25 2008, 22 Sivan 5768
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps
http://nadav.harel.org.il           |minutes and wastes hours.
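
P.S. The promised sketches, so we're all talking about the same APIs. These are untested, written from memory against the 2.3 API, with made-up paths and numbers - take them with a grain of salt. First, the FAQ's setMaxMergeDocs() suggestion; note the guessed document count, which is exactly the first problem I described:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class MaxMergeDocsSketch {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),  // made-up path
            new StandardAnalyzer(), true);
        // 1,000,000 is a pure guess: there's no way to know in advance how
        // many documents fit under 2 GB.
        writer.setMaxMergeDocs(1000000);
        // ... addDocument() calls ...
        // writer.optimize();  // and optimize() appears to ignore the cap anyway
        writer.close();
      }
    }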
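The MultiSearcher approach the FAQ suggests would look roughly like the following. The application has to decide when each sub-index is "full", roll over to a new one, and keep the list of parts in sync - that's the bookkeeping I called unwieldy:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;

    public class MultiSearcherSketch {
      public static void main(String[] args) throws IOException {
        // Each part must be kept under 2 GB by the application itself.
        Searchable[] parts = new Searchable[] {
            new IndexSearcher("/indexes/part1"),  // made-up paths
            new IndexSearcher("/indexes/part2"),
        };
        Searcher searcher = new MultiSearcher(parts);
        // ... searcher.search(query, ...) across all parts ...
        searcher.close();
      }
    }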
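And the LogByteSizeMergePolicy variant I was asking about, again assuming I'm reading the current API correctly:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class MaxMergeMBSketch {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),  // made-up path
            new StandardAnalyzer(), true);
        // Cap merged segments by size rather than by a guessed doc count.
        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMaxMergeMB(1024);  // the value I'd propose as the default
        writer.setMergePolicy(policy);
        // ... add documents; open question: does optimize() honor this limit? ...
        writer.close();
      }
    }

Of course, if DEFAULT_MAX_MERGE_MB were 1024 to begin with, nobody would need to write this last snippet at all - which is my point about fixing the default rather than documenting around it.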
Apparently, this is a known problem (on 32 bit JVMs), and mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ question "Is there a way to limit the size of an index". My first problem is that it looks to me like this FAQ entry is passing outdated advice. My second second problem is that we document a bug, instead of fixing it. The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: First, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good! The second solution the FAQ recommends (using MultiSearcher) is unwieldy and in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?). The third option labeled the "optimal solution" is to write a new FSDirectory implementation that represents files over 2 GB as several files, broken on the 2 GB mark. But has anyone ever implemented this? Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better to use that? Does anybody know if optimize() also obeys this flag? If not, shouldn't it? In short, I'd like to understand the "best practices" of solving the 2 GB problem, and improve the FAQ in this regard. Moreover, I wonder, instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then, segments larger than 1 GB will never be merged with anything else. Some users (with multi-gigabyte indices on a 64 bit CPU) may not like this default, but they can change it - at least with this default Lucene's behavior will be correct on all CPUs and JVMs. I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an oportunity to clean up segments from deleted documents. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never ever get cleaned up from it? This can be a serious problem on huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet). Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs are still quite common, so I think this is a problem we should solve properly. Thanks, Nadav. -- Nadav Har'El | Wednesday, Jun 25 2008, 22 Sivan 5768 [EMAIL PROTECTED] |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps http://nadav.harel.org.il |minutes and wastes hours. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]