Nadav Har'El wrote:

Recently an index I've been building passed the 2 GB mark, and after I
optimize()d it into one segment over 2 GB, it stopped working.

Nadav, which platform did you hit this on? I think I've created a > 2 GB index on 32-bit WinXP just fine. How many platforms are really affected by this?

Apparently, this is a known problem (on 32-bit JVMs), and it is mentioned in the
FAQ (http://wiki.apache.org/lucene-java/LuceneFAQ), under the question "Is there
a way to limit the size of an index".

My first problem is that this FAQ entry looks to me like it is giving
outdated advice. My second problem is that we are documenting a bug instead
of fixing it.

The first thing the FAQ does is recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: first, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

And a third problem: that limit applies to the input segments of a merge, not to the output segment. So the example given in the FAQ of setting maxMergeDocs to 7M is very likely too high, because if you merge 10 segments, each with < 7M docs, you can easily end up with a resulting segment > 2 GB.
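For reference, here is a rough sketch of what that first workaround looks like, assuming the 2.3-era IndexWriter API; the index path, analyzer and the 7M figure (taken from the FAQ's example) are placeholders, not a recommendation:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class MaxMergeDocsSketch {
      public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder; 'true' creates a new index.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);

        // Caps the doc count of segments fed *into* a merge, not the size
        // of the merged output: merging mergeFactor segments that are each
        // under the cap can still yield a segment well over 2 GB, and
        // optimize() ignores the cap entirely.
        writer.setMaxMergeDocs(7000000);

        // ... add documents here ...
        writer.close();
      }
    }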

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and, in my opinion, should be unnecessary (since we have the concept of segments,
why do we need separate indices in that case?).

The third option, labeled the "optimal solution", is to write a new
FSDirectory implementation that represents files over 2 GB as several
files, split at the 2 GB mark. But has anyone ever implemented this?

I agree these two workarounds sound quite challenging to do in practice...

Does anyone have any experience with the 2 GB problem? Is one of these
recommendations *really* the recommended solution? What about the new
LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be better to use that? Does anybody know if optimize() also obeys this setting? If not,
shouldn't it?

optimize() doesn't obey it, and the same problem (input vs output) applies to maxMergeMB as well.

To make optimize() obey these limits, you would have to write your own MergePolicy.
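If you just want the normal (non-optimize) merges to respect a byte-size cap, something like this would do it. This is only a sketch against the 2.3-era API (no-arg LogByteSizeMergePolicy constructor, limit expressed in MB); the path and analyzer are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;

    public class MaxMergeMBSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);

        // Limit the segments *considered* for a normal merge to ~1 GB each.
        // This still bounds the inputs, not the merged output, and
        // optimize() does not honor it; making optimize() respect a size
        // limit means writing your own MergePolicy.
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMaxMergeMB(1024.0);
        writer.setMergePolicy(mergePolicy);

        writer.close();
      }
    }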

In short, I'd like to understand the "best practices" of solving the 2 GB
problem, and improve the FAQ in this regard.

Moreover, I wonder: instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then segments larger than 1 GB would never be merged with anything else. Some users (with multi-gigabyte indices on a 64-bit CPU) may not like this default, but they can change it - at least with
this default, Lucene's behavior would be correct on all CPUs and JVMs.

I think we should understand how widespread this really is in our user base. If it's a minority being affected by it, I think the current defaults are correct (and it's this minority that should change Lucene's settings so it doesn't produce too large a segment).

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up deleted documents from segments. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never get cleaned up from it? This can be a serious problem for huge dynamic indices (e.g., imagine a crawl of the Web
or some large intranet).

Right, the deletes will not be cleaned up. But you can use expungeDeletes()? Or, make a MergePolicy that favors merges that would clean up deletes.
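A sketch of the expungeDeletes() route, against the 2.4-era API where that call exists; the index path, field name and value are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class ExpungeDeletesSketch {
      public static void main(String[] args) throws Exception {
        // Open an existing index ('false' = don't create a new one).
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), false);

        // Delete some documents; "id"/"42" are placeholder field/value.
        writer.deleteDocuments(new Term("id", "42"));

        // Merges the segments that contain deletions so the deleted docs
        // are reclaimed, without forcing a full optimize() of the index.
        writer.expungeDeletes();
        writer.close();
      }
    }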

Mike
