Hi,

Recently an index I've been building passed the 2 GB mark, and after I ran optimize() and it became a single segment over 2 GB, it stopped working.

Apparently this is a known problem (on 32 bit JVMs), and it is mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ, under the question "Is there a way to limit the size of an index". My first problem is that this FAQ entry appears to give outdated advice. My second problem is that we document a bug instead of fixing it. (To keep the discussion concrete, I've put rough code sketches of the options below in a P.S. at the end of this mail.)

The first thing the FAQ recommends is IndexWriter.setMaxMergeDocs(). This solution has two serious problems: first, one normally doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good!

The second solution the FAQ recommends (using MultiSearcher) is unwieldy and, in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?).

The third option, labeled the "optimal solution", is to write a new FSDirectory implementation that represents files over 2 GB as several files, split at the 2 GB mark. But has anyone ever implemented this?

Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB setting - wouldn't it be better to use that? Does anybody know if optimize() also obeys this limit? If not, shouldn't it? In short, I'd like to understand the "best practices" for solving the 2 GB problem, and improve the FAQ in this regard.

Moreover, I wonder: instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then segments larger than 1 GB would never be merged with anything else. Some users (with multi-gigabyte indices on 64 bit CPUs) may not like this default, but they can change it - at least with this default, Lucene's behavior would be correct on all CPUs and JVMs.

I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an opportunity to clean up deleted documents from segments. If some segment is bigger than the maximum size and is never merged again, does this also mean deleted documents will never ever get cleaned out of it? This could be a serious problem for huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet).

Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs are still quite common, so I think this is a problem we should solve properly.

Thanks,
Nadav.

--
Nadav Har'El                        | Wednesday, Jun 25 2008, 22 Sivan 5768
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps
http://nadav.harel.org.il           |minutes and wastes hours.
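
P.S. The promised sketches, so we're all talking about the same APIs. These are untested, written from memory against the 2.3 API, with made-up paths and numbers - take them with a grain of salt. First, the FAQ's setMaxMergeDocs() suggestion; note the guessed document count, which is exactly the first problem I described:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class MaxMergeDocsSketch {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),  // made-up path
            new StandardAnalyzer(), true);
        // 1,000,000 is a pure guess: there's no way to know in advance how
        // many documents fit under 2 GB.
        writer.setMaxMergeDocs(1000000);
        // ... addDocument() calls ...
        // writer.optimize();  // and optimize() appears to ignore the cap anyway
        writer.close();
      }
    }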
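The MultiSearcher approach the FAQ suggests would look roughly like the following. The application has to decide when each sub-index is "full", roll over to a new one, and keep the list of parts in sync - that's the bookkeeping I called unwieldy:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;

    public class MultiSearcherSketch {
      public static void main(String[] args) throws IOException {
        // Each part must be kept under 2 GB by the application itself.
        Searchable[] parts = new Searchable[] {
            new IndexSearcher("/indexes/part1"),  // made-up paths
            new IndexSearcher("/indexes/part2"),
        };
        Searcher searcher = new MultiSearcher(parts);
        // ... searcher.search(query, ...) across all parts ...
        searcher.close();
      }
    }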
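And the LogByteSizeMergePolicy variant I was asking about, again assuming I'm reading the current API correctly:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.FSDirectory;

    public class MaxMergeMBSketch {
      public static void main(String[] args) throws IOException {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),  // made-up path
            new StandardAnalyzer(), true);
        // Cap merged segments by size rather than by a guessed doc count.
        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMaxMergeMB(1024);  // the value I'd propose as the default
        writer.setMergePolicy(policy);
        // ... add documents; open question: does optimize() honor this limit? ...
        writer.close();
      }
    }

Of course, if DEFAULT_MAX_MERGE_MB were 1024 to begin with, nobody would need to write this last snippet at all - which is my point about fixing the default rather than documenting around it.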
Apparently, this is a known problem (on 32 bit JVMs), and mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ question "Is there a way to limit the size of an index". My first problem is that it looks to me like this FAQ entry is passing outdated advice. My second second problem is that we document a bug, instead of fixing it. The first thing the FAQ does is to recommend IndexWriter.setMaxMergeDocs(). This solution has two serious problems: First, normally one doesn't know how many documents one can index before reaching 2 GB, and second, a call to optimize() appears to ignore this setting and merge everything again - no good! The second solution the FAQ recommends (using MultiSearcher) is unwieldy and in my opinion, should be unnecessary (since we have the concept of segments, why do we need separate indices in that case?). The third option labeled the "optimal solution" is to write a new FSDirectory implementation that represents files over 2 GB as several files, broken on the 2 GB mark. But has anyone ever implemented this? Does anyone have any experience with the 2 GB problem? Is one of these recommendations *really* the recommended solution? What about the new LogByteSizeMergePolicy and its setMaxMergeMB argument - wouldn't it be better to use that? Does anybody know if optimize() also obeys this flag? If not, shouldn't it? In short, I'd like to understand the "best practices" of solving the 2 GB problem, and improve the FAQ in this regard. Moreover, I wonder, instead of documenting around the problem, should we perhaps make the default behavior more correct? In other words, imagine that we set LogByteSizeMergePolicy.DEFAULT_MAX_MERGE_MB to 1024 (or 1023, to be on the safe side?). Then, segments larger than 1 GB will never be merged with anything else. Some users (with multi-gigabyte indices on a 64 bit CPU) may not like this default, but they can change it - at least with this default Lucene's behavior will be correct on all CPUs and JVMs. I have one last question that I wonder if anyone can answer before I start digging into the code. We use merges not just for merging segments, but also as an oportunity to clean up segments from deleted documents. If some segment is bigger than the maximum and is never merged again, does this also mean deleted documents will never ever get cleaned up from it? This can be a serious problem on huge dynamic indices (e.g., imagine a crawl of the Web or some large intranet). Nowadays, 2 GB indices are less rare than they used to be, and 32 bit JVMs are still quite common, so I think this is a problem we should solve properly. Thanks, Nadav. -- Nadav Har'El | Wednesday, Jun 25 2008, 22 Sivan 5768 [EMAIL PROTECTED] |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |Committee: A group of people that keeps http://nadav.harel.org.il |minutes and wastes hours. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]