Hello,

I'm trying to merge 12 indexes into one big index using the Lucene IndexMergeTool (command line appended below). The merge seemed to finish successfully, but when I ran CheckIndex on the merged index, I got an array-out-of-bounds error, "java.lang.ArrayIndexOutOfBoundsException: 133157597" (see below for the full error message). I'm wondering whether we are running into some Lucene limit or whether there is some kind of weird bug in the Lucene IndexMergeTool.

The indexed documents contained dirty OCR in up to 400 languages, so there is a huge number of unique terms. We are indexing bigrams plus unigrams, which increases the count even more. Each index prior to the merge had roughly 2 billion unique terms or fewer, so even if no terms were duplicated across indexes, there would be at most about 2 * 12 = 24 billion unique terms. I believe that LUCENE-2257 raised the limit that Lucene can handle to 274 billion.
Appended below is the message from CheckIndex, the command line used to merge the indexes, and the term count line from CheckIndex run on each of the 12 indexes that were later merged.

Tom

CheckIndex error:

Opening index @ bigramsRetest

Segments file=segments_1 numSegments=1 version=3.6 format=FORMAT_3_1 [Lucene 3.1+]
  1 of 1: name=_c docCount=865870
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=309,357.885
    diagnostics = {mergeFactor=12, os.version=2.6.18-308.20.1.el5, os=Linux, lucene.version=3.6-SNAPSHOT exported - tom - 2012-11-06 14:16:41, source=merge, os.arch=amd64, mergeMaxNumSegments=1, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [87 fields]
    test: field norms.........OK [43 fields]
    test: terms, freq, prox...ERROR [133157597]
java.lang.ArrayIndexOutOfBoundsException: 133157597
        at org.apache.lucene.index.TermInfosReaderIndex.compareField(TermInfosReaderIndex.java:249)
        at org.apache.lucene.index.TermInfosReaderIndex.compareTo(TermInfosReaderIndex.java:225)
        at org.apache.lucene.index.TermInfosReaderIndex.getIndexOffset(TermInfosReaderIndex.java:156)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
        at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:66)
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:715)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:578)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)
    test: stored fields.......OK [32361128 total field count; avg 37.374 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:591)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)

WARNING: 1 broken segments (containing 865870 documents) detected
WARNING: would write new segments file, and 865870 documents would be lost, if -fix were specified

----------------------------------------

I merged the 12 indexes using this command, which completed without reporting any errors (names of files/directories shortened):

java -Xms32g -Xmx32g -cp /xxx/3.6/lucene-core-3.6-SNAPSHOT.jar:/xxx/3.6/lucene-misc-3.6-SNAPSHOT.jar org.apache.lucene.misc.IndexMergeTool /xxx/bigramsAll index1 index2 index3 index4 index5 index6 index7 index8 index9 index10 index11 index12

------------------------------------------------------------

CheckIndex lines from the 12 indexes prior to merging (commas added):

checkindex1:  test: terms, freq, prox...OK [2,098,320,125 terms; 7,678,444,394 terms/docs pairs; 24,091,209,315 tokens]
checkindex10: test: terms, freq, prox...OK [1,753,749,778 terms; 6,245,487,551 terms/docs pairs; 19,608,458,684 tokens]
checkindex11: test: terms, freq, prox...OK [1,845,617,621 terms; 6,669,643,340 terms/docs pairs; 20,809,037,859 tokens]
checkindex12: test: terms, freq, prox...OK [1,836,242,012 terms; 6,576,312,517 terms/docs pairs; 20,696,851,354 tokens]
checkindex2:  test: terms, freq, prox...OK [1,826,454,981 terms; 6,562,443,988 terms/docs pairs; 20,573,418,135 tokens]
checkindex3:  test: terms, freq, prox...OK [1,559,632,315 terms; 5,331,674,748 terms/docs pairs; 16,676,182,757 tokens]
checkindex4:  test: terms, freq, prox...OK [1,733,497,461 terms; 6,148,179,886 terms/docs pairs; 19,185,049,968 tokens]
checkindex5:  test: terms, freq, prox...OK [1,743,907,495 terms; 6,145,563,987 terms/docs pairs; 19,112,393,059 tokens]
checkindex6:  test: terms, freq, prox...OK [1,788,413,706 terms; 6,426,214,975 terms/docs pairs; 20,085,657,055 tokens]
checkindex7:  test: terms, freq, prox...OK [1,827,750,657 terms; 6,528,132,060 terms/docs pairs; 20,458,147,346 tokens]
checkindex8:  test: terms, freq, prox...OK [1,827,041,001 terms; 6,488,926,124 terms/docs pairs; 20,342,593,173 tokens]
checkindex9:  test: terms, freq, prox...OK [1,796,261,968 terms; 6,379,849,448 terms/docs pairs; 19,914,688,090 tokens]
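
As a sanity check on the 24 billion estimate, the per-index "terms" counts listed above can be totaled with a small standalone program. This is only a sketch: the class TermCountBound and its methods are hypothetical helpers, not part of Lucene, and the total is an upper bound that assumes no term is shared between indexes. It says nothing about how many unique terms the merged index actually contains.

```java
// Hypothetical helper, not part of Lucene. Written with plain loops so it
// also compiles on the Java 1.6 runtime mentioned in the diagnostics.
class TermCountBound {

    // "terms" values copied from the 12 CheckIndex lines, in the order listed.
    static final long[] TERMS = {
        2098320125L, 1753749778L, 1845617621L, 1836242012L,
        1826454981L, 1559632315L, 1733497461L, 1743907495L,
        1788413706L, 1827750657L, 1827041001L, 1796261968L
    };

    // Upper bound on unique terms after the merge, assuming no overlap.
    static long upperBound(long[] counts) {
        long sum = 0L;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    // True if every individual count fits in a signed 32-bit int.
    static boolean allFitInInt(long[] counts) {
        for (long c : counts) {
            if (c > Integer.MAX_VALUE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long bound = upperBound(TERMS);
        System.out.println("upper bound on merged unique terms: " + bound);
        System.out.println("each per-index count fits in an int: " + allFitInInt(TERMS));
        System.out.println("upper bound fits in an int: " + (bound <= Integer.MAX_VALUE));
    }
}
```

The total comes to 21,636,889,120, which is consistent with the rough 24 billion figure. Each per-index count is below Integer.MAX_VALUE (2,147,483,647), while the no-overlap total is well above it; this calculation alone cannot show whether that has anything to do with the exception.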