Hello,

I'm trying to merge 12 indexes into one big index using the Lucene
IndexMergeTool (command line appended below). The merge seemed to
finish successfully, but when I ran CheckIndex on the merged index, I got
an array out of bounds error: "java.lang.ArrayIndexOutOfBoundsException:
133157597" (see below for the full error message).
I'm wondering if we are running into some Lucene limit, or if there is
some kind of weird bug in IndexMergeTool. The documents indexed
contain dirty OCR in up to 400 languages, so there is a huge number of
unique terms, and we are indexing bigrams plus unigrams, which increases
the count even more. Each index prior to the merge had fewer than 2
billion unique terms, so even if no terms were duplicated across indexes,
the merged index would have at most 12 * 2 = 24 billion unique terms.
I believe that LUCENE-2257 raised the limit that Lucene can handle to
274 billion.
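As a quick sanity check on that arithmetic (the 2-billion-per-index figure and the 274 billion limit are the estimates above, not measured values):

```java
public class TermLimitCheck {
    public static void main(String[] args) {
        long perIndexTerms = 2_000_000_000L;         // < 2 billion unique terms per index
        long indexCount = 12;
        // Worst case assumes no term is shared between any two indexes
        long worstCase = indexCount * perIndexTerms; // 24 billion
        long lucene2257Limit = 274_000_000_000L;     // limit as raised by LUCENE-2257
        System.out.println("Worst-case merged terms: " + worstCase);          // 24000000000
        System.out.println("Within limit: " + (worstCase < lucene2257Limit)); // true
    }
}
```

So even the worst case should be well under the 274 billion limit, if I understand that limit correctly.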

Appended below are the error output from CheckIndex, the command line used
to merge the indexes, and the term-count line from CheckIndex run on each
of the 12 indexes that were later merged.

Tom

CheckIndex error:
Opening index @ bigramsRetest
Segments file=segments_1 numSegments=1 version=3.6 format=FORMAT_3_1
[Lucene 3.1+]
  1 of 1: name=_c docCount=865870
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=309,357.885
    diagnostics = {mergeFactor=12, os.version=2.6.18-308.20.1.el5,
os=Linux, lucene.version=3.6-SNAPSHOT exported - tom - 2012-11-06 14:16:41,
source=merge, os.arch=amd64, mergeMaxNumSegments=1, java.version=1.6.0_16,
java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.........OK
    test: fields..............OK [87 fields]
    test: field norms.........OK [43 fields]
    test: terms, freq, prox...ERROR [133157597]
java.lang.ArrayIndexOutOfBoundsException: 133157597
        at
org.apache.lucene.index.TermInfosReaderIndex.compareField(TermInfosReaderIndex.java:249)
        at
org.apache.lucene.index.TermInfosReaderIndex.compareTo(TermInfosReaderIndex.java:225)
        at
org.apache.lucene.index.TermInfosReaderIndex.getIndexOffset(TermInfosReaderIndex.java:156)
        at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
        at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
        at
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:66)
        at
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:715)
        at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:578)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)
    test: stored fields.......OK [32361128 total field count; avg 37.374
fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq
vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full
exception:
java.lang.RuntimeException: Term Index test failed
        at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:591)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1064)

WARNING: 1 broken segments (containing 865870 documents) detected
WARNING: would write new segments file, and 865870 documents would be lost,
if -fix were specified
----------------------------------------
I merged the 12 indexes using this command, which completed without
reporting any errors (names of files/directories shortened):
 java -Xms32g -Xmx32g  -cp
/xxx/3.6/lucene-core-3.6-SNAPSHOT.jar:/xxx/3.6/lucene-misc-3.6-SNAPSHOT.jar
org.apache.lucene.misc.IndexMergeTool /xxx/bigramsAll  index1 index2 index3
index4 index5 index6 index7 index8 index9 index10 index11 index12
------------------------------------------------------------
CheckIndex term-count lines from the 12 indexes prior to merging (commas added):
checkindex1:    test: terms, freq, prox...OK [2,098,320,125 terms;
7,678,444,394 terms/docs pairs; 24,091,209,315 tokens]
checkindex10:    test: terms, freq, prox...OK [1,753,749,778 terms;
6,245,487,551 terms/docs pairs; 19,608,458,684 tokens]
checkindex11:    test: terms, freq, prox...OK [1,845,617,621 terms;
6,669,643,340 terms/docs pairs; 20,809,037,859 tokens]
checkindex12:    test: terms, freq, prox...OK [1,836,242,012 terms;
6,576,312,517 terms/docs pairs; 20,696,851,354 tokens]
checkindex2:    test: terms, freq, prox...OK [1,826,454,981 terms;
6,562,443,988 terms/docs pairs; 20,573,418,135 tokens]
checkindex3:    test: terms, freq, prox...OK [1,559,632,315 terms;
5,331,674,748 terms/docs pairs; 16,676,182,757 tokens]
checkindex4:    test: terms, freq, prox...OK [1,733,497,461 terms;
6,148,179,886 terms/docs pairs; 19,185,049,968 tokens]
checkindex5:    test: terms, freq, prox...OK [1,743,907,495 terms;
6,145,563,987 terms/docs pairs; 19,112,393,059 tokens]
checkindex6:    test: terms, freq, prox...OK [1,788,413,706 terms;
6,426,214,975 terms/docs pairs; 20,085,657,055 tokens]
checkindex7:    test: terms, freq, prox...OK [1,827,750,657 terms;
6,528,132,060 terms/docs pairs; 20,458,147,346 tokens]
checkindex8:    test: terms, freq, prox...OK [1,827,041,001 terms;
6,488,926,124 terms/docs pairs; 20,342,593,173 tokens]
checkindex9:    test: terms, freq, prox...OK [1,796,261,968 terms;
6,379,849,448 terms/docs pairs; 19,914,688,090 tokens]
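Summing those per-index term counts gives the worst-case size of the merged term dictionary, assuming no term is shared between any two indexes (in practice many terms will be shared, so the real count is lower):

```java
public class MergedTermCount {
    public static void main(String[] args) {
        // Per-index unique term counts copied from the CheckIndex lines above
        long[] terms = {
            2_098_320_125L, 1_826_454_981L, 1_559_632_315L, 1_733_497_461L,
            1_743_907_495L, 1_788_413_706L, 1_827_750_657L, 1_827_041_001L,
            1_796_261_968L, 1_753_749_778L, 1_845_617_621L, 1_836_242_012L
        };
        long total = 0;
        for (long t : terms) {
            total += t;
        }
        // Worst-case (no-overlap) total: 21,636,889,120 unique terms
        System.out.println("Worst-case merged unique terms: " + total);
    }
}
```

That worst case, about 21.6 billion, is still well under the 274 billion figure, which is part of why I suspect a bug rather than a hard limit.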
