Sorry about the previous stack trace -- that was a wild goose-chase. 
Here is the real culprit:

 at org.apache.lucene.index.SegmentTermEnum.growBuffer(SegmentTermEnum.java(Compiled Code))
 at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java(Compiled Code))
 at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java(Compiled Code))
 at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java(Compiled Code))
 at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java(Compiled Code))
 at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java(Compiled Code))
 at org.apache.lucene.index.MultiReader.docFreq(MultiReader.java(Compiled Code))
 at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
 at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
 at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.java:47)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.sumOfSquaredWeights(BooleanQuery.java:110)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.sumOfSquaredWeights(BooleanQuery.java:110)
 at org.apache.lucene.search.Query.weight(Query.java(Compiled Code))
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java(Compiled Code))
 at org.apache.lucene.search.Hits.getMoreDocs(Hits.java(Compiled Code))
 at org.apache.lucene.search.Hits.<init>(Hits.java:43)
 at org.apache.lucene.search.Searcher.search(Searcher.java(Compiled Code))

growBuffer contains:
  buffer = new char[length];
where length is the sum of two readVInt() reads on the "tii" file (start + length).
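For anyone following along, Lucene's VInt format packs an int into 7-bit groups, least-significant group first, with the high bit of each byte flagging that another byte follows. A minimal re-implementation (not the actual Lucene source, just the documented format):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class VIntFormat {
    // Decode a Lucene-style VInt: 7 data bits per byte,
    // least-significant group first; high bit = "more bytes follow".
    static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        // 300 = 0b10_0101100, so it encodes as 0xAC 0x02
        InputStream in = new ByteArrayInputStream(new byte[]{(byte) 0xAC, 0x02});
        System.out.println(readVInt(in)); // prints 300
    }
}
```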

I made a local copy of the largest "tii" file and (using BeanShell) read
the two variable ints:
bsh % d = org.apache.lucene.store.FSDirectory.getDirectory("/tmp/", false);
bsh % i = d.openFile("copy.tii");
bsh % print (i.readVInt());
266338303
bsh % print (i.readVInt());
0

In other words (if I read the code correctly), _start_ is set to
266338303 while 'length' is set to 0.  growBuffer allocates a char
array, and (according to our GC logs) the allocation is 532676624 bytes,
which is 266338303*2 + 18.  I am guessing the 18 additional bytes are
object header/type information, which is probably JVM-dependent.

So now the question becomes more direct:  Why on earth does our biggest
segment think that it starts at 266 million characters?

I just noticed that the hex value of the 266-million thingy is ... ...
drumroll ... ... 0xfdfffff

Now I'm thinking that either the read/write operations don't work 100%,
or that the index is corrupt.  However, we _have_ tried to rebuild the
index from scratch, so I don't think it's a corrupt index.


And finally, the start of the 'tii' file is as follows:

$ hexdump /tmp/test.tii | head
0000000 ffff feff 0000 0000 0000 f3cc 0000 8000
0000010 0000 1000 0000 0000 0000 0014 3107 3030
0000020 3130 3831 0104 02fd 897f 0407 3203 3034
0000030 0104 0380 0180 0788 0304 3633 0430 8001
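Note that hexdump's default output shows 16-bit words (little-endian on our machine), so the actual on-disk byte order is ff ff ff fe 00 00 ... Decoding those bytes with the documented VInt rules (7 bits per byte, high bit = continuation) reproduces exactly what BeanShell printed; a quick check, assuming nothing beyond the VInt format itself:

```java
public class TiiHead {
    // Decode one VInt from buf starting at pos[0], advancing pos[0].
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++] & 0xFF;
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // First bytes of the dump, after undoing hexdump's word order:
        byte[] head = {(byte) 0xff, (byte) 0xff, (byte) 0xff, (byte) 0xfe, 0x00, 0x00};
        int[] pos = {0};
        // 0xfe still has its high bit set, so the first VInt eats
        // FIVE bytes (ff ff ff fe 00) and comes out as 0xfdfffff.
        System.out.println(readVInt(head, pos)); // 266338303
        System.out.println(readVInt(head, pos)); // 0
    }
}
```

So the two values (266338303 and 0) fall straight out of the leading bytes of the file; the question is how those bytes got there.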


Regards,
Fredrik



