I'm glad we finally got to the bottom of this :) This fix will be in 2.9.1.
This is a nice fast indexing result, too... Mike On Thu, Oct 29, 2009 at 3:55 PM, Peter Keegan <[email protected]> wrote: > Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with > optimization in just under 30 min. > I used setRAMBufferSizeMB=1.9G > > Peter > > On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <[email protected]>wrote: > >> A handful of the source documents did contain the U+FFFF character. The >> patch from *LUCENE-2016<https://issues.apache.org/jira/browse/LUCENE-2016> >> *fixed the problem. >> Thanks Mike! >> >> Peter >> >> >> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless < >> [email protected]> wrote: >> >>> Hmm, only a few affected terms, and all this particular >>> "literals:cfid196$" term, with optional suffixes. Really strange. >>> >>> One things that's odd is the exact term "literals:cfid196$" is printed >>> twice, which should never happen (every unique term should be stored >>> only once, in the terms dict). >>> >>> And, otherwise, CheckIndex got through the index just fine. >>> >>> Try searching a TermQuery with these affected terms and see if it >>> succeeds? If so, maybe trying making an index with one or two of >>> them, alone, and see if that index shows the problem? >>> >>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will >>> produce an enormous amount of output, but if you can excise the few >>> lines around when that warning comes out & post back that'd be great. >>> >>> Mike >>> >>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <[email protected]> >>> wrote: >>> > Just to be safe, I ran with the official jar file from one of the >>> mirrors >>> > and reproduced the problem. >>> > The debug session is not showing any characters = '\uffff' (checking >>> this in >>> > Tokenizer). >>> > The output from the modified CheckIndex follows. There are only a few >>> terms >>> > with the inconsistency. They are all legitimate terms from the app's >>> > context. With this info, I might be able to isolate the source >>> documents. >>> > What should I be looking for when they are indexed? >>> > >>> > CheckInput output: >>> > >>> > Opening index @ >>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4 >>> > >>> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS >>> [Lucene >>> > 2.9] >>> > 1 of 3: name=_0 docCount=413585 >>> > compound=false >>> > hasProx=true >>> > numFiles=8 >>> > size (MB)=1,148.817 >>> > diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 >>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, >>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} >>> > docStoreOffset=0 >>> > docStoreSegment=_0 >>> > docStoreIsCompoundFile=false >>> > no deletions >>> > test: open reader.........OK >>> > test: fields..............OK [33 fields] >>> > test: field norms.........OK [33 fields] >>> > test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs >>> pairs; >>> > 340244234 tokens] >>> > test: stored fields.......OK [1240755 total field count; avg 3 fields >>> > per doc] >>> > test: term vectors........OK [0 total vector count; avg 0 term/freq >>> > vector fields per doc] >>> > >>> > 2 of 3: name=_1 docCount=359068 >>> > compound=false >>> > hasProx=true >>> > numFiles=8 >>> > size (MB)=1,125.161 >>> > diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 >>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, >>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} >>> > docStoreOffset=413585 >>> > docStoreSegment=_0 >>> > docStoreIsCompoundFile=false >>> > no deletions >>> > test: open reader.........OK >>> > test: fields..............OK [33 fields] >>> > test: field norms.........OK [33 fields] >>> > test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 >>> != >>> > num docs seen 4 + num docs deleted 0 >>> > WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs >>> > deleted 0 >>> > WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs >>> > deleted 0 >>> > WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 >>> + >>> > num docs deleted 0 >>> > WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num >>> > docs deleted 0 >>> > OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens] >>> > test: stored fields.......OK [1077204 total field count; avg 3 fields >>> > per doc] >>> > test: term vectors........OK [0 total vector count; avg 0 term/freq >>> > vector fields per doc] >>> > >>> > 3 of 3: name=_2 docCount=304849 >>> > compound=false >>> > hasProx=true >>> > numFiles=8 >>> > size (MB)=962.004 >>> > diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 >>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, >>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.} >>> > docStoreOffset=772653 >>> > docStoreSegment=_0 >>> > docStoreIsCompoundFile=false >>> > no deletions >>> > test: open reader.........OK >>> > test: fields..............OK [33 fields] >>> > test: field norms.........OK [33 fields] >>> > test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num >>> > docs seen 246 + num docs deleted 0 >>> > WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num >>> docs >>> > deleted 0 >>> > WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs >>> > deleted 0 >>> > WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + >>> num >>> > docs deleted 0 >>> > WARNING: term literals:cfid196$interrogation docFreq=181 != num docs >>> seen 1 >>> > + num docs deleted 0 >>> > WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + >>> num >>> > docs deleted 0 >>> > WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen >>> 1 + >>> > num docs deleted 0 >>> > WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num >>> docs >>> > deleted 0 >>> > OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens] >>> > test: stored fields.......OK [914547 total field count; avg 3 fields >>> per >>> > doc] >>> > test: term vectors........OK [0 total vector count; avg 0 term/freq >>> > vector fields per doc] >>> > >>> > No problems were detected with this index. >>> > >>> > Peter >>> > >>> > >>> > On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless < >>> > [email protected]> wrote: >>> > >>> >> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <[email protected] >>> > >>> >> wrote: >>> >> > The only change I made to the source code was the patch for >>> >> PayloadNearQuery >>> >> > (LUCENE-1986). >>> >> >>> >> That patch certainly shouldn't lead to this. >>> >> >>> >> > It's possible that our content contains U+FFFF. I will run in >>> debugger >>> >> and >>> >> > see. >>> >> >>> >> OK may as well check just so we cover all possibilities. >>> >> >>> >> > The data is 'sensitive', so I may not be able to provide a bad >>> segment, >>> >> > unfortunately. >>> >> >>> >> OK, maybe we can modify your CheckIndex instead. Let's start with >>> >> this, which prints a warning whenever the docFreq differs but >>> >> otherwise continues (vs throwing RuntimeException). I'm curious how >>> >> many terms show this, and whether the TermEnum keeps working after >>> >> this term that has different docFreq: >>> >> >>> >> Index: src/java/org/apache/lucene/index/CheckIndex.java >>> >> =================================================================== >>> >> --- src/java/org/apache/lucene/index/CheckIndex.java (revision >>> 829889) >>> >> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy) >>> >> @@ -672,8 +672,8 @@ >>> >> } >>> >> >>> >> if (freq0 + delCount != docFreq) { >>> >> - throw new RuntimeException("term " + term + " docFreq=" + >>> >> - docFreq + " != num docs seen " + >>> >> freq0 + " + num docs deleted " + delCount); >>> >> + System.out.println("WARNING: term " + term + " docFreq=" + >>> >> + docFreq + " != num docs seen " + freq0 + >>> >> " + num docs deleted " + delCount); >>> >> } >>> >> } >>> >> >>> >> Mike >>> >> >>> >> --------------------------------------------------------------------- >>> >> To unsubscribe, e-mail: [email protected] >>> >> For additional commands, e-mail: [email protected] >>> >> >>> >> >>> > >>> >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
