Btw, this 2.9 indexer is fast! I indexed 4 GB (1.07 million docs), including optimization, in just under 30 min. I set the RAM buffer to 1.9 GB via setRAMBufferSizeMB.

Peter

On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

> A handful of the source documents did contain the U+FFFF character. The
> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
> fixed the problem.
> Thanks Mike!
>
> Peter
>
> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> Hmm, only a few affected terms, and all this particular
>> "literals:cfid196$" term, with optional suffixes. Really strange.
>>
>> One thing that's odd is the exact term "literals:cfid196$" is printed
>> twice, which should never happen (every unique term should be stored
>> only once, in the terms dict).
>>
>> And, otherwise, CheckIndex got through the index just fine.
>>
>> Try searching a TermQuery with these affected terms and see if it
>> succeeds? If so, maybe try making an index with one or two of
>> them, alone, and see if that index shows the problem?
>>
>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will
>> produce an enormous amount of output, but if you can excise the few
>> lines around when that warning comes out & post back that'd be great.
>>
>> Mike
>>
>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> > Just to be safe, I ran with the official jar file from one of the mirrors
>> > and reproduced the problem.
>> > The debug session is not showing any characters = '\uffff' (checking this in
>> > Tokenizer).
>> > The output from the modified CheckIndex follows. There are only a few terms
>> > with the inconsistency. They are all legitimate terms from the app's
>> > context. With this info, I might be able to isolate the source documents.
>> > What should I be looking for when they are indexed?
>> >
>> > CheckIndex output:
>> >
>> > Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>> >
>> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>> > 1 of 3: name=_0 docCount=413585
>> >   compound=false
>> >   hasProx=true
>> >   numFiles=8
>> >   size (MB)=1,148.817
>> >   diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >   docStoreOffset=0
>> >   docStoreSegment=_0
>> >   docStoreIsCompoundFile=false
>> >   no deletions
>> >   test: open reader.........OK
>> >   test: fields..............OK [33 fields]
>> >   test: field norms.........OK [33 fields]
>> >   test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>> >   test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
>> >   test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> > 2 of 3: name=_1 docCount=359068
>> >   compound=false
>> >   hasProx=true
>> >   numFiles=8
>> >   size (MB)=1,125.161
>> >   diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >   docStoreOffset=413585
>> >   docStoreSegment=_0
>> >   docStoreIsCompoundFile=false
>> >   no deletions
>> >   test: open reader.........OK
>> >   test: fields..............OK [33 fields]
>> >   test: field norms.........OK [33 fields]
>> >   test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>> >   WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> >   WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> >   WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>> >   WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>> >   OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>> >   test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
>> >   test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> > 3 of 3: name=_2 docCount=304849
>> >   compound=false
>> >   hasProx=true
>> >   numFiles=8
>> >   size (MB)=962.004
>> >   diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >   docStoreOffset=772653
>> >   docStoreSegment=_0
>> >   docStoreIsCompoundFile=false
>> >   no deletions
>> >   test: open reader.........OK
>> >   test: fields..............OK [33 fields]
>> >   test: field norms.........OK [33 fields]
>> >   test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
>> >   WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>> >   WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> >   WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>> >   WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>> >   WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>> >   WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>> >   WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>> >   OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>> >   test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>> >   test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> > No problems were detected with this index.
>> >
>> > Peter
>> >
>> > On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless
>> > <luc...@mikemccandless.com> wrote:
>> >
>> >> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> >> > The only change I made to the source code was the patch for
>> >> > PayloadNearQuery (LUCENE-1986).
>> >>
>> >> That patch certainly shouldn't lead to this.
>> >>
>> >> > It's possible that our content contains U+FFFF. I will run in debugger
>> >> > and see.
>> >>
>> >> OK may as well check just so we cover all possibilities.
>> >>
>> >> > The data is 'sensitive', so I may not be able to provide a bad segment,
>> >> > unfortunately.
>> >>
>> >> OK, maybe we can modify your CheckIndex instead. Let's start with
>> >> this, which prints a warning whenever the docFreq differs but
>> >> otherwise continues (vs throwing RuntimeException). I'm curious how
>> >> many terms show this, and whether the TermEnum keeps working after
>> >> this term that has different docFreq:
>> >>
>> >> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> >> ===================================================================
>> >> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
>> >> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>> >> @@ -672,8 +672,8 @@
>> >>            }
>> >>
>> >>            if (freq0 + delCount != docFreq) {
>> >> -            throw new RuntimeException("term " + term + " docFreq=" +
>> >> -                                       docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>> >> +            System.out.println("WARNING: term " + term + " docFreq=" +
>> >> +                               docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>> >>            }
>> >>          }
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
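
Since the cause turned out to be U+FFFF in a handful of source documents, a pre-indexing check along the following lines may help locate such documents. This is only a minimal sketch in plain Java, not the LUCENE-2016 patch: the class and method names are hypothetical, and substituting U+FFFD (the Unicode replacement character) is an assumption about what a reasonable stand-in would be.

```java
public class InvalidCharScan {
    // U+FFFF is a Unicode noncharacter; in this thread, documents containing it
    // were the ones that produced the docFreq warnings.
    static final char INVALID = '\uFFFF';
    // Assumption: U+FFFD is used here as the substitute character.
    static final char REPLACEMENT = '\uFFFD';

    /** Returns the index of the first U+FFFF in text, or -1 if there is none. */
    static int firstInvalid(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == INVALID) {
                return i;
            }
        }
        return -1;
    }

    /** Returns a copy of text with every U+FFFF replaced by U+FFFD. */
    static String sanitize(String text) {
        return text.replace(INVALID, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Hypothetical field value with an embedded U+FFFF.
        String doc = "cfid196$" + INVALID + "leader";
        System.out.println("first U+FFFF at index: " + firstInvalid(doc));
        System.out.println("clean after sanitize: " + (firstInvalid(sanitize(doc)) == -1));
    }
}
```

Running firstInvalid over the raw field text before it reaches the analyzer would flag the offending documents; sanitize is only needed when the indexer in use does not already handle the character.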