Any chance I could get you to try that again with a buffer of like 800MB to a gig and do a comparison?
I've been investigating the returns you get with a larger buffer size. They appear to diminish sharply above 100MB or so: beyond that, I've seen slower speeds for some sizes and larger gains for others. Even the gains are only 5-10 docs a second, all the way up to a gig. I can't reliably test above a gig, though. I have only 4 GB of RAM, and even with that, above a gig the machine starts to page and performance takes a hit. I'd love to see what kind of benefit you get going from around a gig to just under 2.

Peter Keegan wrote:
> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> optimization in just under 30 min.
> I used setRAMBufferSizeMB=1.9G
>
> Peter
>
> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>
>> A handful of the source documents did contain the U+FFFF character. The
>> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
>> fixed the problem.
>> Thanks Mike!
>>
>> Peter
>>
>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>
>>> Hmm, only a few affected terms, and all this particular
>>> "literals:cfid196$" term, with optional suffixes. Really strange.
>>>
>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>> twice, which should never happen (every unique term should be stored
>>> only once, in the terms dict).
>>>
>>> And, otherwise, CheckIndex got through the index just fine.
>>>
>>> Try searching a TermQuery with these affected terms and see if it
>>> succeeds? If so, maybe try making an index with one or two of
>>> them, alone, and see if that index shows the problem?
>>>
>>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will
>>> produce an enormous amount of output, but if you can excise the few
>>> lines around when that warning comes out & post back that'd be great.
>>>
>>> Mike
>>>
>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>>
>>>> Just to be safe, I ran with the official jar file from one of the mirrors
>>>> and reproduced the problem.
>>>> The debug session is not showing any characters = '\uffff' (checking this in
>>>> Tokenizer).
>>>> The output from the modified CheckIndex follows. There are only a few terms
>>>> with the inconsistency. They are all legitimate terms from the app's
>>>> context. With this info, I might be able to isolate the source documents.
>>>> What should I be looking for when they are indexed?
>>>>
>>>> CheckIndex output:
>>>>
>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>>
>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>> 1 of 3: name=_0 docCount=413585
>>>> compound=false
>>>> hasProx=true
>>>> numFiles=8
>>>> size (MB)=1,148.817
>>>> diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>> docStoreOffset=0
>>>> docStoreSegment=_0
>>>> docStoreIsCompoundFile=false
>>>> no deletions
>>>> test: open reader.........OK
>>>> test: fields..............OK [33 fields]
>>>> test: field norms.........OK [33 fields]
>>>> test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>>>> test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
>>>> test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>
>>>> 2 of 3: name=_1 docCount=359068
>>>> compound=false
>>>> hasProx=true
>>>> numFiles=8
>>>> size (MB)=1,125.161
>>>> diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>> docStoreOffset=413585
>>>> docStoreSegment=_0
>>>> docStoreIsCompoundFile=false
>>>> no deletions
>>>> test: open reader.........OK
>>>> test: fields..............OK [33 fields]
>>>> test: field norms.........OK [33 fields]
>>>> test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>>>> WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>>> test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
>>>> test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>
>>>> 3 of 3: name=_2 docCount=304849
>>>> compound=false
>>>> hasProx=true
>>>> numFiles=8
>>>> size (MB)=962.004
>>>> diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>> docStoreOffset=772653
>>>> docStoreSegment=_0
>>>> docStoreIsCompoundFile=false
>>>> no deletions
>>>> test: open reader.........OK
>>>> test: fields..............OK [33 fields]
>>>> test: field norms.........OK [33 fields]
>>>> test: terms, freq, prox...WARNING: term contents:?
>>>> docFreq=1 != num docs seen 246 + num docs deleted 0
>>>> WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>>>> WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>>>> WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>>>> WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>>>> WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>>>> test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>>>> test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>
>>>> No problems were detected with this index.
>>>>
>>>> Peter
>>>>
>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>>
>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>>>>
>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
>>>>>> (LUCENE-1986).
>>>>>
>>>>> That patch certainly shouldn't lead to this.
>>>>>
>>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
>>>>>> see.
>>>>>
>>>>> OK may as well check just so we cover all possibilities.
>>>>>
>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
>>>>>> unfortunately.
>>>>>
>>>>> OK, maybe we can modify your CheckIndex instead.
>>>>> Let's start with this, which prints a warning whenever the docFreq differs but
>>>>> otherwise continues (vs throwing RuntimeException). I'm curious how
>>>>> many terms show this, and whether the TermEnum keeps working after
>>>>> this term that has different docFreq:
>>>>>
>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>>>>> ===================================================================
>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>>>>> @@ -672,8 +672,8 @@
>>>>>            }
>>>>>
>>>>>            if (freq0 + delCount != docFreq) {
>>>>> -            throw new RuntimeException("term " + term + " docFreq=" +
>>>>> -                                       docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>> +            System.out.println("WARNING: term " + term + " docFreq=" +
>>>>> +                               docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>            }
>>>>>          }
>>>>>
>>>>> Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
- Mark

http://www.lucidimagination.com
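
For anyone who lands on this thread with the same docFreq mismatch: Peter traced it to a handful of source documents containing U+FFFF, and the LUCENE-2016 patch fixed it. Below is a minimal sketch of a pre-indexing check in plain Java, with no Lucene dependency. The class and method names (NoncharScan, firstFFFF, scrub) are made up for illustration, and swapping U+FFFF for U+FFFD is an assumption of this sketch, not something stated in the thread.

```java
// Hypothetical helper, not part of Lucene: finds or replaces U+FFFF in text
// before it is handed to an indexer.
final class NoncharScan {

    /** Returns the index of the first U+FFFF in text, or -1 if there is none. */
    static int firstFFFF(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFF') {
                return i;
            }
        }
        return -1;
    }

    /** Replaces every U+FFFF with U+FFFD (assumed substitute for this sketch). */
    static String scrub(String text) {
        return text.replace('\uFFFF', '\uFFFD');
    }

    public static void main(String[] args) {
        String clean = "cfid196$leader";
        String dirty = "cfid196$" + '\uFFFF' + "leader";
        System.out.println(firstFFFF(clean));        // prints -1
        System.out.println(firstFFFF(dirty));        // prints 8
        System.out.println(firstFFFF(scrub(dirty))); // prints -1
    }
}
```

Running firstFFFF over each field value while building documents would be one way to isolate the offending source documents; scrub is only worth applying if dropping the character is acceptable for your data.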