Mark,

With 1.9G, I had to increase the JVM heap significantly (to 8G) to avoid
paging and GC hits. Here is a table comparing indexing times, optimizing
times and peak memory usage as a function of the RAMBufferSize. This was run
on a 64-bit server with 32GB RAM:

RamSize   Index(min)   Optimize(min)   Max VM
1.9G      24           5               5G
800M      24           5               4G

Not much difference. I'll make a couple more runs with lower values.
Btw, the indexing times are really about 5 min. shorter because of some
non-Lucene related delays after the last document.

Peter

On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Any chance I could get you to try that again with a buffer of like 800MB
> to a gig and do a comparison?
>
> I've been investigating the returns you get with a larger buffer size.
> It appears to be pretty diminishing returns over 100MB or so - at higher
> than that, I've gotten both slower speeds for some sizes, and larger
> gains for others. But only better by 5-10 docs a second up to a gig. But
> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> with that, at over a gig it starts to page and the performance gets hit.
> I'd love to see what kind of benefit you see going from around a gig to
> just under 2.
>
> Peter Keegan wrote:
> > Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> > optimization in just under 30 min.
> > I used setRAMBufferSizeMB=1.9G
> >
> > Peter
> >
> > On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >
> >> A handful of the source documents did contain the U+FFFF character. The
> >> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
> >> fixed the problem.
> >> Thanks Mike!
> >>
> >> Peter
> >>
> >> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>
> >>> Hmm, only a few affected terms, and all this particular
> >>> "literals:cfid196$" term, with optional suffixes. Really strange.
> >>>
> >>> One thing that's odd is the exact term "literals:cfid196$" is printed
> >>> twice, which should never happen (every unique term should be stored
> >>> only once, in the terms dict).
> >>>
> >>> And, otherwise, CheckIndex got through the index just fine.
> >>>
> >>> Try searching a TermQuery with these affected terms and see if it
> >>> succeeds? If so, maybe try making an index with one or two of
> >>> them, alone, and see if that index shows the problem?
> >>>
> >>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will
> >>> produce an enormous amount of output, but if you can excise the few
> >>> lines around when that warning comes out & post back that'd be great.
> >>>
> >>> Mike
> >>>
> >>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>
> >>>> Just to be safe, I ran with the official jar file from one of the mirrors
> >>>> and reproduced the problem.
> >>>> The debug session is not showing any characters = '\uffff' (checking this in
> >>>> Tokenizer).
> >>>> The output from the modified CheckIndex follows. There are only a few terms
> >>>> with the inconsistency. They are all legitimate terms from the app's
> >>>> context. With this info, I might be able to isolate the source documents.
> >>>> What should I be looking for when they are indexed?
> >>>>
> >>>> CheckIndex output:
> >>>>
> >>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >>>>
> >>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
> >>>>   1 of 3: name=_0 docCount=413585
> >>>>     compound=false
> >>>>     hasProx=true
> >>>>     numFiles=8
> >>>>     size (MB)=1,148.817
> >>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>     docStoreOffset=0
> >>>>     docStoreSegment=_0
> >>>>     docStoreIsCompoundFile=false
> >>>>     no deletions
> >>>>     test: open reader.........OK
> >>>>     test: fields..............OK [33 fields]
> >>>>     test: field norms.........OK [33 fields]
> >>>>     test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
> >>>>     test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
> >>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>>   2 of 3: name=_1 docCount=359068
> >>>>     compound=false
> >>>>     hasProx=true
> >>>>     numFiles=8
> >>>>     size (MB)=1,125.161
> >>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>     docStoreOffset=413585
> >>>>     docStoreSegment=_0
> >>>>     docStoreIsCompoundFile=false
> >>>>     no deletions
> >>>>     test: open reader.........OK
> >>>>     test: fields..............OK [33 fields]
> >>>>     test: field norms.........OK [33 fields]
> >>>>     test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
> >>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
> >>>>     test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
> >>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>>   3 of 3: name=_2 docCount=304849
> >>>>     compound=false
> >>>>     hasProx=true
> >>>>     numFiles=8
> >>>>     size (MB)=962.004
> >>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>     docStoreOffset=772653
> >>>>     docStoreSegment=_0
> >>>>     docStoreIsCompoundFile=false
> >>>>     no deletions
> >>>>     test: open reader.........OK
> >>>>     test: fields..............OK [33 fields]
> >>>>     test: field norms.........OK [33 fields]
> >>>>     test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
> >>>> WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
> >>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
> >>>>     test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
> >>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>> No problems were detected with this index.
> >>>>
> >>>> Peter
> >>>>
> >>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>>>
> >>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>
> >>>>>> The only change I made to the source code was the patch for PayloadNearQuery
> >>>>>> (LUCENE-1986).
> >>>>>
> >>>>> That patch certainly shouldn't lead to this.
> >>>>>
> >>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
> >>>>>> see.
> >>>>>
> >>>>> OK may as well check just so we cover all possibilities.
> >>>>>
> >>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
> >>>>>> unfortunately.
> >>>>>
> >>>>> OK, maybe we can modify your CheckIndex instead. Let's start with
> >>>>> this, which prints a warning whenever the docFreq differs but
> >>>>> otherwise continues (vs throwing RuntimeException). I'm curious how
> >>>>> many terms show this, and whether the TermEnum keeps working after
> >>>>> this term that has different docFreq:
> >>>>>
> >>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
> >>>>> ===================================================================
> >>>>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
> >>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
> >>>>> @@ -672,8 +672,8 @@
> >>>>>          }
> >>>>>
> >>>>>          if (freq0 + delCount != docFreq) {
> >>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
> >>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>> +          System.out.println("WARNING: term " + term + " docFreq=" +
> >>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>          }
> >>>>>        }
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> - Mark
>
> http://www.lucidimagination.com
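
On the question raised earlier in the thread of what to look for when the affected documents are indexed: since a handful of source documents turned out to contain U+FFFF, a scan of the raw text before it reaches the analyzer might help isolate them. The following is a minimal sketch using only the JDK. The class and method names (SentinelCharCheck, findInvalid, sanitize) are hypothetical and not part of Lucene, and substituting U+FFFD is an assumption about a reasonable cleanup, not a description of what the LUCENE-2016 patch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): locates and replaces U+FFFF in document text. */
final class SentinelCharCheck {
    /** The character reported in the source documents (U+FFFF). */
    static final char INVALID = '\uFFFF';
    /** Assumed substitute: the Unicode replacement character (U+FFFD). */
    static final char REPLACEMENT = '\uFFFD';

    private SentinelCharCheck() {}

    /** Returns the char offsets of every U+FFFF occurrence in the text. */
    static List<Integer> findInvalid(String text) {
        List<Integer> offsets = new ArrayList<Integer>();
        for (int i = text.indexOf(INVALID); i >= 0; i = text.indexOf(INVALID, i + 1)) {
            offsets.add(i);
        }
        return offsets;
    }

    /** Returns a copy of the text with every U+FFFF swapped for U+FFFD. */
    static String sanitize(String text) {
        return text.replace(INVALID, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Made-up sample text; the real field contents are not shown in the thread.
        String doc = "cfid196$" + INVALID + "leader";
        System.out.println("offsets of U+FFFF: " + findInvalid(doc));
        System.out.println("clean after sanitize: " + findInvalid(sanitize(doc)).isEmpty());
    }
}
```

Running findInvalid over each field value just before the document is added would flag the source documents without exposing their contents, which may matter here given that the data is sensitive.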