A couple more data points:

RamSize   Index(min)   Optimize(min)   Peak mem
1.9G      24           5               5G
800M      24           5               4G
400M      25           5               3.5G
100M      25           5               3G
50M       26           4               3G
Peter

On Thu, Oct 29, 2009 at 8:49 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Thanks a lot Peter! Really appreciate it.
>
> Peter Keegan wrote:
> > Mark,
> >
> > With 1.9G, I had to increase the JVM heap significantly (to 8G) to avoid
> > paging and GC hits. Here is a table comparing indexing times, optimizing
> > times and peak memory usage as a function of the RAMBufferSize. This was
> > run on a 64-bit server with 32GB RAM:
> >
> > RamSize   Index(min)   Optimize(min)   Max VM
> > 1.9G      24           5               5G
> > 800M      24           5               4G
> >
> > Not much difference. I'll make a couple more runs with lower values.
> > Btw, the indexing times are really about 5 min. shorter because of some
> > non-Lucene related delays after the last document.
> >
> > Peter
> >
> > On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Any chance I could get you to try that again with a buffer of like 800MB
> >> to a gig and do a comparison?
> >>
> >> I've been investigating the returns you get with a larger buffer size.
> >> It appears to be pretty diminishing returns over 100MB or so - at higher
> >> than that, I've gotten both slower speeds for some sizes, and larger
> >> gains for others. But only better by 5-10 docs a second up to a gig. But
> >> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> >> with that, at over a gig it starts to page and the performance gets hit.
> >> I'd love to see what kind of benefit you see going from around a gig to
> >> just under 2.
> >>
> >> Peter Keegan wrote:
> >>
> >>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> >>> optimization in just under 30 min.
> >>> I used setRAMBufferSizeMB=1.9G
> >>>
> >>> Peter
> >>>
> >>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>
> >>>> A handful of the source documents did contain the U+FFFF character. The
> >>>> patch from *LUCENE-2016<https://issues.apache.org/jira/browse/LUCENE-2016>*
> >>>> fixed the problem.
> >>>> Thanks Mike!
> >>>>
> >>>> Peter
> >>>>
> >>>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>>>
> >>>>> Hmm, only a few affected terms, and all this particular
> >>>>> "literals:cfid196$" term, with optional suffixes. Really strange.
> >>>>>
> >>>>> One things that's odd is the exact term "literals:cfid196$" is printed
> >>>>> twice, which should never happen (every unique term should be stored
> >>>>> only once, in the terms dict).
> >>>>>
> >>>>> And, otherwise, CheckIndex got through the index just fine.
> >>>>>
> >>>>> Try searching a TermQuery with these affected terms and see if it
> >>>>> succeeds? If so, maybe trying making an index with one or two of
> >>>>> them, alone, and see if that index shows the problem?
> >>>>>
> >>>>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will
> >>>>> produce an enormous amount of output, but if you can excise the few
> >>>>> lines around when that warning comes out & post back that'd be great.
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>
> >>>>>> Just to be safe, I ran with the official jar file from one of the mirrors
> >>>>>> and reproduced the problem.
> >>>>>> The debug session is not showing any characters = '\uffff' (checking this in
> >>>>>> Tokenizer).
> >>>>>> The output from the modified CheckIndex follows. There are only a few terms
> >>>>>> with the inconsistency. They are all legitimate terms from the app's
> >>>>>> context. With this info, I might be able to isolate the source documents.
> >>>>>> What should I be looking for when they are indexed?
> >>>>>>
> >>>>>> CheckInput output:
> >>>>>>
> >>>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >>>>>>
> >>>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
> >>>>>>   1 of 3: name=_0 docCount=413585
> >>>>>>     compound=false
> >>>>>>     hasProx=true
> >>>>>>     numFiles=8
> >>>>>>     size (MB)=1,148.817
> >>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>     docStoreOffset=0
> >>>>>>     docStoreSegment=_0
> >>>>>>     docStoreIsCompoundFile=false
> >>>>>>     no deletions
> >>>>>>     test: open reader.........OK
> >>>>>>     test: fields..............OK [33 fields]
> >>>>>>     test: field norms.........OK [33 fields]
> >>>>>>     test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
> >>>>>>     test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
> >>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>>   2 of 3: name=_1 docCount=359068
> >>>>>>     compound=false
> >>>>>>     hasProx=true
> >>>>>>     numFiles=8
> >>>>>>     size (MB)=1,125.161
> >>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>     docStoreOffset=413585
> >>>>>>     docStoreSegment=_0
> >>>>>>     docStoreIsCompoundFile=false
> >>>>>>     no deletions
> >>>>>>     test: open reader.........OK
> >>>>>>     test: fields..............OK [33 fields]
> >>>>>>     test: field norms.........OK [33 fields]
> >>>>>>     test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
> >>>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
> >>>>>>     test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
> >>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>>   3 of 3: name=_2 docCount=304849
> >>>>>>     compound=false
> >>>>>>     hasProx=true
> >>>>>>     numFiles=8
> >>>>>>     size (MB)=962.004
> >>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>     docStoreOffset=772653
> >>>>>>     docStoreSegment=_0
> >>>>>>     docStoreIsCompoundFile=false
> >>>>>>     no deletions
> >>>>>>     test: open reader.........OK
> >>>>>>     test: fields..............OK [33 fields]
> >>>>>>     test: field norms.........OK [33 fields]
> >>>>>>     test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
> >>>>>> WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
> >>>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
> >>>>>>     test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
> >>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>> No problems were detected with this index.
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>>>>>
> >>>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
> >>>>>>>> (LUCENE-1986).
> >>>>>>>
> >>>>>>> That patch certainly shouldn't lead to this.
> >>>>>>>
> >>>>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
> >>>>>>>> see.
> >>>>>>>
> >>>>>>> OK may as well check just so we cover all possibilities.
> >>>>>>>
> >>>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
> >>>>>>>> unfortunately.
> >>>>>>>
> >>>>>>> OK, maybe we can modify your CheckIndex instead. Let's start with
> >>>>>>> this, which prints a warning whenever the docFreq differs but
> >>>>>>> otherwise continues (vs throwing RuntimeException). I'm curious how
> >>>>>>> many terms show this, and whether the TermEnum keeps working after
> >>>>>>> this term that has different docFreq:
> >>>>>>>
> >>>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
> >>>>>>> ===================================================================
> >>>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
> >>>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
> >>>>>>> @@ -672,8 +672,8 @@
> >>>>>>>          }
> >>>>>>>
> >>>>>>>          if (freq0 + delCount != docFreq) {
> >>>>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
> >>>>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>>> +          System.out.println("WARNING: term " + term + " docFreq=" +
> >>>>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>>>          }
> >>>>>>>        }
> >>>>>>>
> >>>>>>> Mike
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
>
> --
> - Mark
>
> http://www.lucidimagination.com