Thanks a lot Peter! Really appreciate it.

Peter Keegan wrote:
> Mark,
>
> With 1.9G, I had to increase the JVM heap significantly (to 8G) to avoid
> paging and GC hits. Here is a table comparing indexing times, optimizing
> times and peak memory usage as a function of the RAMBufferSize. This was
> run on a 64-bit server with 32GB RAM:
>
> RamSize   Index(min)   Optimize(min)   Max VM
> 1.9G      24           5               5G
> 800M      24           5               4G
>
> Not much difference. I'll make a couple more runs with lower values.
> Btw, the indexing times are really about 5 min. shorter because of some
> non-Lucene related delays after the last document.
>
> Peter
>
> On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Any chance I could get you to try that again with a buffer of like 800MB
>> to a gig and do a comparison?
>>
>> I've been investigating the returns you get with a larger buffer size.
>> It appears to be pretty diminishing returns over 100MB or so - at higher
>> than that, I've gotten both slower speeds for some sizes, and larger
>> gains for others. But only better by 5-10 docs a second up to a gig. But
>> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
>> with that, at over a gig it starts to page and the performance gets hit.
>> I'd love to see what kind of benefit you see going from around a gig to
>> just under 2.
>>
>> Peter Keegan wrote:
>>
>>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
>>> optimization in just under 30 min.
>>> I used setRAMBufferSizeMB=1.9G
>>>
>>> Peter
>>>
>>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>>
>>>> A handful of the source documents did contain the U+FFFF character. The
>>>> patch from *LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>*
>>>> fixed the problem.
>>>> Thanks Mike!
>>>>
>>>> Peter
>>>>
>>>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>>
>>>>> Hmm, only a few affected terms, and all this particular
>>>>> "literals:cfid196$" term, with optional suffixes. Really strange.
>>>>>
>>>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>>>> twice, which should never happen (every unique term should be stored
>>>>> only once, in the terms dict).
>>>>>
>>>>> And, otherwise, CheckIndex got through the index just fine.
>>>>>
>>>>> Try searching a TermQuery with these affected terms and see if it
>>>>> succeeds? If so, maybe try making an index with one or two of
>>>>> them, alone, and see if that index shows the problem?
>>>>>
>>>>> OK I'm attaching more mods. Can you re-run your CheckIndex? It will
>>>>> produce an enormous amount of output, but if you can excise the few
>>>>> lines around when that warning comes out & post back that'd be great.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>>>>
>>>>>> Just to be safe, I ran with the official jar file from one of the mirrors
>>>>>> and reproduced the problem.
>>>>>> The debug session is not showing any characters = '\uffff' (checking this in
>>>>>> Tokenizer).
>>>>>> The output from the modified CheckIndex follows. There are only a few terms
>>>>>> with the inconsistency. They are all legitimate terms from the app's
>>>>>> context. With this info, I might be able to isolate the source documents.
>>>>>> What should I be looking for when they are indexed?
>>>>>>
>>>>>> CheckIndex output:
>>>>>>
>>>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>>>>
>>>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>>>>   1 of 3: name=_0 docCount=413585
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=1,148.817
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>     docStoreOffset=0
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>>>>>>     test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>>   2 of 3: name=_1 docCount=359068
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=1,125.161
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>     docStoreOffset=413585
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...WARNING: term literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>>>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>>>>>     test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>>   3 of 3: name=_2 docCount=304849
>>>>>>     compound=false
>>>>>>     hasProx=true
>>>>>>     numFiles=8
>>>>>>     size (MB)=962.004
>>>>>>     diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>>>     docStoreOffset=772653
>>>>>>     docStoreSegment=_0
>>>>>>     docStoreIsCompoundFile=false
>>>>>>     no deletions
>>>>>>     test: open reader.........OK
>>>>>>     test: fields..............OK [33 fields]
>>>>>>     test: field norms.........OK [33 fields]
>>>>>>     test: terms, freq, prox...WARNING: term contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>>>>>> WARNING: term literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>>>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>>>>>>     test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>>>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>>>
>>>>>> No problems were detected with this index.
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>>>>
>>>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
>>>>>>>> (LUCENE-1986).
>>>>>>>
>>>>>>> That patch certainly shouldn't lead to this.
>>>>>>>
>>>>>>>> It's possible that our content contains U+FFFF.
>>>>>>>> I will run in debugger and see.
>>>>>>>
>>>>>>> OK may as well check just so we cover all possibilities.
>>>>>>>
>>>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
>>>>>>>> unfortunately.
>>>>>>>
>>>>>>> OK, maybe we can modify your CheckIndex instead. Let's start with
>>>>>>> this, which prints a warning whenever the docFreq differs but
>>>>>>> otherwise continues (vs throwing RuntimeException). I'm curious how
>>>>>>> many terms show this, and whether the TermEnum keeps working after
>>>>>>> this term that has different docFreq:
>>>>>>>
>>>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>>>>>>> ===================================================================
>>>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 829889)
>>>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>>>>>>> @@ -672,8 +672,8 @@
>>>>>>>          }
>>>>>>>
>>>>>>>          if (freq0 + delCount != docFreq) {
>>>>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
>>>>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>>> +          System.out.println("WARNING: term " + term + " docFreq=" +
>>>>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>>>          }
>>>>>>>        }
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>

--
- Mark

http://www.lucidimagination.com
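
For anyone who lands on this thread with the same docFreq mismatch: the cause reported above was a handful of source documents containing the U+FFFF character. Below is a minimal sketch of a pre-indexing check, assuming you can get each document's text as a String before handing it to the indexer. The class and method names (FfffScan, count, sanitize) are hypothetical and not part of Lucene, and substituting U+FFFD is an assumption of this sketch, not something stated in the thread.

```java
public class FfffScan {
    // U+FFFF is a Unicode noncharacter; per this thread, a few source documents contained it.
    static final char NONCHAR = '\uFFFF';
    // U+FFFD is the standard Unicode replacement character (assumed substitute in this sketch).
    static final char REPLACEMENT = '\uFFFD';

    /** Counts how many times U+FFFF occurs in the given text. */
    public static int count(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == NONCHAR) {
                n++;
            }
        }
        return n;
    }

    /** Returns a copy of the text with every U+FFFF replaced by U+FFFD. */
    public static String sanitize(String text) {
        return text.replace(NONCHAR, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Made-up sample text, loosely modelled on the terms in the CheckIndex output.
        String doc = "cfid196$" + NONCHAR + "leader";
        System.out.println("U+FFFF occurrences: " + count(doc));
        System.out.println("clean after sanitize: " + (count(sanitize(doc)) == 0));
    }
}
```

Running count over the raw text of each source document is one way to isolate the affected documents; sanitize is only needed if you cannot apply the LUCENE-2016 patch.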