A couple more data points:

RamSize    Index(min)    Optimize(min)    Peak mem
1.9G       24            5                5G
800M       24            5                4G
400M       25            5                3.5G
100M       25            5                3G
50M        26            4                3G
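
For comparison with the docs/second figures mentioned earlier in the thread, the timings above can be converted to rough throughput. This is only a minimal sketch: it assumes every run indexed the same 1.07 million documents (not stated explicitly for each run), it uses the raw Index(min) values, which still include the roughly 5 min of non-Lucene delay, and the class and method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: converts the table's timings
// into rough throughput figures.
final class RamBufferRuns {
    // Assumption: every run indexed the same 1.07 million documents.
    static final long DOC_COUNT = 1_070_000L;

    /** Documents indexed per second for a run that took the given number of minutes. */
    static double docsPerSecond(long docs, double indexMinutes) {
        if (indexMinutes <= 0) {
            throw new IllegalArgumentException("indexMinutes must be positive");
        }
        return docs / (indexMinutes * 60.0);
    }

    public static void main(String[] args) {
        // Columns copied from the table above: RamSize, Index(min), Peak mem (GB).
        String[] ramSize = {"1.9G", "800M", "400M", "100M", "50M"};
        double[] indexMin = {24, 24, 25, 25, 26};
        double[] peakGb = {5, 4, 3.5, 3, 3};

        System.out.printf("%-8s %10s %12s%n", "RamSize", "docs/sec", "Peak mem(G)");
        for (int i = 0; i < ramSize.length; i++) {
            System.out.printf("%-8s %10.0f %12.1f%n",
                    ramSize[i], docsPerSecond(DOC_COUNT, indexMin[i]), peakGb[i]);
        }
    }
}
```

Under those assumptions the 1.9G run works out to about 743 docs/sec and the 50M run to about 686 docs/sec, while peak memory goes from 3G to 5G.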

Peter


On Thu, Oct 29, 2009 at 8:49 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Thanks a lot Peter! Really appreciate it.
>
> Peter Keegan wrote:
> > Mark,
> >
> > With 1.9G, I had to increase the JVM heap significantly (to 8G) to avoid
> > paging and GC hits. Here is a table comparing indexing times, optimizing
> > times and peak memory usage as a function of the RAMBufferSize. This was
> > run on a 64-bit server with 32GB RAM:
> >
> > RamSize    Index(min)    Optimize(min)    Max VM
> > 1.9G       24            5                5G
> > 800M       24            5                4G
> >
> > Not much difference. I'll make a couple more runs with lower values.
> > Btw, the indexing times are really about 5 min. shorter because of some
> > non-Lucene related delays after the last document.
> >
> > Peter
> >
> >
> >
> > On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >
> >> Any chance I could get you to try that again with a buffer of like 800MB
> >> to a gig and do a comparison?
> >>
> >> I've been investigating the returns you get with a larger buffer size.
> >> It appears to be pretty diminishing returns over 100MB or so - at higher
> >> than that, I've gotten both slower speeds for some sizes, and larger
> >> gains for others. But only better by 5-10 docs a second up to a gig. But
> >> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> >> with that, at over a gig it starts to page and the performance gets hit.
> >> I'd love to see what kind of benefit you see going from around a gig to
> >> just under 2.
> >>
> >> Peter Keegan wrote:
> >>
> >>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> >>> optimization in just under 30 min.
> >>> I used setRAMBufferSizeMB=1.9G
> >>>
> >>> Peter
> >>>
> >>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>>> A handful of the source documents did contain the U+FFFF character. The
> >>>> patch from LUCENE-2016
> >>>> <https://issues.apache.org/jira/browse/LUCENE-2016> fixed the problem.
> >>>> Thanks Mike!
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
> >>>> luc...@mikemccandless.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Hmm, only a few affected terms, and all this particular
> >>>>> "literals:cfid196$" term, with optional suffixes.  Really strange.
> >>>>>
> >>>>> One thing that's odd is the exact term "literals:cfid196$" is printed
> >>>>> twice, which should never happen (every unique term should be stored
> >>>>> only once, in the terms dict).
> >>>>>
> >>>>> And, otherwise, CheckIndex got through the index just fine.
> >>>>>
> >>>>> Try searching a TermQuery with these affected terms and see if it
> >>>>> succeeds?  If so, maybe try making an index with one or two of
> >>>>> them, alone, and see if that index shows the problem?
> >>>>>
> >>>>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> >>>>> produce an enormous amount of output, but if you can excise the few
> >>>>> lines around when that warning comes out & post back that'd be great.
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>
> >>>>>
> >>>>>> Just to be safe, I ran with the official jar file from one of the mirrors
> >>>>>> and reproduced the problem.
> >>>>>> The debug session is not showing any characters = '\uffff' (checking this in
> >>>>>> Tokenizer).
> >>>>>> The output from the modified CheckIndex follows. There are only a few terms
> >>>>>> with the inconsistency. They are all legitimate terms from the app's
> >>>>>> context. With this info, I might be able to isolate the source documents.
> >>>>>> What should I be looking for when they are indexed?
> >>>>>>
> >>>>>> CheckIndex output:
> >>>>>>
> >>>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >>>>>>
> >>>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
> >>>>>>  1 of 3: name=_0 docCount=413585
> >>>>>>    compound=false
> >>>>>>    hasProx=true
> >>>>>>    numFiles=8
> >>>>>>    size (MB)=1,148.817
> >>>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>    docStoreOffset=0
> >>>>>>    docStoreSegment=_0
> >>>>>>    docStoreIsCompoundFile=false
> >>>>>>    no deletions
> >>>>>>    test: open reader.........OK
> >>>>>>    test: fields..............OK [33 fields]
> >>>>>>    test: field norms.........OK [33 fields]
> >>>>>>    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
> >>>>>>    test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
> >>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>>  2 of 3: name=_1 docCount=359068
> >>>>>>    compound=false
> >>>>>>    hasProx=true
> >>>>>>    numFiles=8
> >>>>>>    size (MB)=1,125.161
> >>>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>    docStoreOffset=413585
> >>>>>>    docStoreSegment=_0
> >>>>>>    docStoreIsCompoundFile=false
> >>>>>>    no deletions
> >>>>>>    test: open reader.........OK
> >>>>>>    test: fields..............OK [33 fields]
> >>>>>>    test: field norms.........OK [33 fields]
> >>>>>>    test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
> >>>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
> >>>>>>    test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
> >>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>>  3 of 3: name=_2 docCount=304849
> >>>>>>    compound=false
> >>>>>>    hasProx=true
> >>>>>>    numFiles=8
> >>>>>>    size (MB)=962.004
> >>>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>>>    docStoreOffset=772653
> >>>>>>    docStoreSegment=_0
> >>>>>>    docStoreIsCompoundFile=false
> >>>>>>    no deletions
> >>>>>>    test: open reader.........OK
> >>>>>>    test: fields..............OK [33 fields]
> >>>>>>    test: field norms.........OK [33 fields]
> >>>>>>    test: terms, freq, prox...WARNING: term  contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
> >>>>>> WARNING: term  literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
> >>>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
> >>>>>>    test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
> >>>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>>>
> >>>>>> No problems were detected with this index.
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <
> >>>>>> luc...@mikemccandless.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
> >>>>>>>> (LUCENE-1986).
> >>>>>>>
> >>>>>>> That patch certainly shouldn't lead to this.
> >>>>>>>
> >>>>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
> >>>>>>>> see.
> >>>>>>>
> >>>>>>> OK may as well check just so we cover all possibilities.
> >>>>>>>
> >>>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
> >>>>>>>> unfortunately.
> >>>>>>>
> >>>>>>> OK, maybe we can modify your CheckIndex instead.  Let's start with
> >>>>>>> this, which prints a warning whenever the docFreq differs but
> >>>>>>> otherwise continues (vs throwing RuntimeException).  I'm curious how
> >>>>>>> many terms show this, and whether the TermEnum keeps working after
> >>>>>>> this term that has different docFreq:
> >>>>>>>
> >>>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
> >>>>>>> ===================================================================
> >>>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 829889)
> >>>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
> >>>>>>> @@ -672,8 +672,8 @@
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         if (freq0 + delCount != docFreq) {
> >>>>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
> >>>>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>>> +          System.out.println("WARNING: term  " + term + " docFreq=" +
> >>>>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>>>         }
> >>>>>>>       }
> >>>>>>>
> >>>>>>> Mike
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>
