Mark,

With a 1.9G buffer, I had to increase the JVM heap significantly (to 8G) to
avoid paging and GC hits. Here is a table comparing indexing time, optimize
time, and peak memory usage as a function of the RAM buffer size
(setRAMBufferSizeMB). This was run on a 64-bit server with 32GB RAM:

RamSize    Index(min)    Optimize(min)    Max VM
1.9G       24            5                5G
800M       24            5                4G

Not much difference. I'll make a couple more runs with lower values.
Btw, the indexing times are really about 5 min. shorter because of some
non-Lucene related delays after the last document.
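
To put these timings next to the docs-per-second figures quoted below, here is
a minimal sketch of the arithmetic in plain Java. It assumes both runs indexed
the same 1.07 million document corpus mentioned earlier in the thread, and it
treats the 5 min. of non-Lucene delay as a rough estimate. The class and method
names are hypothetical and are not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: converts a wall-clock indexing
// time into an approximate docs/sec throughput figure.
final class IndexingThroughput {

    // Assumption: both table rows indexed the ~1.07 million document corpus
    // mentioned earlier in the thread.
    static final long DOC_COUNT = 1_070_000L;

    /** Documents per second for a run that took the given number of minutes. */
    static double docsPerSecond(long docCount, double minutes) {
        if (minutes <= 0) {
            throw new IllegalArgumentException("minutes must be positive");
        }
        return docCount / (minutes * 60.0);
    }

    public static void main(String[] args) {
        double reportedMinutes = 24.0; // Index(min) column, same for both rows
        double delayMinutes = 5.0;     // rough estimate of the non-Lucene delay

        System.out.printf("as reported:   %.0f docs/sec%n",
                docsPerSecond(DOC_COUNT, reportedMinutes));
        System.out.printf("delay removed: %.0f docs/sec%n",
                docsPerSecond(DOC_COUNT, reportedMinutes - delayMinutes));
    }
}
```

At this scale one minute of indexing time corresponds to roughly 30 docs/sec,
so a gain of 5-10 docs/sec would not show up in whole-minute timings.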

Peter



On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Any chance I could get you to try that again with a buffer of like 800MB
> to a gig and do a comparison?
>
> I've been investigating the returns you get with a larger buffer size.
> It appears to be pretty diminishing returns over 100MB or so - at higher
> than that, I've gotten both slower speeds for some sizes, and larger
> gains for others. But only better by 5-10 docs a second up to a gig. But
> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> with that, at over a gig it starts to page and the performance gets hit.
> I'd love to see what kind of benefit you see going from around a gig to
> just under 2.
>
> Peter Keegan wrote:
> > Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> > optimization in just under 30 min.
> > I used setRAMBufferSizeMB=1.9G
> >
> > Peter
> >
> > On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >
> >> A handful of the source documents did contain the U+FFFF character. The
> >> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
> >> fixed the problem.
> >> Thanks Mike!
> >>
> >> Peter
> >>
> >>
> >> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
> >> luc...@mikemccandless.com> wrote:
> >>
> >>
> >>> Hmm, only a few affected terms, and all this particular
> >>> "literals:cfid196$" term, with optional suffixes.  Really strange.
> >>>
> >>> One thing that's odd is the exact term "literals:cfid196$" is printed
> >>> twice, which should never happen (every unique term should be stored
> >>> only once, in the terms dict).
> >>>
> >>> And, otherwise, CheckIndex got through the index just fine.
> >>>
> >>> Can you try running a TermQuery on these affected terms and see if it
> >>> succeeds?  If so, maybe try making an index with just one or two of
> >>> them and see if that index shows the problem?
> >>>
> >>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> >>> produce an enormous amount of output, but if you can excise the few
> >>> lines around when that warning comes out & post back that'd be great.
> >>>
> >>> Mike
> >>>
> >>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>
> >>>> Just to be safe, I ran with the official jar file from one of the mirrors
> >>>> and reproduced the problem.
> >>>> The debug session is not showing any characters = '\uffff' (checking this in
> >>>> Tokenizer).
> >>>> The output from the modified CheckIndex follows. There are only a few terms
> >>>> with the inconsistency. They are all legitimate terms from the app's
> >>>> context. With this info, I might be able to isolate the source documents.
> >>>> What should I be looking for when they are indexed?
> >>>>
> >>>> CheckIndex output:
> >>>>
> >>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >>>>
> >>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
> >>>>  1 of 3: name=_0 docCount=413585
> >>>>    compound=false
> >>>>    hasProx=true
> >>>>    numFiles=8
> >>>>    size (MB)=1,148.817
> >>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>    docStoreOffset=0
> >>>>    docStoreSegment=_0
> >>>>    docStoreIsCompoundFile=false
> >>>>    no deletions
> >>>>    test: open reader.........OK
> >>>>    test: fields..............OK [33 fields]
> >>>>    test: field norms.........OK [33 fields]
> >>>>    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
> >>>>    test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
> >>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>>  2 of 3: name=_1 docCount=359068
> >>>>    compound=false
> >>>>    hasProx=true
> >>>>    numFiles=8
> >>>>    size (MB)=1,125.161
> >>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>    docStoreOffset=413585
> >>>>    docStoreSegment=_0
> >>>>    docStoreIsCompoundFile=false
> >>>>    no deletions
> >>>>    test: open reader.........OK
> >>>>    test: fields..............OK [33 fields]
> >>>>    test: field norms.........OK [33 fields]
> >>>>    test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
> >>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
> >>>>    test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
> >>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>>  3 of 3: name=_2 docCount=304849
> >>>>    compound=false
> >>>>    hasProx=true
> >>>>    numFiles=8
> >>>>    size (MB)=962.004
> >>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>>>    docStoreOffset=772653
> >>>>    docStoreSegment=_0
> >>>>    docStoreIsCompoundFile=false
> >>>>    no deletions
> >>>>    test: open reader.........OK
> >>>>    test: fields..............OK [33 fields]
> >>>>    test: field norms.........OK [33 fields]
> >>>>    test: terms, freq, prox...WARNING: term  contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
> >>>> WARNING: term  literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
> >>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
> >>>>    test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
> >>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
> >>>>
> >>>> No problems were detected with this index.
> >>>>
> >>>> Peter
> >>>>
> >>>>
> >>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <
> >>>> luc...@mikemccandless.com> wrote:
> >>>>
> >>>>
> >>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >>>>>
> >>>>>> The only change I made to the source code was the patch for
> >>>>>> PayloadNearQuery (LUCENE-1986).
> >>>>>
> >>>>> That patch certainly shouldn't lead to this.
> >>>>>
> >>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
> >>>>>> see.
> >>>>>
> >>>>> OK may as well check just so we cover all possibilities.
> >>>>>
> >>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
> >>>>>> unfortunately.
> >>>>>
> >>>>> OK, maybe we can modify your CheckIndex instead.  Let's start with
> >>>>> this, which prints a warning whenever the docFreq differs but
> >>>>> otherwise continues (vs throwing RuntimeException).  I'm curious how
> >>>>> many terms show this, and whether the TermEnum keeps working after
> >>>>> this term that has different docFreq:
> >>>>>
> >>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
> >>>>> ===================================================================
> >>>>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 829889)
> >>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
> >>>>> @@ -672,8 +672,8 @@
> >>>>>         }
> >>>>>
> >>>>>         if (freq0 + delCount != docFreq) {
> >>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
> >>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>> +          System.out.println("WARNING: term  " + term + " docFreq=" +
> >>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
> >>>>>         }
> >>>>>       }
> >>>>>
> >>>>> Mike
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>
> >>>>>
> >>>>>
> >>
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>
