Any chance I could get you to try that again with a buffer of around
800 MB to 1 GB and compare the results?

I've been investigating the returns you get from a larger buffer size.
They appear to diminish sharply above 100 MB or so: beyond that, I've
seen slower speeds for some sizes and larger gains for others, but the
gains are only 5-10 docs per second all the way up to 1 GB. I can't
reliably test above 1 GB, though. I have only 4 GB of RAM, and even with
that, at over 1 GB the machine starts to page and performance takes a
hit. I'd love to see what kind of benefit you get going from around
1 GB to just under 2 GB.
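
For what it's worth, this is roughly the kind of comparison harness I
have in mind. It is only a sketch: `indexRun` is a hypothetical callback
(not a Lucene API) that would call IndexWriter.setRAMBufferSizeMB with
the given value, index the same documents each time, and return the doc
count. The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleToLongFunction;

final class BufferSizeComparison {

    /** Throughput in docs per second for docCount docs indexed in elapsedNanos. */
    static double docsPerSecond(long docCount, long elapsedNanos) {
        return docCount / (elapsedNanos / 1_000_000_000.0);
    }

    /**
     * Runs the indexing job once per buffer size and records the throughput.
     * indexRun is an assumption of this sketch: it receives the RAM buffer
     * size in MB, builds the index, and returns the number of docs indexed.
     */
    static Map<Double, Double> compare(double[] bufferSizesMB, DoubleToLongFunction indexRun) {
        Map<Double, Double> results = new LinkedHashMap<>();
        for (double mb : bufferSizesMB) {
            long start = System.nanoTime();
            long docs = indexRun.applyAsLong(mb);
            // Guard against a zero elapsed time from a trivially fast stub.
            long elapsed = Math.max(1L, System.nanoTime() - start);
            results.put(mb, docsPerSecond(docs, elapsed));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stub job so the sketch runs on its own; replace with real indexing.
        DoubleToLongFunction stub = mb -> 1_000L;
        Map<Double, Double> results = compare(new double[] {100, 800, 1000, 1900}, stub);
        results.forEach((mb, rate) ->
            System.out.println(mb + " MB buffer: " + rate + " docs/sec"));
    }
}
```

Running each size more than once and on a freshly started JVM would
make the numbers more trustworthy, since paging and warm-up can easily
swamp a 5-10 docs/sec difference.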

Peter Keegan wrote:
> Btw, this 2.9 indexer is fast! I indexed 4 GB (1.07 million docs) with
> optimization in just under 30 min.
> I used setRAMBufferSizeMB=1.9G
>
> Peter
>
> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>
>   
>> A handful of the source documents did contain the U+FFFF character. The
>> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
>> fixed the problem.
>> Thanks Mike!
>>
>> Peter
>>
>>
>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>     
>>> Hmm, only a few affected terms, and all this particular
>>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>>
>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>> twice, which should never happen (every unique term should be stored
>>> only once, in the terms dict).
>>>
>>> And, otherwise, CheckIndex got through the index just fine.
>>>
>>> Try searching a TermQuery with these affected terms and see if it
>>> succeeds?  If so, maybe try making an index with one or two of
>>> them, alone, and see if that index shows the problem?
>>>
>>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>>> produce an enormous amount of output, but if you can excise the few
>>> lines around when that warning comes out & post back that'd be great.
>>>
>>> Mike
>>>
>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com>
>>> wrote:
>>>       
>>>> Just to be safe, I ran with the official jar file from one of the mirrors
>>>> and reproduced the problem.
>>>> The debug session is not showing any characters = '\uffff' (checking this in
>>>> Tokenizer).
>>>> The output from the modified CheckIndex follows. There are only a few terms
>>>> with the inconsistency. They are all legitimate terms from the app's
>>>> context. With this info, I might be able to isolate the source documents.
>>>> What should I be looking for when they are indexed?
>>>>
>>>> CheckIndex output:
>>>>
>>>> Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>>>>  1 of 3: name=_0 docCount=413585
>>>>    compound=false
>>>>    hasProx=true
>>>>    numFiles=8
>>>>    size (MB)=1,148.817
>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>    docStoreOffset=0
>>>>    docStoreSegment=_0
>>>>    docStoreIsCompoundFile=false
>>>>    no deletions
>>>>    test: open reader.........OK
>>>>    test: fields..............OK [33 fields]
>>>>    test: field norms.........OK [33 fields]
>>>>    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>>>>    test: stored fields.......OK [1240755 total field count; avg 3 fields
>>>> per doc]
>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>> vector fields per doc]
>>>>
>>>>  2 of 3: name=_1 docCount=359068
>>>>    compound=false
>>>>    hasProx=true
>>>>    numFiles=8
>>>>    size (MB)=1,125.161
>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>    docStoreOffset=413585
>>>>    docStoreSegment=_0
>>>>    docStoreIsCompoundFile=false
>>>>    no deletions
>>>>    test: open reader.........OK
>>>>    test: fields..............OK [33 fields]
>>>>    test: field norms.........OK [33 fields]
>>>>    test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>>>> OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>>>    test: stored fields.......OK [1077204 total field count; avg 3 fields
>>>> per doc]
>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>> vector fields per doc]
>>>>
>>>>  3 of 3: name=_2 docCount=304849
>>>>    compound=false
>>>>    hasProx=true
>>>>    numFiles=8
>>>>    size (MB)=962.004
>>>>    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>>>    docStoreOffset=772653
>>>>    docStoreSegment=_0
>>>>    docStoreIsCompoundFile=false
>>>>    no deletions
>>>>    test: open reader.........OK
>>>>    test: fields..............OK [33 fields]
>>>>    test: field norms.........OK [33 fields]
>>>>    test: terms, freq, prox...WARNING: term  contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>>>> WARNING: term  literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>>>> OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>>>>    test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>>>>    test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>> vector fields per doc]
>>>>
>>>> No problems were detected with this index.
>>>>
>>>> Peter
>>>>
>>>>
>>>> On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <
>>>> luc...@mikemccandless.com> wrote:
>>>>
>>>>         
>>>>> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The only change I made to the source code was the patch for PayloadNearQuery
>>>>>> (LUCENE-1986).
>>>>>
>>>>> That patch certainly shouldn't lead to this.
>>>>>
>>>>>> It's possible that our content contains U+FFFF. I will run in debugger and
>>>>>> see.
>>>>>
>>>>> OK may as well check just so we cover all possibilities.
>>>>>
>>>>>> The data is 'sensitive', so I may not be able to provide a bad segment,
>>>>>> unfortunately.
>>>>>
>>>>> OK, maybe we can modify your CheckIndex instead.  Let's start with
>>>>> this, which prints a warning whenever the docFreq differs but
>>>>> otherwise continues (vs throwing RuntimeException).  I'm curious how
>>>>> many terms show this, and whether the TermEnum keeps working after
>>>>> this term that has different docFreq:
>>>>>
>>>>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>>>>> ===================================================================
>>>>> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 829889)
>>>>> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
>>>>> @@ -672,8 +672,8 @@
>>>>>         }
>>>>>
>>>>>         if (freq0 + delCount != docFreq) {
>>>>> -          throw new RuntimeException("term " + term + " docFreq=" +
>>>>> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>> +          System.out.println("WARNING: term  " + term + " docFreq=" +
>>>>> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>>>>>         }
>>>>>       }
>>>>>
>>>>> Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>>           
>>     
>
>   
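
For anyone else who hits the docFreq warnings quoted above: since the
cause turned out to be U+FFFF in a handful of source documents, a check
along these lines before indexing may help find them. This is a
plain-Java sketch with made-up names, not part of Lucene. Substituting
U+FFFD is my own choice for the example, so look at the LUCENE-2016
patch for what the actual fix does.

```java
final class InvalidCharScan {

    /** Returns the index of the first U+FFFF in text, or -1 if there is none. */
    static int firstInvalidChar(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFF') {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns text with every U+FFFF swapped for U+FFFD (the Unicode
     * replacement character). The choice of substitute is an assumption
     * of this sketch.
     */
    static String scrub(String text) {
        return text.replace('\uFFFF', '\uFFFD');
    }

    public static void main(String[] args) {
        String sample = "cfid196$\uFFFFleader";
        int pos = firstInvalidChar(sample);
        if (pos >= 0) {
            System.out.println("U+FFFF found at offset " + pos);
            sample = scrub(sample);
        }
        System.out.println("clean: " + (firstInvalidChar(sample) == -1));
    }
}
```

Logging the document id together with the offset should be enough to
isolate the affected source documents without having to share any of
the sensitive content.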


-- 
- Mark

http://www.lucidimagination.com



