[ 
https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499226
 ] 

Michael McCandless commented on LUCENE-888:
-------------------------------------------


> we did some time ago a few tests and simply concluded, it boils down to what 
> Doug said, "It does incur extra system calls, but uses less memory, which is 
> a tradeoff."
>
> in *our* setup 4k was kind of magic number , ca 5-8% faster. I guess it is 
> actually a compromise between time spent in extra os calls vs probability of 
> reading more than you will really use (waste time on it). Why 4k number 
> happens often to be good compromise is probably the difference in speed of 
> buffer copy for 4k vs 1k being negligible compared to time spent on system 
> calls.
> 
> The only conclusion we came up to is that you have to measure it and find 
> good compromise. Our case is a bit atypical, short documents (1G index, 60Mio 
> docs) and queries with a lot of terms (80-200), Win 2003 Server, single disk.
>
> And I do not remember was it before or after we started using MMAP, so no 
> idea really if this affects MMAP setup, guess not. 

Interesting!  Do you remember if your 5-8% gain was for searching or
indexing?  This issue is focusing on buffer size impact on indexing
performance and LUCENE-893 is focusing on search performance.


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to