[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Doug Cutting (JIRA) Fri, 25 May 2007 12:56:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499212
 ]


Doug Cutting commented on LUCENE-888:
-------------------------------------

> then we don't save IO by limiting the buffer size to 1 KB

I'm confused by this.  My assumption is that when you request 1k from a disk 
file, the OS reads substantially more than 1k from the disk and places it in 
the buffer cache.  (The cost of randomly reading 1k is nearly the same as 
randomly reading 100k--both are dominated by seek.)  So if you request another 
1k shortly thereafter, you'll get it from the buffer cache, and the 
incremental cost will be that of making a system call.

In general, one should thus rely on the buffer cache and read-ahead, and make 
input buffers only big enough that system call overhead is insignificant.  
An alternative strategy is to not trust the buffer cache and read-ahead, but 
rather to make your buffers large enough that transfer time dominates seek 
time.  This can require 1MB or larger buffers, so it isn't always practical.
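
As a rough illustration of the seek-versus-transfer argument, here is a minimal 
sketch in Java.  The 10 ms seek time and 100 MB/s transfer rate are assumed 
figures picked only for illustration; they are not measurements from this 
issue, and the class and method names are hypothetical.

```java
public class SeekVsTransfer {
    // Assumed figures for illustration only; not taken from the issue.
    static final double SEEK_MS = 10.0;              // hypothetical seek + rotational latency
    static final double TRANSFER_MB_PER_SEC = 100.0; // hypothetical sequential throughput

    /** Time in ms to transfer the given number of bytes at the assumed rate. */
    static double transferMs(long bytes) {
        return bytes / (TRANSFER_MB_PER_SEC * 1024 * 1024) * 1000.0;
    }

    /** Time in ms for one random read: one seek plus the transfer. */
    static double randomReadMs(long bytes) {
        return SEEK_MS + transferMs(bytes);
    }

    public static void main(String[] args) {
        long[] sizes = {1024L, 100L * 1024, 1024L * 1024};
        for (long size : sizes) {
            System.out.printf("%8d bytes: transfer %.3f ms, random read %.3f ms%n",
                    size, transferMs(size), randomReadMs(size));
        }
    }
}
```

Under these assumptions a random 1k read and a random 100k read differ by 
less than 1 ms, while the transfer time only catches up with the seek time at 
about 1MB.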

So, back to your statement: a 1k buffer doesn't save physical i/o, but neither 
should it incur extra physical i/o.  It does incur extra system calls, but it 
uses less memory, which is a tradeoff.  Is that what you meant?
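
To make that tradeoff concrete, here is a small sketch in Java that counts 
buffer refills (roughly one read system call each) and the memory held by 
buffers.  The 64 MB file size and the 1,000 open streams are hypothetical 
numbers chosen for illustration, as are the class and method names; this is 
not Lucene code.

```java
public class BufferTradeoff {
    /** Refills needed to scan fileBytes sequentially with the given buffer size. */
    static long refills(long fileBytes, int bufferBytes) {
        return (fileBytes + bufferBytes - 1) / bufferBytes; // ceiling division
    }

    /** Memory held by buffers when openStreams inputs are alive at once. */
    static long bufferMemory(int openStreams, int bufferBytes) {
        return (long) openStreams * bufferBytes;
    }

    public static void main(String[] args) {
        long fileBytes = 64L * 1024 * 1024; // hypothetical 64 MB file
        int openStreams = 1000;             // hypothetical number of live inputs
        for (int buf : new int[] {1024, 8192, 16384}) {
            System.out.printf("buffer %5d: %6d refills, %9d bytes of buffers%n",
                    buf, refills(fileBytes, buf), bufferMemory(openStreams, buf));
        }
    }
}
```

Going from a 1k to an 8k buffer cuts the refill count by a factor of eight and 
multiplies the memory held by buffers by the same factor.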

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
