[ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499280 ]
Marvin Humphrey commented on LUCENE-888:
----------------------------------------

I have some auxiliary data points to report after experimenting with buffer size in KS today on three different systems: OS X 10.4.9, FreeBSD 5.3, and an old RedHat 9 box.

The FS i/o classes in KinoSearch use a FILE* and fopen/fwrite/fread/fseek/ftell, rather than file descriptors and the POSIX family of functions. Theoretically, this is wasteful because FILE* stream i/o is buffered, so there's double buffering happening. I've meant to change that for some time. However, when I've used setvbuf(self->fhandle, NULL, _IONBF, 0) to eliminate the buffer for the underlying FILE* object, performance tanks -- indexing time doubles. I still don't understand exactly why, but I know a little more now.

  * Swapping out the FILE* for a descriptor and switching all the I/O calls to POSIX variants has no measurable impact on any of these systems.
  * Changing the KS buffer size from 1024 to 4096 has no measurable impact on any of these systems.
  * Using setvbuf to eliminate the buffering at output turns out to have no impact on indexing performance. It's only killing off the read mode FILE* buffer that causes the problem.

So, it seems that the only change that has any effect moves the numbers in the wrong direction.

The results are somewhat puzzling because I would ordinarily have blamed sub-optimal flush/refill scheduling in my app for the degraded performance with setvbuf() on read mode. However, the POSIX i/o calls are unbuffered too, and switching to them made no difference, so that's not it. My best guess is that disabling buffering for read mode disables an fseek/ftell optimization.
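For anyone who wants to poke at the read-mode case in isolation, here is a minimal sketch. It is not the KinoSearch code: write_test_file, read_in_chunks, the scratch file, and the chunk size are all hypothetical names and values picked for illustration. It reads a file in small chunks and calls ftell after each read, either with the default stdio buffer or with the buffer disabled via setvbuf, so the two modes can be timed separately (for example with clock()).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write `count` bytes of filler to `path` so there is something to read back.
   Returns 0 on success, -1 on failure. */
int write_test_file(const char *path, size_t count) {
    FILE *out = fopen(path, "wb");
    if (out == NULL) return -1;
    for (size_t i = 0; i < count; i++) {
        if (fputc((int)(i & 0xFF), out) == EOF) {
            fclose(out);
            return -1;
        }
    }
    return fclose(out) == 0 ? 0 : -1;
}

/* Read `path` in `chunk`-byte pieces, calling ftell after every read to
   mimic an app that tracks its own file position. If `unbuffered` is
   nonzero, the stdio buffer is disabled before the first read.
   Returns the number of bytes read, or -1 on any error. */
long read_in_chunks(const char *path, size_t chunk, int unbuffered) {
    char buf[4096];
    assert(chunk > 0 && chunk <= sizeof buf);

    FILE *in = fopen(path, "rb");
    if (in == NULL) return -1;

    /* setvbuf must come before any other operation on the stream. */
    if (unbuffered && setvbuf(in, NULL, _IONBF, 0) != 0) {
        fclose(in);
        return -1;
    }

    long total = 0;
    size_t got;
    while ((got = fread(buf, 1, chunk, in)) > 0) {
        total += (long)got;
        /* The reported position should match what we have consumed. */
        if (ftell(in) != total) {
            fclose(in);
            return -1;
        }
    }
    fclose(in);
    return total;
}
```

Calling read_in_chunks(path, 16, 0) and read_in_chunks(path, 16, 1) on the same file should return the same byte count. Only the elapsed time is expected to differ, and by how much will depend on the platform's stdio implementation.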
> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch, LUCENE-888.take2.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput). Second is CompoundFileWriter's buffer used to
> actually build the compound file. Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843. I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English). I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields. I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845. The resulting index is 1.7 GB. The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now. During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there. And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.