Introduction:

The I/O buffers of the kernel are currently allocated in buffer_map, which is sized statically at boot time and never grows. This limits the scalability of I/O performance on a host with large physical memory. We used to tune NBUF to cope with that problem. This workaround, however, wires down a lot of pages that are then unavailable to user processes, which is not acceptable for memory-bound applications.
In order to run both I/O-bound and memory-bound processes on the same host, it is essential to achieve:

A) allocation of buffers from kernel_map, to break the limit of a single map size, and

B) page reclaim from idle buffers, to regulate the number of wired pages.

The patch at

http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

implements buffer allocation from kernel_map and reclaim of buffer pages. With this patch, make kernel-depend && make kernel completes about 30-60 seconds faster on my PC.

Implementation in Detail:

A) is easy; first, s/buffer_map/kernel_map/. Since an arbitrary number of buffer pages can now be allocated dynamically, the buffer headers (struct buf) have to be allocated dynamically as well. They are glued together into a list so that they can be traversed by boot() et al.

In order to accomplish B), we must find buffers that neither the filesystem nor the I/O code will touch. The clean buffer queue holds such buffers. (Exception: if the vnode associated with a clean buffer is held by the namecache, it may still access the buffer page.) Thus, the pages of a buffer are unwired prior to enqueuing it on the clean queue, and wired down again in bremfree() if they have not been reclaimed in the meantime.

Although unwiring gives a page a chance of being reclaimed, we can go further. On Solaris, it is known that reclaiming file cache pages prior to the other kinds of pages (anonymous, executable, etc.) gives better performance. Mainly for lack of time to work on distinguishing the kind of page being unwired, I simply pass every unwired page to vm_page_dontneed(). This places most of the unwired buffer pages just one step away from the cache queue.
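To make the two pieces concrete, here is a rough sketch of the idea. It is not the code from dynamicbuf.diff: the helper names (buf_alloc_header, buf_unwire_pages, buf_rewire_pages), the dyn_buf wrapper and the M_DYNBUF malloc type are made up for illustration, and locking, KVA mapping and error handling are left out. Only vm_page_unwire(), vm_page_dontneed() and vm_page_wire() are the existing VM interfaces.

/*
 * Illustrative sketch only -- not the code from dynamicbuf.diff.
 * Helper names, the dyn_buf wrapper and M_DYNBUF are invented for
 * this example; locking/spl handling and KVA mapping are omitted.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/malloc.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

static MALLOC_DEFINE(M_DYNBUF, "dynbuf", "dynamically allocated buffer headers");

/*
 * A) Buffer headers are allocated on demand instead of from a static
 * array, and kept on a list so that boot() et al. can still walk
 * every buffer.
 */
struct dyn_buf {
        struct buf              db_buf;
        TAILQ_ENTRY(dyn_buf)    db_link;
};
static TAILQ_HEAD(, dyn_buf) dyn_buf_list = TAILQ_HEAD_INITIALIZER(dyn_buf_list);

static struct buf *
buf_alloc_header(void)
{
        struct dyn_buf *db;

        db = malloc(sizeof(*db), M_DYNBUF, M_WAITOK | M_ZERO);
        TAILQ_INSERT_TAIL(&dyn_buf_list, db, db_link);
        return (&db->db_buf);
}

/*
 * B) Before a clean buffer goes onto the clean queue, unwire its
 * pages and tell the VM system they are good reclaim candidates.
 */
static void
buf_unwire_pages(struct buf *bp)
{
        int i;

        for (i = 0; i < bp->b_npages; i++) {
                vm_page_unwire(bp->b_pages[i], 0);  /* drop the wiring, do not activate */
                vm_page_dontneed(bp->b_pages[i]);   /* nudge the page toward the cache queue */
        }
}

/*
 * When bremfree() pulls the buffer off the clean queue again, wire
 * the surviving pages back down.  Pages already taken by the page
 * scanner would have to be refetched; that path is not shown.
 */
static void
buf_rewire_pages(struct buf *bp)
{
        int i;

        for (i = 0; i < bp->b_npages; i++)
                vm_page_wire(bp->b_pages[i]);
}

As described above, the unwire step sits in the path that enqueues a clean buffer, and the rewire step in bremfree().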
Experimental Evaluation and Results:

The times taken to complete make kernel-depend && make kernel just after booting into single-user mode were measured with time(1) on my ThinkPad 600E (CPU: Pentium II 366MHz, RAM: 160MB). The number passed to the -j option of make(1) was varied from 1 to 30 in order to control the memory pressure from user processes. The baseline is the kernel without my patch. The following table shows the results; all times are in seconds.

        baseline                        w/ my patch
-j      real      user      sys         real      user      sys
 1      1608.21   1387.94   125.96      1577.88   1391.02   100.90
10      1576.10   1360.17   132.76      1531.79   1347.30   103.60
20      1568.01   1280.89   133.22      1509.36   1276.75   104.69
30      1923.42   1215.00   155.50      1865.13   1219.07   113.43

Most of the improvement in the real times comes from the speedup of the system calls. The hit ratio of getblk() may have increased as well, but this has not been examined yet.

Another interesting result is the number of swaps, shown below.

-j      baseline        w/ my patch
 1         0               0
10         0               0
20       141              77
30       530             465

Since the baseline kernel does not free buffer pages at all(*), it may be putting too much pressure on the pages.

(*) bfreekva() is called only when the whole KVA is too fragmented.

Userland Interfaces:

The sysctl variable vfs.bufspace now reports the size of the pages allocated for buffers, both wired and unwired. A new sysctl variable, vfs.bufwiredspace, tells the size of the buffer pages wired down. vfs.bufkvaspace returns the size of the KVA space for buffers.

Future Work:

The handling of unwired pages could be improved by scanning only buffer pages. In that case, we may have to run the vm page scanner more frequently, as Solaris does.

vfs.bufspace does not track the buffer pages reclaimed by the page scanner. They are counted only when the buffer associated with those pages is removed from the clean queue, which is too late.

Benchmark tools concentrating on disk I/O performance (bonnie, iozone, postmark, etc.) may be more suitable than make kernel for evaluation.

Comments and flames are welcome. Thanks a lot.

-- 
Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>