Re: Dynamic growth of the buffer and buffer page reclaim
Yes, this makes a lot of sense to me. You are exercising the system in a way that breaks the LRU algorithm. The buffer cache, without your patch, is carefully tuned to deal with this case... that is why vm_page_dontneed() exists and why the vm_object code calls it. This creates a little extra work when the buffer cache cycles, but prevents the system from reusing pages that it actually needs under certain types of load.

In particular, the situation the system is saving itself from by making this call is the one where a user is reading data file(s) sequentially which are far larger than can reasonably be cached. In that situation strict LRU operation would result in terrible performance, because the system would attempt to unconditionally cache data it is going to have to throw away anyway, and soon, displacing older cached data that it will actually need soon. LRU isn't always the best policy.

When you disable vm_page_dontneed(), the huge amount of data you are moving through the system creates a huge amount of pressure on the rest of the VM system, hence the slower performance when your data operations exceed what can reasonably be cached. This would also have a severely detrimental effect on production systems running real loads.

It's a tradeoff. The system is trading off some general CPU overhead in order to deal with a fairly common heavy-loading case, and in order to reduce the pressure on the VM system in situations (such as reading a large file sequentially) which have no business putting pressure on the VM system. E.g. the system is trying to avoid blowing away user B's cache when user A reads a huge file. Your patch is changing the tradeoff, but not really making things better overall. Sure, the buildworld test went faster, but that's just one type of load.

I am somewhat surprised at your 32MB tests. Are you sure you stabilized the dd before getting those timings?
It would take more than one run of the dd on the file to completely cache it (that's one of the effects of vm_page_dontneed()). Since the system can't predict whether a large file is going to be re-read over and over again, or just read once, or even how much data will be read, it depresses the priority of pages statistically, so it might take several full reads of the file for the system to realize that you really do want to cache the whole thing.

In any case, 32MB dd's should be fully cached in the buffer cache, with no rewiring of pages occurring at all, so I'm not sure why your patch is faster for that case. It shouldn't be. Or the 64MB case. The 96MB case is getting close to what your setup can cache reasonably. The pre-patch code can deal with it, but with your patch you are probably putting enough extra pressure on the VM system to force the pageout daemon to run earlier than it would without the patch.

The VM system is a very finely tuned beast. That isn't to say that it can't be improved, I'm sure it can, and I encourage you to play with it! But you have to be wary of it as well. The VM system is tuned primarily for performance under heavy loads. There is a slight loss of performance under light loads because of the extra management. You have to be sure not to screw up the heavy-load performance when running light-load benchmarks. A buildworld is a light-load benchmark, primarily because it execs programs so many times (the compiler) that there are a lot of free VM pages sitting around for it to use. Buildworlds do not load-test the VM system all that well!

A dd test is not supposed to load-test the VM system either. This is why we have vm_page_dontneed()... user B's cache shouldn't be blown away just because user A is reading a large file. We lose a little in a light-load test but gain a lot under real-world loads which put constant pressure on the VM system.

-Matt Matthew Dillon <[EMAIL PROTECTED]>

:I tried that on the same PC as my last benchmark.
:The PC has 160MB RAM, so I created a file of 256MB.
:
:One pre-read (in order to stabilize the buffer cache) and four read
:tests were run consecutively for each of six distinct read sizes just
:after boot. The average read times (in seconds) and speeds (in
:MB/sec) are shown below:
:
:             without my patch     with my patch
:read size    time     speed       time     speed
:32MB         .497     65.5        .471     69.0
:64MB         1.02     63.6        .901     72.1
:96MB         2.24     50.5        5.52     18.9
:128MB        20.7     6.19        16.5     7.79
:192MB        32.9     5.83        32.9     5.83
:256MB        42.5     6.02        43.0     5.95
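The tradeoff Matt describes above (depressing the priority of sequentially read pages so that one user's streaming read does not blow away another user's cache) can be illustrated with a toy page-cache simulation. This is a hedged sketch in Python, not kernel code, and the two-user workload is invented for illustration: user A streams a large file once while user B cycles over a small working set.

```python
from collections import OrderedDict

def simulate(capacity, trace, dontneed=False):
    """Toy page cache.  trace is a list of (user, page, is_scan) accesses.
    With dontneed=True, scan pages are inserted at the cold end of the
    queue, mimicking vm_page_dontneed() depressing their priority."""
    cache = OrderedDict()              # front = next victim, back = hottest
    hits = {}
    for user, page, is_scan in trace:
        if page in cache:
            hits[user] = hits.get(user, 0) + 1
            if not (dontneed and is_scan):
                cache.move_to_end(page)              # promote on re-use
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)            # evict coldest page
            cache[page] = True
            if dontneed and is_scan:
                cache.move_to_end(page, last=False)  # insert cold
    return hits

# User A streams 300 distinct pages (3 per step); user B cycles over a
# 50-page working set.  Cache capacity: 100 pages.
trace = []
for i in range(100):
    for j in range(3):
        trace.append(("A", "a%d" % (3 * i + j), True))
    trace.append(("B", "b%d" % (i % 50), False))

lru = simulate(100, trace)                 # strict LRU
dn  = simulate(100, trace, dontneed=True)  # with the dontneed-style hint
```

Under strict LRU, B's reuse distance (199 distinct pages between two touches of the same page) exceeds the 100-page cache, so B never gets a hit after the cold misses; with the dontneed-style hint, A's scan pages are sacrificed first and B's entire working set survives.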
Re: Dynamic growth of the buffer and buffer page reclaim
On Mon, 28 Oct 2002 00:54:57 -0800 (PST), Matthew Dillon <[EMAIL PROTECTED]> said:

dillon> I can demonstrate the issue with a simple test. Create a large file
dillon> with dd, larger than physical memory:
dillon> dd if=/dev/zero of=test bs=1m count=4096  # create a 4G file.
dillon> Then dd (read) portions of the file and observe the performance.
dillon> Do this several times to get stable numbers.
dillon> dd if=test of=/dev/null bs=1m count=16  # repeat several times
dillon> dd if=test of=/dev/null bs=1m count=32  # etc...
dillon> You will find that read performance will drop in two significant
dillon> places: (1) When the data no longer fits in the buffer cache and
dillon> the buffer cache is forced to teardown wirings and rewire other
dillon> pages from the VM page cache. Still no physical I/O is being done.
dillon> (2) When the data no longer fits in the VM page cache and the system
dillon> is forced to perform physical I/O.

I tried that on the same PC as my last benchmark. The PC has 160MB RAM, so I created a file of 256MB.

One pre-read (in order to stabilize the buffer cache) and four read tests were run consecutively for each of six distinct read sizes just after boot. The average read times (in seconds) and speeds (in MB/sec) are shown below:

             without my patch     with my patch
read size    time     speed       time     speed
32MB         .497     65.5        .471     69.0
64MB         1.02     63.6        .901     72.1
96MB         2.24     50.5        5.52     18.9
128MB        20.7     6.19        16.5     7.79
192MB        32.9     5.83        32.9     5.83
256MB        42.5     6.02        43.0     5.95

dillon> It's case (1) that you are manipulating with your patch, and as you can
dillon> see it is entirely dependent on the number of wired pages that the
dillon> system is able to maintain in the buffer cache.

The results of the 128MB read are likely explained by that. The 96MB read gave interesting results. Since vfs_unwirepages() passes buffer pages to vm_page_dontneed(), it seems that the page scanner reclaims buffer cache pages too aggressively.
The table below shows the results with my patch where vfs_unwirepages() does not call vm_page_dontneed().

read size    time     speed
32MB         .503     63.7
64MB         .916     70.5
96MB         4.57     27.1
128MB        17.0     7.62
192MB        35.8     5.36
256MB        46.0     5.56

The 96MB-read results were a little bit better, although the reads of larger sizes became slower. The unwired buffer pages may be putting pressure on user process pages and the page scanner.

-- Seigo Tanimura <[EMAIL PROTECTED]>

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: Dynamic growth of the buffer and buffer page reclaim
:I was going to comment on fragmentation issues, but that seems to have
:been very well covered. I would like to point out that removing the
:buffer_map not only contributes to kernel map fragmentation, but also
:contention for the kernel map. It might also prevent us from removing
:giant from the kernel map because it would add another interrupt time
:consumer.

Yes. Whatever the case, any sort of temporary KVA mapping management system would need its own submap. It would be insane to use the kernel_map or kmem_map for this.

In regards to Seigo's patch: the scalability issue is entirely related to the KVA mapping portion of the buffer cache. Only I/O *WRITE* performance is specifically limited by the size of the buffer_map, due to the limited number of dirty buffers allowed in the map. This in turn is a restriction required by filesystems, which must keep track of 'dirty' buffers in order to sequence out writes. Currently the only way around this limitation is to use mmap/MAP_NOSYNC. In other words, we support dirty VM pages that are not associated with the buffer cache, but most of the filesystem algorithms are still based around the assumption that dirty pages will be mapped into dirty buffers.

I/O *READ* caching is limited only by the VM page cache. The reason you got slightly better numbers with your patch has nothing to do with I/O performance; it is simply related to the cost of the buffer instantiations and teardowns that occur in the limited buffer_map space, mapping pages out of the VM page cache. Since you could have more buffers, there were fewer instantiations and teardowns. It's that simple.

Unfortunately, this performance gain is *DIRECTLY* tied to the number of pages wired into the buffer cache. It is precisely the wired-pages portion of the instantiation and teardown that eats the extra CPU. So the moment you regulate the number of wired pages in the system, you will blow the performance you are getting.

I can demonstrate the issue with a simple test.
Create a large file with dd, larger than physical memory:

    dd if=/dev/zero of=test bs=1m count=4096  # create a 4G file.

Then dd (read) portions of the file and observe the performance. Do this several times to get stable numbers.

    dd if=test of=/dev/null bs=1m count=16  # repeat several times
    dd if=test of=/dev/null bs=1m count=32  # etc...

You will find that read performance will drop in two significant places: (1) when the data no longer fits in the buffer cache and the buffer cache is forced to tear down wirings and rewire other pages from the VM page cache (still no physical I/O is being done), and (2) when the data no longer fits in the VM page cache and the system is forced to perform physical I/O.

It's case (1) that you are manipulating with your patch, and as you can see it is entirely dependent on the number of wired pages that the system is able to maintain in the buffer cache.

-Matt
Re: Dynamic growth of the buffer and buffer page reclaim
Hmm. Well, the real problem is not going to be the struct bio but will instead be the filesystem support. Filesystems expect KVA-mapped data from the buffer cache, and they use pointers to the data all over the place.

The buffer cache is very efficient, at least as long as filesystem block sizes are <= 16K. You can mix filesystem block sizes as long as they are <= 16K and there will be no remapping and no fragmentation, and buffer cache operation will be O(1). If you mix filesystem block sizes <= 16K and > 16K, the buffer cache will start to hit remapping and fragmentation cases (though it's really the remapping cases that hurt). It isn't a terrible problem, but it is an issue. Tor has test cases for the above issue and could probably give you more information on it.

The real performance problem is the fact that the buffer cache exists at all. I wouldn't bother fixing the remapping issue and would instead focus on getting rid of the buffer cache entirely. As I said, the issue there is filesystem block mapping support for meta-data (bitmaps, inodes), not I/O.

-Matt Matthew Dillon <[EMAIL PROTECTED]>

:On Mon, 28 Oct 2002, Seigo Tanimura wrote:
:
:> On Thu, 24 Oct 2002 15:05:30 +1000 (EST),
:> Bruce Evans <[EMAIL PROTECTED]> said:
:>
:> bde> Almost exactly what we have. It turns out to be not very good, at least
:> bde> in its current implementation, since remapping is too expensive. Things
:> bde> work OK to the extent that remapping is not required, but so would a
:> bde> much simpler implementation that uses less vm and more copying of data
:> bde> (copying seems to be faster than remapping).
:>
:> Which process is expensive in remapping? Allocation of a KVA space?
:> Page wiring? Or pmap operation?
:
:The allocation seemed to be most expensive when I looked at this about 2
:years ago. The cause of the remapping seemed to be that different amounts
:of buffer kva were allocated for different buffer sizes.
:Copying between filesystems with different block sizes therefore caused
:lots of remapping. I think this cause of remapping has been fixed.
:VM has been improved too. I'm not sure how much in this area.
:
:Bruce
Re: Dynamic growth of the buffer and buffer page reclaim
On Mon, 28 Oct 2002, Seigo Tanimura wrote:

> On Thu, 24 Oct 2002 15:05:30 +1000 (EST),
> Bruce Evans <[EMAIL PROTECTED]> said:
>
> bde> Almost exactly what we have. It turns out to be not very good, at least
> bde> in its current implementation, since remapping is too expensive. Things
> bde> work OK to the extent that remapping is not required, but so would a
> bde> much simpler implementation that uses less vm and more copying of data
> bde> (copying seems to be faster than remapping).
>
> Which process is expensive in remapping? Allocation of a KVA space?
> Page wiring? Or pmap operation?

The allocation seemed to be most expensive when I looked at this about 2 years ago. The cause of the remapping seemed to be that different amounts of buffer kva were allocated for different buffer sizes. Copying between filesystems with different block sizes therefore caused lots of remapping. I think this cause of remapping has been fixed. VM has been improved too. I'm not sure how much in this area.

Bruce
Re: Dynamic growth of the buffer and buffer page reclaim
On Thu, 24 Oct 2002 15:05:30 +1000 (EST), Bruce Evans <[EMAIL PROTECTED]> said:

bde> Almost exactly what we have. It turns out to be not very good, at least
bde> in its current implementation, since remapping is too expensive. Things
bde> work OK to the extent that remapping is not required, but so would a
bde> much simpler implementation that uses less vm and more copying of data
bde> (copying seems to be faster than remapping).

Which process is expensive in remapping? Allocation of a KVA space? Page wiring? Or pmap operation?

-- Seigo Tanimura <[EMAIL PROTECTED]>
Re: Dynamic growth of the buffer and buffer page reclaim
On Wed, 23 Oct 2002 16:51:44 -0400 (EDT), Jeff Roberson <[EMAIL PROTECTED]> said:

jroberson> I do, however, like the page unwiring idea. As long as it's not too
jroberson> expensive. I have been somewhat disappointed that the buffer cache's
jroberson> buffers are hands off for the vm. I'm confused about your approach
jroberson> though. I think that the rewire function is unnecessary. You could move
jroberson> this code into allocbuf() which would limit the number of times that you
jroberson> have to make a pass over this list and keep the maintenance of it in a
jroberson> more central place. This would also remove the need for truncating the
jroberson> buf.

I just wanted to make sure that buffers not in the clean queue look as they did without the patch. At least, if a buffer does not become busy or held, then it need not be rewired down.

-- Seigo Tanimura <[EMAIL PROTECTED]>
Re: Dynamic growth of the buffer and buffer page reclaim
On Wed, 23 Oct 2002, Julian Elischer wrote:

> Bill Jolitz had a plan for 386BSD where all the buffers were nearly
> always unmapped from KVM. He was going to have a number of slots
> available for mapping them which would be used in a lifo order
>
> The number of slots was going to be somehow tunable
> but I don't remember the details.

We essentially have this now. Most disk blocks are cached in physical pages (VMIO pages) and are disassociated from the buffer cache and not mapped into vm. Some blocks are mapped into buffers. There are a limited number of slots (nbuf). nbuf hasn't grown nearly as fast as disks or main memory, so what was once a large non-unified buffer cache (nbuf * MAXBSIZE worth of caching) is now just a small number of vm mappings (nbuf of them).

> When you wanted to access a buffer, it was mapped for you
> (unless already mapped).. It would be unmapped when its slot
> was needed for something else. When you accessed a buffer already mapped
> it would move it back to the top of the list.
> Various events could pre-unmap a buffer. e.g. the related vm object was
> closed. (0 references).

Almost exactly what we have. It turns out to be not very good, at least in its current implementation, since remapping is too expensive. Things work OK to the extent that remapping is not required, but so would a much simpler implementation that uses less vm and more copying of data (copying seems to be faster than remapping).

Bruce
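The slot scheme discussed above (a limited pool of KVA mapping slots, mapped on access, with the least recently used slot recycled when the pool runs out) can be modelled in a few lines. This is a hedged Python sketch, not the actual FreeBSD implementation; the point it illustrates is that the remap count, standing in for the expensive pmap work, explodes as soon as the set of buffers being touched exceeds the number of slots:

```python
from collections import OrderedDict

class SlotMapper:
    """Toy model of a fixed pool of buffer-mapping slots.  A buffer is
    mapped into a slot on first access; when the pool is exhausted, the
    least recently used slot is recycled (unmapped and remapped)."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.mapped = OrderedDict()    # front = least recently used slot
        self.remaps = 0

    def access(self, bufid):
        if bufid in self.mapped:
            self.mapped.move_to_end(bufid)       # already mapped: cheap
        else:
            if len(self.mapped) >= self.nslots:
                self.mapped.popitem(last=False)  # unmap the coldest buffer
            self.mapped[bufid] = True
            self.remaps += 1                     # the expensive pmap work

# Working set that fits in the slots: each buffer is mapped exactly once.
m = SlotMapper(64)
for _ in range(10):
    for b in range(64):
        m.access(b)

# Working set one larger than the slot pool, accessed cyclically: every
# access after warmup forces a remap (the classic LRU looping pathology).
w = SlotMapper(64)
for _ in range(10):
    for b in range(65):
        w.access(b)
```

With 64 slots, ten passes over 64 buffers cost 64 remaps total, while ten passes over 65 buffers cost a remap on every single access, which is why the scheme works OK only to the extent that remapping is not required.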
Re: Dynamic growth of the buffer and buffer page reclaim
In message <[EMAIL PROTECTED]>, Julian Elischer writes:

>Bill Jolitz had a plan for 386BSD where all the buffers were nearly
>always unmapped from KVM. He was going to have a number of slots
>available for mapping them which would be used in a lifo order

This entire area needs to be rethought. And by "rethought" I really mean try to redesign it from scratch to match our current needs and see what that leads to compared to the stuff we have.

One of my first TODOs after the 5.x/6.x branch is to give struct bio the ability to communicate in a vector of separate pages, not necessarily mapped. This gives us a scatter/gather ability in the entire disk I/O path. This opens up a host of possibilities for things like clustering, background writes (using copy-on-write pages) etc etc etc. Needless to say, it will also drastically change the working environment for struct buf.

-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
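A rough sketch of what such a vector-of-pages bio might look like, in Python rather than the eventual struct bio C layout (the names and layout here are invented for illustration): each segment names a physical page plus an offset and length, no KVA mapping is required, and the clustering possibility mentioned above reduces to coalescing physically contiguous segments.

```python
PAGE_SIZE = 4096

class Bio:
    """Toy model of an I/O request carrying a vector of
    (page number, offset, length) segments instead of one mapped
    KVA range.  The pages need not be mapped anywhere."""
    def __init__(self, segments=None):
        self.segments = list(segments or [])

def coalesce(bio):
    """Clustering primitive: merge segments that are physically
    contiguous, so the driver sees the fewest possible chunks."""
    out = []
    for page, off, length in sorted(bio.segments):
        if out:
            p, o, l = out[-1]
            # end address of previous segment == start of this one?
            if p * PAGE_SIZE + o + l == page * PAGE_SIZE + off:
                out[-1] = (p, o, l + length)   # extend previous segment
                continue
        out.append((page, off, length))
    return Bio(out)

# Three physically contiguous pages plus one outlier collapse into
# two segments.
bio = Bio([(11, 0, 4096), (10, 0, 4096), (12, 0, 4096), (20, 0, 4096)])
clustered = coalesce(bio)
```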
Re: Dynamic growth of the buffer and buffer page reclaim
On Wed, 23 Oct 2002, Jeff Roberson wrote:

> I do, however, like the page unwiring idea. As long as it's not too
> expensive. I have been somewhat disappointed that the buffer cache's
> buffers are hands off for the vm. I'm confused about your approach
> though. I think that the rewire function is unnecessary. You could move
> this code into allocbuf() which would limit the number of times that you
> have to make a pass over this list and keep the maintenance of it in a
> more central place. This would also remove the need for truncating the
> buf.

Bill Jolitz had a plan for 386BSD where all the buffers were nearly always unmapped from KVM. He was going to have a number of slots available for mapping them which would be used in a lifo order.

The number of slots was going to be somehow tunable but I don't remember the details.

When you wanted to access a buffer, it was mapped for you (unless already mapped). It would be unmapped when its slot was needed for something else. When you accessed a buffer already mapped it would move it back to the top of the list. Various events could pre-unmap a buffer, e.g. the related vm object was closed (0 references).
Re: Dynamic growth of the buffer and buffer page reclaim
On Wed, 23 Oct 2002, Seigo Tanimura wrote:

> On Wed, 23 Oct 2002 16:44:06 +1000 (EST),
> Bruce Evans <[EMAIL PROTECTED]> said:
>
> Incidentally, Solaris 7 on sun4u reserves a space of 256MB in the KVM
> according to Solaris Internals. On i386 (x86), the size is only 4MB.
> Not sure whether they use those spaces in a pure form, or they cluster
> some consecutive pages (which leads to fragmentation), though...
>
> NetBSD UBC also makes a map dedicated to buffers in kernel_map.
>
> Maybe there is a point to have a map dedicated to the buffer space for
> a better stability, and the size of the buffer map could be much
> smaller than now. During my testing, I found that only up to 6-7MB of
> the buffers out of 40-50MB were wired down (ie busy, locked for
> background write or dirty) at most.

I was going to comment on fragmentation issues, but that seems to have been very well covered. I would like to point out that removing the buffer_map not only contributes to kernel map fragmentation, but also contention for the kernel map. It might also prevent us from removing giant from the kernel map because it would add another interrupt time consumer.

I do, however, like the page unwiring idea. As long as it's not too expensive. I have been somewhat disappointed that the buffer cache's buffers are hands off for the vm. I'm confused about your approach though. I think that the rewire function is unnecessary. You could move this code into allocbuf(), which would limit the number of times that you have to make a pass over this list and keep the maintenance of it in a more central place. This would also remove the need for truncating the buf.

I have some other ideas for the buffer cache that you may be interested in. I have been discussing them in private for some time, but I'll bring it up on arch soon so that others can comment.

Cheers, Jeff
Re: Dynamic growth of the buffer and buffer page reclaim
On Wed, 23 Oct 2002 16:44:06 +1000 (EST), Bruce Evans <[EMAIL PROTECTED]> said: bde> I should be the last to defend the current design and implementation of bde> the buffer cache, since I think it gets almost everything wrong (the bde> implementation is OK, but has vast complications to work around design bde> errors), but I think buffer_map is one of the things that it gets right bde> (if we're going to have buffers at all). (snip) bde> I use the following changes in -current to enlarge the buffer cache and bde> avoid fragmentation. These only work because I don't have much physical bde> memory (512MB max). Even i386's have enough vm for the pure form of bde> buffer_map to work: bde> - enlarge BKVASIZE to MAXBSIZE so that fragmentation can not (should not?) bde> occur. bde> - enlarge nbuf by a factor of (my_BKVASIZE / current_BKVASIZE) to work bde> around bugs. The point of BKVASIZE got lost somewhere. bde> - enlarge nbuf and associated variables by another factor of 2 or 4 to bde> get a larger buffer cache. bde> This is marginal for 512MB physical, and probably wouldn't work if I had bde> a lot of mbufs. nbuf is about 4000 and buffer_map takes about 256MB. bde> 256MB is a lot, but nbuf = 4000 isn't a lot. I used buffer caches bde> with 2000 * 1K buffers under Minix and Linux before FreeBSD, and ISTR bde> having an nbuf of 5000 or so in FreeBSD-1.1. At least 2880 buffers are bde> needed to properly cache a tiny 1.44MB floppy with an msdosfs file bde> system with a block size of 512, and that was an important test case. bde> End of FreeBSD-[2-5] history. Incidentally, Solaris 7 on sun4u reserves a space of 256MB in the KVM according to Solaris Internals. On i386 (x86), the size is only 4MB. Not sure whether they use those spaces in a pure form, or they cluster some consecutive pages (which leads to fragmentation), though... NetBSD UBC also makes a map dedicated to buffers in kernel_map. 
Maybe there is a point in having a map dedicated to the buffer space for better stability, and the size of the buffer map could be much smaller than now. During my testing, I found that only up to 6-7MB of the buffers out of 40-50MB were wired down (i.e. busy, locked for background write, or dirty) at most.

-- Seigo Tanimura <[EMAIL PROTECTED]>
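For concreteness, the buffer_map sizing bde quotes earlier in the thread can be checked with back-of-the-envelope arithmetic. The constants below are the classic i386 values (MAXBSIZE = 64K, default BKVASIZE = 16K); treat them as illustrative assumptions rather than authoritative for every branch:

```python
MAXBSIZE = 64 * 1024        # classic i386 maximum buffer size
BKVASIZE = 16 * 1024        # historical default KVA per buffer slot

def buffer_map_kva(nbuf, bkvasize):
    """KVA reserved for the buffer map: nbuf slots of bkvasize bytes."""
    return nbuf * bkvasize

# Default layout: roughly 4000 buffers at 16K of KVA each, about 62.5 MB.
default_kva = buffer_map_kva(4000, BKVASIZE)

# bde's anti-fragmentation setup: BKVASIZE raised to MAXBSIZE, so the
# same 4000 buffers reserve about 250 MB of KVA ("about 256MB").
bde_kva = buffer_map_kva(4000, MAXBSIZE)
```

This also makes the quoted complaint concrete: raising BKVASIZE to MAXBSIZE eliminates fragmentation by construction (every slot can hold the largest buffer), but multiplies the KVA reservation by four, which is only tolerable when kernel vm is plentiful relative to physical memory.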
Re: Dynamic growth of the buffer and buffer page reclaim
On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
>
> The kernel's I/O buffers are currently allocated from buffer_map, which
> is sized statically upon boot and never grows. This limits the scale of
> I/O performance on a host with large physical memory. We used to tune
> NBUF to cope with that problem. This workaround, however, results in
> a lot of wired pages not available for user processes, which is not
> acceptable for memory-bound applications.
>
> In order to run both I/O-bound and memory-bound processes on the same
> host, it is essential to achieve:
>
> A) allocation of buffer from kernel_map to break the limit of a map
>    size, and
>
> B) page reclaim from idle buffers to regulate the number of wired
>    pages.
>
> The patch at:
>
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

I should be the last to defend the current design and implementation of the buffer cache, since I think it gets almost everything wrong (the implementation is OK, but has vast complications to work around design errors), but I think buffer_map is one of the things that it gets right (if we're going to have buffers at all).

Some history of this problem:

FreeBSD-1: Allocating from kernel_map instead of buffer_map would almost take us back to FreeBSD-1, where buffers were allocated from kmem_map using malloc(). This caused larger problems with fragmentation. Some of these were due to foot-shooting, but I think large-memory machines give essentially the same problems, and complete fragmentation of kernel_map would cause more problems than complete fragmentation of any other map. Part of the foot-shooting was to allocate too little vm to the kernel and correspondingly too little vm to kmem_map. The (i386) kernel was originally at 0xFE00, so there was only 32MB of kernel vm. 32MB was far too small even for the relatively small physical memories at the time (1992 or 1993), so this was changed to 0xF000 in FreeBSD-1.1.5. Then there was 256MB of kernel vm.
I suspect that this increase reduced the fragmentation problems to insignificance in most but not all cases. Some of the interesting cases at the time of FreeBSD-1 were:

- machines with a small amount of physical memory. These should have few problems, since there is not enough physical memory to make the maps more than sparse (unless the maps are undersized).

- machines with a not so small amount of physical memory. It's possible that the too-small-in-general value for nbuf limits problems.

- machines which only use one type of filesystem with one (small?) block size. If all allocations have the same size, then there need be no fragmentation. I'm not sure how strong this effect was in FreeBSD-1. malloc() used a power-of-2 algorithm, but only up to a certain size which covered 4K blocks but possibly not 8K blocks. Note that machines with large amounts of memory were likely to be specialized machines, so were likely to take advantage of this without really trying, just by not mounting or not significantly using unusual filesystems like msdosfs, ext2fs and cd9660.

I used the following allocation policies in my version of FreeBSD-1.1.5:

- enlarge nbuf and the limit on buffer space (freebufspace) by a factor of 2 or 4 to get a larger buffer cache

- enlarge nbuf by another factor of 8, but don't enlarge freebufspace, so that buffers of size 512 can hold as much as buffers of size 4096. I didn't care about buffers of size 8192 or larger at the time.

- actually enforce the freebufspace limit by discarding buffers in allocbuf() using a simplistic algorithm.

This worked well enough, but I only tested it on 486's with 8-16MB. The buffer cache had size 2MB or so.

End of FreeBSD-1 history.

FreeBSD-[2-5]: Use of buffer_map was somehow implemented at the beginning in rev.1.2 of vfs_bio.c, although this wasn't in FreeBSD-1.1.5. Either I'm missing some history or it was only in dyson's tree for FreeBSD-1.
Rev.1.2 used buffer_map in its purest form: each of nbuf buffers has a data buffer consisting of MAXBSIZE bytes of vm attached to it at bufinit() time. The allocation never changes and we simply map physical pages into the vm when we have actual data. The problems with this are that MAXBSIZE is rather large and nbuf should be rather large (and/or dynamic). Subsequent changes add vast complications to reduce the amount of vm. I think these complications should only exist on machines with limited amounts of vm (mainly i386's).

One of the complications was to reintroduce fragmentation problems. buffer_map only has enough space for nbuf buffers of size BKVASIZE, and the mappings are not statically allocated. Another of the complications is to discard buffers to reduce the fragmentation problems. Perhaps similar defragmentation would have worked well enough in FreeBSD-1.1. I suspect that your change depends on this defragmentation, but I don't think the defragmentation can work as well, since it can only touch
Re: Dynamic growth of the buffer and buffer page reclaim
On Tue, 22 Oct 2002, Seigo Tanimura wrote:

> Introduction:
> [...]
>
> The patch at:
>
> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

Cool..

> -j    baseline                       w/ my patch
>       real      user      sys        real      user      sys
> 1     1608.21   1387.94   125.96     1577.88   1391.02   100.90
> 10    1576.10   1360.17   132.76     1531.79   1347.30   103.60
> 20    1568.01   1280.89   133.22     1509.36   1276.75   104.69
> 30    1923.42   1215.00   155.50     1865.13   1219.07   113.43

definitely statistically significant.

> Another interesting results are the numbers of swaps, shown below.
>
> -j    baseline    w/ my patch
> 1     0           0
> 10    0           0
> 20    141         77
> 30    530         465

this too.

> Comments and flames are welcome. Thanks a lot.

No flames.. Julian
Re: Dynamic growth of the buffer and buffer page reclaim
In message <[EMAIL PROTECTED]>, Seigo Tanimura writes:

>The patch at:
>
>http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz
>
>Comments and flames are welcome. Thanks a lot.

This looks very very interesting!

-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 [EMAIL PROTECTED] | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
Dynamic growth of the buffer and buffer page reclaim
Introduction:

The kernel's I/O buffers are currently allocated from buffer_map, which is sized statically upon boot and never grows. This limits the scale of I/O performance on a host with large physical memory. We used to tune NBUF to cope with that problem. This workaround, however, results in a lot of wired pages not available for user processes, which is not acceptable for memory-bound applications.

In order to run both I/O-bound and memory-bound processes on the same host, it is essential to achieve:

A) allocation of buffer from kernel_map to break the limit of a map size, and

B) page reclaim from idle buffers to regulate the number of wired pages.

The patch at:

http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz

implements buffer allocation from kernel_map and reclaim of buffer pages. With this patch, make kernel-depend && make kernel completes about 30-60 seconds faster on my PC.

Implementation in Detail:

A) is easy; first you need to do s/buffer_map/kernel_map/. Since an arbitrary number of buffer pages can be allocated dynamically, buffer headers (struct buf) should be allocated dynamically as well. Glue them together into a list so that they can be traversed by boot() et al.

In order to accomplish B), we must find buffers that neither the filesystem nor the I/O code will touch. The clean buffer queue holds such buffers. (Exception: if the vnode associated with a clean buffer is held by the namecache, it may access the buffer page.) Thus, we should unwire the pages of a buffer prior to enqueuing it to the clean queue, and rewire the pages down in bremfree() if the pages are not reclaimed.

Although unwiring gives a page a chance of being reclaimed, we can go further. In Solaris, it is known that file cache pages should be reclaimed prior to the other kinds of pages (anonymous, executable, etc.) for better performance.
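The unwire-on-enqueue / rewire-on-dequeue scheme described above can be sketched as a toy model. This is hedged illustration in Python with invented names; the real code lives in vfs_bio.c, and the real page scanner is demand-driven rather than reclaiming everything unwired:

```python
from collections import deque

class Page:
    def __init__(self):
        self.wire_count = 1        # wired while the buffer is in use
        self.reclaimed = False

class Buf:
    def __init__(self, npages=4):
        self.pages = [Page() for _ in range(npages)]

def enqueue_clean(buf, clean_queue):
    """Unwire the buffer's pages before putting it on the clean queue,
    giving the page scanner a chance to reclaim them."""
    for p in buf.pages:
        p.wire_count -= 1
    clean_queue.append(buf)

def page_scanner(bufs):
    """Toy scanner: reclaim every unwired page.  (The real scanner only
    runs under memory pressure and uses LRU-ish page queues.)"""
    reclaimed = 0
    for b in bufs:
        for p in b.pages:
            if p.wire_count == 0 and not p.reclaimed:
                p.reclaimed = True
                reclaimed += 1
    return reclaimed

def bremfree(clean_queue):
    """Take a buffer back off the clean queue, rewiring the pages that
    survived; reclaimed pages would need fresh allocation and I/O."""
    buf = clean_queue.popleft()
    lost = 0
    for p in buf.pages:
        if p.reclaimed:
            lost += 1
        else:
            p.wire_count += 1
    return buf, lost

clean = deque()
busy, idle = Buf(), Buf()
enqueue_clean(idle, clean)      # idle buffer: its pages become unwired
n = page_scanner([busy, idle])  # only the unwired pages are reclaimable
buf, lost = bremfree(clean)     # reusing the buffer finds its pages gone
```

The busy buffer's wired pages are untouchable by the scanner, which is exactly the invariant the clean queue is meant to preserve: only buffers that neither the filesystem nor the I/O code will touch expose their pages to reclaim.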
Mainly due to a lack of time to work on distinguishing the kind of a page to be unwired, I simply pass all unwired pages to vm_page_dontneed(). This approach places most of the unwired buffer pages at just one step from the cache queue.

Experimental Evaluation and Results:

The times taken to complete make kernel-depend && make kernel just after booting into single-user mode have been measured on my ThinkPad 600E (CPU: Pentium II 366MHz, RAM: 160MB) by time(1). The number passed to the -j option of make(1) has been varied from 1 to 30 in order to control the pressure of the memory demand for user processes. The baseline is the kernel without my patch. The following table shows the results. All of the times are in seconds.

-j    baseline                       w/ my patch
      real      user      sys        real      user      sys
1     1608.21   1387.94   125.96     1577.88   1391.02   100.90
10    1576.10   1360.17   132.76     1531.79   1347.30   103.60
20    1568.01   1280.89   133.22     1509.36   1276.75   104.69
30    1923.42   1215.00   155.50     1865.13   1219.07   113.43

Most of the improvements in the real times are accomplished by the speedup of system calls. The hit ratio of getblk() may be increased, but that has not been examined yet.

Other interesting results are the numbers of swaps, shown below.

-j    baseline    w/ my patch
1     0           0
10    0           0
20    141         77
30    530         465

Since the baseline kernel does not free buffer pages at all(*), it may be putting too much pressure on the pages.

(*) bfreekva() is called only when the whole KVA is too fragmented.

Userland Interfaces:

The sysctl variable vfs.bufspace now reports the size of the pages allocated for buffer, both wired and unwired. A new sysctl variable, vfs.bufwiredspace, tells the size of the buffer pages wired down. vfs.bufkvaspace returns the size of the KVA space for buffer.

Future Works:

The handling of unwired pages can be improved by scanning only buffer pages. In that case, we may have to run the vm page scanner more frequently, as does Solaris.

vfs.bufspace does not track the buffer pages reclaimed by the page scanner.
They are counted when the buffers associated with those pages are removed from the clean queue, which is too late.

Benchmark tools concentrating on disk I/O performance (bonnie, iozone, postmark, etc.) may be more suitable than make kernel for evaluation.

Comments and flames are welcome. Thanks a lot.

-- Seigo Tanimura <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>