Re: vm_pageout_scan badness
:Matt Dillon wrote:
:>
:>     You may be able to achieve an effect very similar to mlock(), but
:>     runnable by the 'news' user without hacking the kernel, by
:>     writing a quick little C program to mmap() the two smaller history
:>     files and then madvise() the map using MADV_WILLNEED in a loop
:>     with a sleep(15).  Keeping in mind that expire may recreate those
:>     files, the program should unmap, close(), and re-open()/mmap/madvise
:>     the descriptors every so often (like once a minute).  You shouldn't
:>     have to access the underlying pages but that would also have a
:>     similar effect.  If you do, use a volatile pointer so GCC doesn't
:>     optimize the access out of the loop.  e.g.
:
:Err... wouldn't it be better to write a quick little C program that
:mlocked the files? It would need suid, sure, but as a small program
:without user input it wouldn't have security problems.
:
:--
:Daniel C. Sobral (8-DCS)
:[EMAIL PROTECTED]
:[EMAIL PROTECTED]

    mlock()ing is dangerous when used on a cyclic file.  If you aren't
    careful you can run your system out of memory.

						-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message
Re: vm_pageout_scan badness
Matt Dillon wrote:
>
>     You may be able to achieve an effect very similar to mlock(), but
>     runnable by the 'news' user without hacking the kernel, by
>     writing a quick little C program to mmap() the two smaller history
>     files and then madvise() the map using MADV_WILLNEED in a loop
>     with a sleep(15).  Keeping in mind that expire may recreate those
>     files, the program should unmap, close(), and re-open()/mmap/madvise
>     the descriptors every so often (like once a minute).  You shouldn't
>     have to access the underlying pages but that would also have a
>     similar effect.  If you do, use a volatile pointer so GCC doesn't
>     optimize the access out of the loop.  e.g.

Err... wouldn't it be better to write a quick little C program that
mlocked the files? It would need suid, sure, but as a small program
without user input it wouldn't have security problems.

--
Daniel C. Sobral (8-DCS)
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]

"The bronze landed last, which canceled that method of impartial choice."
Re: vm_pageout_scan badness
Matt Dillon wrote:
>
>     One possible fix would be to have the kernel track cache hits and
>     misses on a file and implement a heuristic from those statistics
>     which is used to reduce the 'initial page weighting' for pages
>     read-in from the 'generally uncacheable file'.  This would cause
>     the kernel to reuse those cache pages more quickly and prevent it
>     from throwing away (reusing) cache pages associated with more
>     cacheable files like the .index and .hash files.  I don't have
>     time to do this now, but it's definitely something I am going to
>     keep in mind for a later release.

That sounds very, very clever. In fact, it sounds so clever I keep
wondering what is the huge flaw with it. :-)

Still, promising, to say the least.

--
Daniel C. Sobral (8-DCS)
[EMAIL PROTECTED]
[EMAIL PROTECTED]
[EMAIL PROTECTED]

"The bronze landed last, which canceled that method of impartial choice."
Re: vm_pageout_scan badness
    Excellent.  What I believe is going on is that without the
    madvise()/mlock() the general accesses to the 1 GB main history file
    are causing the pages to be flushed from the .hash and .index files
    too quickly.  The performance problems in general appear to be due to
    the system trying to cache more of the (essentially uncacheable) main
    history file at the expense of not caching as much of the (eminently
    cacheable) .index and .hash files.

    One possible fix would be to have the kernel track cache hits and
    misses on a file and implement a heuristic from those statistics
    which is used to reduce the 'initial page weighting' for pages
    read-in from the 'generally uncacheable file'.  This would cause
    the kernel to reuse those cache pages more quickly and prevent it
    from throwing away (reusing) cache pages associated with more
    cacheable files like the .index and .hash files.  I don't have time
    to do this now, but it's definitely something I am going to keep in
    mind for a later release.

						-Matt
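The heuristic Matt proposes could be sketched in userland C as below.  This
is purely illustrative, not kernel code: the struct, the names, the
ACT_INIT default, and the weighting formula are all invented here.  The idea
is only that a file whose pages rarely get re-used while cached should have
its freshly read-in pages queued for reuse sooner.

```c
#include <assert.h>

/* Per-file cache statistics, as the kernel might track them. */
struct file_cache_stats {
    unsigned long hits;     /* page re-uses while still cached  */
    unsigned long misses;   /* pages that had to be read again  */
};

/* Default initial activation count for a freshly read-in page. */
#define ACT_INIT 5

/* Reduce the initial page weighting for files that historically miss.
 * The formula is made up for this sketch: scale ACT_INIT by the
 * observed hit ratio, with a floor of 1 so pages get at least one
 * trip around the queues. */
static int initial_act_count(const struct file_cache_stats *st)
{
    unsigned long total = st->hits + st->misses;

    if (total == 0)
        return ACT_INIT;              /* no history yet: default weighting */
    return 1 + (int)(ACT_INIT * st->hits / total);
}
```

With this, pages of a mostly-uncacheable file (lots of misses) start life
with a low count and are reclaimed early, leaving room for the .index and
.hash pages.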
Re: vm_pageout_scan badness
> :The mlock man page refers to some system limit on wired pages; I get no
> :error when mlock()'ing the hash file, and I'm reasonably sure I tweaked
> :the INN source to treat both files identically (and on the other machines
> :I have running, the timestamps of both files remains pretty much unchanged).
> :I'm not sure why I'm not seeing the desired results here with both files
>
>     I think you are on to something here.  It's got to be mlock().  Run
>     'limit' from csh/tcsh and you will see a 'memorylocked' resource.
>     Whatever this resource is as of when innd is run -- presumably however
>     it is initialized for the 'news' user (see /etc/login.conf) is going

Yep, `unlimited'... same as the bash `ulimit -a'.  OH NO.  I HAVE IT SET
TO `infinity' IN LOGIN DOT CONF, no wonder it is all b0rken-like.

The weird thing is that mlock() does return success, the amount of wired
memory matches the two files, and I've seen nothing obvious in the source
code as to why it's different, but I'll keep plugging away at it.

>     History files are notorious for random I/O... the problem is due
>     to the hash table being, well, a hash table.  The hash table
>     lookups are bad enough but this will also result in random-like
>     lookups on the main history file.  You get a little better
>     locality of reference on the main history file (meaning the system

Ah, but ...  This is how the recent history format (based on MD5 hashes)
introduced as dbz v6 at the time you were busy with Diablo and your
history mechanism there differs from that which you remember -- AI,
speaking of your 64-bit CRC history mechanism, whatever happened to the
links that would get you there from the backplane homepage... -- in this
case, you don't do the random-like lookups to verify message ID presence
in the text file at all.  Everything you do is in the data in the two
hash tables.  At least for transit.

I'm not sure if the reader requests do require a hit on the main file --
it'd be worth it to point a Diablo frontend at such a box to see how it
does there even when the overview performance for traditional readership
is, uh, suboptimal.  I think it does but that's a trivial seek to one
specific known offset.

I'm sure this is applicable to other databases somehow, for those who
aren't doing news and are bored stiff by this.

>     At the moment madvise() MADV_WILLNEED does nothing more than activate
>     the pages in question and force them into the process's mmap.
>     You have to call it every so often to keep the pages 'fresh'... calling
>     it once isn't going to do anything.

Well, it definitely does do a Good Thing when I call it once, as you can
see from the initial timer numbers that approach the long-running values
I'm used to (that I tried to simulate by doing lookups on a small
fraction of history entries, in hope of activating a majority of the
needed pages; that wasn't perfect but was a decent hack).

You can see from the timestamps of the debugging here that while it
slows down the startup somewhat, the work of reading in the data happens
quickly and is a definite positive tradeoff:

Dec  6 07:32:14 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 07:32:14 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:14 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:14 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:27 crotchety innd: dbz mlock ok
Dec  6 07:32:27 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 07:32:27 crotchety innd: dbz madvise WILLNEED ok
Dec  6 07:32:27 crotchety innd: dbz madvise RANDOM ok
Dec  6 07:32:27 crotchety innd: dbz madvise NOSYNC ok
Dec  6 07:32:38 crotchety innd: dbz mlock ok

This happens quickly when the data is still in cache, leading me to
believe it's something else affecting the .hash file (I added the
madvise() MADV_NOSYNC call just in case somehow it wasn't happening in
the mmap() for some reason):

Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.index
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok
Dec  6 09:29:34 crotchety innd: dbz openhashtable /news/db/history.hash
Dec  6 09:29:34 crotchety innd: dbz madvise WILLNEED ok
Dec  6 09:29:34 crotchety innd: dbz madvise RANDOM ok
Dec  6 09:29:34 crotchety innd: dbz madvise NOSYNC ok
Dec  6 09:29:34 crotchety innd: dbz mlock ok

>     You may be able to achieve an effect very similar to mlock(), but
>     runnable by the 'news' user without hacking the kernel, by

Yeah, sounds like a hack, but I figured out what was going on earlier
with my mlock() hack -- INN and the reader daemon now use a dynamically
linked library, so the nnrpd processes also were trying to mlock() the
files too.  Hmmm.  Either I can statically compile INN (which I chose to
do) or I can further butcher the source by attempting to
Re: vm_pageout_scan badness
:To recap, the difference here is that by cheating, I was able to mlock
:one of the two files (the behaviour I was hoping to be able to achieve
:through first MAP_NOSYNC alone, then in combination with MADV_WILLNEED
:to keep all the pages in memory so much as possible) and achieve a much
:improved level of performance -- I'm able to catch up on backlogs from
:a full feed that had built up during the time I wasn't cheating -- by
:using memory for the history database files rather than for general
:filesystem caching.  I even have spare capacity!  Woo.
:
:The mlock man page refers to some system limit on wired pages; I get no
:error when mlock()'ing the hash file, and I'm reasonably sure I tweaked
:the INN source to treat both files identically (and on the other machines
:I have running, the timestamps of both files remains pretty much unchanged).
:I'm not sure why I'm not seeing the desired results here with both files
:(maybe some call hidden somewhere I haven't located yet), but I hope you
:can see the improvements so far.  I even let abusive readers pound on
:me.  Well, for a while 'til I got tired of 'em.

    I think you are on to something here.  It's got to be mlock().  Run
    'limit' from csh/tcsh and you will see a 'memorylocked' resource.
    Whatever this resource is as of when innd is run -- presumably however
    it is initialized for the 'news' user (see /etc/login.conf) -- is going
    to affect mlock() operation.  mlock() will wire pages.  I think you
    can safely call it on your two smaller history files (history.hash,
    history.index).  I can definitely see how this could result in better
    performance.

:I still don't know for certain if the disk updates I am seeing are
:slow because they aren't sorted well, or if they're random pages and
:not a sequential set, given that I hope I've ruled out fragmentation
:of the database files.  I still maintain that in the case of a true
:MADV_RANDOM madvise'd file, any attempts to clean out `unused' pages
:are ill-advised, or if they're needed, anything other than freeing of
:sequential pages results in excess disk activity that gains nothing,
:if it's the case that this is not how it's done, due to the nature
:of random access.

    History files are notorious for random I/O... the problem is due
    to the hash table being, well, a hash table.  The hash table
    lookups are bad enough but this will also result in random-like
    lookups on the main history file.  You get a little better
    locality of reference on the main history file (meaning the system
    can do a better job caching it optimally), but the hash tables are
    a lost cause, so mlock()ing them could be a very good thing.

:Yeah, hacking the vm source to allow me to mlock() isn't kosher, but
:I wanted to test a theory.  Doing so probably requires a few more
:tweaks in the INN source to handle expiry, so it seems, so I'd rather
:the vm subsystem do this for me automagically with the right invocation
:of the suitable mmap/madvise operations, if this is reasonable.

    At the moment madvise() MADV_WILLNEED does nothing more than activate
    the pages in question and force them into the process's mmap.
    You have to call it every so often to keep the pages 'fresh'... calling
    it once isn't going to do anything.  When you call madvise()
    MADV_WILLNEED the system has to go through a number of steps before
    the pages will be thrown away:

	- it has to remove them from the process pmap
	- it has to deactivate them
	- it has to cache them
	- then it can free them

    You may be able to achieve an effect very similar to mlock(), but
    runnable by the 'news' user without hacking the kernel, by
    writing a quick little C program to mmap() the two smaller history
    files and then madvise() the map using MADV_WILLNEED in a loop
    with a sleep(15).  Keeping in mind that expire may recreate those
    files, the program should unmap, close(), and re-open()/mmap/madvise
    the descriptors every so often (like once a minute).  You shouldn't
    have to access the underlying pages but that would also have a
    similar effect.  If you do, use a volatile pointer so GCC doesn't
    optimize the access out of the loop.  e.g.

	for (ptr = mapBase; ptr < mapEnd; ptr += pageSize) {
		volatile char c = *ptr;
	}

    or

	for (ptr = mapBase; ptr < mapEnd; ptr += pageSize) {
		dummyroutine(*ptr);
	}

    And my earlier suggestion above would look something like:

	for (;;) {
		open descriptor
		map
		for (i = 0; i < 15; ++i) {
			madvise(mapBase, mapSize, MADV_WILLNEED);
			sleep(15);
		}
		munmap
		close descriptor
	}

						-Matt
Re: vm_pageout_scan badness
    I wouldn't worry about madvise() too much.  4.2 has a really good
    heuristic that figures it out for the most part.

    (still reading the rest of your postings)

						-Matt
Re: vm_pageout_scan badness
Howdy,

I'm going to breach all sorts of ethics in the worst way by following up
to my own message, just to throw out some new info... 'kay?

Matt wrote, and I quote --

: > However, I noticed something interesting!

Of course I clipped away the interesting Thing, but note the following
that I saw...

: INN after adding the memory, I did a `cp -p' on both the history.hash
: and history.index files, just to start fresh and clean.  It didn't seem
[...]
: > There is an easy way to test file fragmentation.  Kill off everything
: > and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for
: > history.hash and history.index.  Look at the iostat on the history
: > drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
: > per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
: > is severely fragmented.  Go through the entire history file(s) w/ dd...
:
: Okay, I'm doing this:  The two hash-type files give me between 9 and
: 10K/t; the history text file gives me more like 60KB/t.  Hmmm.  It's

Now, remember what Matt wrote, that partially-cached data played havoc
with read-ahead.  That is apparently what I was seeing here, pulling some
bit of data off the disk proper, but then pulling a chunk of data that
was cached, and so on.  I figured that out as I attempted to copy one of
the files to create an unfragmented copy to test transfer size and saw
the expected 64K (well DUH, that was the write size), and then attempted
to `dd' these to /dev/null and saw ... no disk activity.  The file was
in cache.  Bummer.

Oh well, I had to reboot anyway for some reason, and did so.  Immediately
after reboot I `dd'ed the two database files and got the expected 64K/t
of an unfragmented file.  I also made copies of them just to push their
contents into memory, because...

: The actual history lookups and updates that matter are all done within
: the memory taken up by the .index and .hash files.  So, by keeping
: them in memory, one doesn't need to do any disk activity at all for
: lookups, and updates, well, so long as you commit them to the disk at
: shutdown, all should be okay.  That's what I'm attempting to achieve.
: These lookups and updates are bleedin' expensive when disk activity
: rears its ugly head.
:
: Not to worry, I'm going to keep plugging to see if there is a way for
: me to lock these two files into memory so that they *stay* there, just
: to prove whether or not that's a significant performance improvement.
: I may have to break something, but hey...

I b0rked something.  I `fixed' the mlock operation to allow a lowly user
such as myself to use it, just as proof of concept.  (I still need to do
a bit of tuning, I can see, but hey, I got results)

So I attempt to pass all the madvise suggestions I can for both the
history.index and .hash files, and then I attempt to mlock both of them.
I don't get a failure, although the history.hash file (108MB) doesn't
quite achieve the desired results -- I do see Good Things with the
smaller history.index (72MB, and don't remind me that 1MB really isn't
100bytes).

Anyway, the number of `Wired' Megs in `top' is up from 71MB to 200+, and
after some hours of operation, look at the timestamps of the two database
files (the .n.* files are those I copied after reboot, and serve as a
nice reference for when I started things):

-rw-rw-r--  1 news  news  755280213 Dec  5 19:05 history
-rw-rw-r--  1 news  news         57 Dec  5 19:05 history.dir
-rw-rw-r--  1 news  news      10800 Dec  5 19:05 history.hash
-rw-rw-r--  1 news  news       7200 Dec  5 08:44 history.index
-rw-rw-r--  1 news  news      10800 Dec  5 08:43 history.n.hash
-rw-rw-r--  1 news  news       7200 Dec  5 08:44 history.n.index

So, okay, history.hash still sees disk activity, but look at a handful
of INN timer stats following the boot.

The last two stats with the default vm k0deZ before restart:

Dec  5 08:30:40 crotchety innd: ME time 301532 idle 28002(120753)
artwrite 70033(2853) artlink 0(0) hiswrite 49396(3097) hissync 28(6)
sitesend 460(5706) artctrl 296(25) artcncl 295(25) hishave 32016(8923)
hisgrep 45(10) artclean 20816(3150) perl 12536(3082) overv 29927(2853)
python 0(0) ncread 33729(152735) ncproc 227796(152735)

80 seconds of 300 spent on history activity... urk... on a steady-state
system with a few readers that had been running for some hours.

Dec  5 08:35:37 crotchety innd: ME time 300052 idle 16425(136209)
artwrite 77811(2726) artlink 0(0) hiswrite 35676(2941) hissync 28(6)
sitesend 571(5450) artctrl 454(41) artcncl 451(41) hishave 33311(7392)
hisgrep 55(14) artclean 22778(3000) perl 14137(2914) overv 28516(2726)
python 0(0) ncread 38832(172145) ncproc 226513(172145)

[REB00T]

Dec  5 08:59:32 crotchety innd: ME time 300059 idle 62840(189385)
artwrite 68361(5580) artlink 0(0) hiswrite 8782(6567) hissync 104(12
Re: vm_pageout_scan badness
> ok, since I got about 6 requests in four hours to be Cc'd, I'm
> throwing this back onto the list.  Sorry for the double-response that
> some people are going to get!

Ah, good, since I've been deliberately avoiding reading mail in an
attempt to get something useful done in my last days in the country, and
probably wouldn't get around to reading it until I'm without Net access
in a couple weeks...

(Also, because your mailer seems to be ignoring the `Reply-To:' header
I've been using, but I'd get a copy through the cc: list, in case you
puzzled over why your previous messages bounced)

> I am going to include some additional thoughts in the front, then break
> to my originally private email response.

I'll mention that I've discovered the miracle of man pages, and found
the interesting `madvise' capability of `MADV_WILLNEED' that, from the
description, looks very promising.  Pity the results I'm seeing still
don't match my expectations.

Also, in case the amount of system memory on this machine might be
insufficient to do what I want with the size of the history.hash/.index
files, I've just gotten an upgrade to a full gig.  Unfortunately, now
performance is worse than it had been, so it looks like I'll be
butchering the k0deZ to see if I can get my way.

Now, for `madvise' -- this is already used in the INN source in
lib/dbz.c (where one would add MAP_NOSYNC to the MAP__FLAGS) as
MADV_RANDOM -- this matches the random access pattern of the history
hash table.  Supposedly, MADV_WILLNEED will tell the system to avoid
freeing these pages, which looks to be my holy grail of this week, plus
the immediate mapping that certainly can't hurt.

There's only a single madvise call in the INN source, but I see that the
Diablo code does make two calls to it (although both WILLNEED and,
unlike INN, SEQUENTIAL access -- this could be part of the cause of the
apparent misunderstanding of the INN history file that I see below).
Since it looks to my non-programmer eyes like I can't combine the
behaviours in a single call, I followed Diablo's example to specify both
RANDOM and the WILLNEED that I thought would improve things.

The machine is, of course, as you can see from the timings, not
optimized at all, since I've just thrown something together as a proof
of concept having run into a brick wall with the codes under test with
Slowaris.  And because a departmental edict has come down that I must
migrate all services off Free/NetBSD and onto Slowaris, I can't expect
to get the needed hardware to beef up the system -- even though the
MAP_NOSYNC option on the transit machine enabled it to whup the pants
off a far more expensive chunk of Sun hardware.  So I'm trying to be
able to say `Look, see?  see what you can do with FreeBSD' as I'm shown
out the door.

> I ran a couple of tests with MAP_NOSYNC to make sure that the
> fragmentation issue is real.  It definitely is.  If you create a
> file by ftruncate()ing it to a large size, then mmap() it SHARED +
> NOSYNC, then modify the file via the mmap, massive fragmentation occurs

I've heard it confirmed that even the newer INN does not mmap() the
newly-created files for makehistory or expire.  As reported to the
INN-workers mailing list:

: From: [EMAIL PROTECTED] (Richard Todd)
: Newsgroups: mailing.unix.inn-workers
: Subject: Re: expire/makehistory and mmap/madvise'd dbz filez
: Date: 4 Dec 2000 06:30:47 +0800
: Message-ID: <90ehin$1ndk$[EMAIL PROTECTED]>
:
: In servalan.mailinglist.inn-workers you write:
:
: >Moin moin
:
: >I'm engaged in a discussion on one of the FreeBSD developer lists
: >and I thought I'd verify the present source against my memory of how
: >INN 1.5 runs, to see if I might be having problems...
:
: >Anyway, the Makefile in the 1.5 expire directory has the following bit,
: >that seems to be absent in present source, and I didn't see any
: >obvious indication in the makedbz source as to how it's initializing
: >the new files, which, if done wrong, could trigger some bugs, at least
: >when `expire' is run.
:
: ># Build our own version of dbz.o for expire and makehistory, to avoid
: ># any -DMMAP in DBZCFLAGS - using mmap() for dbz in expire can slow it
: ># down really bad, and has no benefits as it pertains to the *new* .pag.
: >dbz.o: ../lib/dbz.c
: >	$(CC) $(CFLAGS) -c ../lib/dbz.c
:
: >Is this functionality in the newest expire, or do I need to go a hackin'?
:
: Whether dbz uses mmap or not on a given invocation is controlled by the
: dbzsetoptions() call; look for that call and setting of the INCORE_MEM
: option in expire/expire.c and expire/makedbz.c.  Neither expire nor
: makedbz mmaps the new dbz indices it creates.

The remaining condition I'm not positive about is the case of an
overflow, that ideally would not be a case to consider, and is not the
case on the machine now.

> on the file.  This is easily demonstrated by issuing a sequential read
> on the file and noting that the syste
Re: vm_pageout_scan badness
    ok, since I got about 6 requests in four hours to be Cc'd, I'm
    throwing this back onto the list.  Sorry for the double-response that
    some people are going to get!

    I am going to include some additional thoughts in the front, then break
    to my originally private email response.

    I ran a couple of tests with MAP_NOSYNC to make sure that the
    fragmentation issue is real.  It definitely is.  If you create a
    file by ftruncate()ing it to a large size, then mmap() it SHARED +
    NOSYNC, then modify the file via the mmap, massive fragmentation occurs
    on the file.  This is easily demonstrated by issuing a sequential read
    on the file and noting that the system is not able to do any clustering
    whatsoever and gets a measly 0.6MB/sec of throughput (on a disk that
    can do 12-15MB/sec).  (and the disk seeks wildly during the read).

    When you create a large file and fill it with zeros, THEN mmap() it
    SHARED + NOSYNC and write to it randomly via the mmap(), the file
    remains laid out on disk optimally.

    However, I noticed something interesting!  When I
    dd if=file of=/dev/null bs=32k the file the first time after randomly
    writing it and then fsync()ing it, I only get 4MB/sec of throughput.
    If I dd the file a second time I get around 8MB/sec.  If I dd it the
    third time I get the platter speed - 12-15MB/sec.

    The issue here has to do with the fact that the file is partially
    cached in the first two dd runs.  The partially cached file shortcuts
    the I/O clustering code, preventing it from issuing read-aheads once
    it hits a buffer that is already in the cache.  So if you have a
    spattering of cached blocks and then read a file sequentially, you
    actually get lower throughput than if you don't have *any* cached
    blocks and then read the file sequentially.  Verrry interesting!

    I think it may be beneficial to the clustering code to issue the
    full read-ahead even if some of the blocks in the middle are already
    cached.  The clustering code only operates when sequential operation
    is detected, so I don't think it can make things worse.

    large file == at least 2 x main memory.

    -- original response --

    Ok, let's concentrate on your hishave, artclean, artctrl, and overview
    numbers.

:-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
:-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
:-rw-rw-r--  1 news  news       8100 Dec  1 01:55 history.hash
:-rw-rw-r--  1 news  news       5400 Nov 30 22:49 history.index
:
:More observations that may or may not mean anything -- before rebooting,
:I timed the `fsync' commands on the 108MB and 72MB history files, as

    note: the fsync command will not flush MAP_NOSYNC pages.

:The time taken to do the `fsync' was around one minute for the two
:history files.  And around 1 second for the BerkeleyDB file...

    This is an indication of file fragmentation, probably due to holes
    in the history file being filled via the mmap() instead of filled
    via write().  In order for MAP_NOSYNC to be reasonable, you have to
    fix the code that extends a file via ftruncate()s to write() zeros
    into the extended portion.

:data getting flushed to disk, then it seems like someone's priorities
:are a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC
:flag, I'm sort of asking for preferential treatment, kinda like mlock,
:even though that's not available to me as `news' user.

    The pages are treated the way any VM page is treated... they'll be
    cached based on use.  I don't think this is the problem.

    Ok, let's look at a summary of your timing results:

	hishave		overv		artclean	artctrl
	38857(26474)	112176(6077)	12264(6930)	2297(308)
	22114(28196)	136855(6402)	12757(7295)	1257(322)
	13614(24312)	156723(6071)	13232(6800)	324(244)
	9944(25198)	164223(6620)	13441(7753)	255(160)
	2777(50732)	24979(3788)	29821(4017)	131(51)
	31975(11904)	21593(3320)	25148(3567)	5935(340)

    Specifically, look at the last one where it blew up on you.  hishave
    and artctrl are much worse, overview and artclean are about the same.
    This is an indication of excessive seeking on the history disk.  I
    believe that this seeking may be due to file fragmentation.

    There is an easy way to test file fragmentation.  Kill off everything
    and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for
    history.hash and history.index.  Look at the iostat on the history
    drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
    per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
    is severely fragmented.  Go through the entire history file(s) w/ dd...
    the fragmentation may occur near the end.

    If the file turns out to be fragmented, the only way to fix it is to
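The dd test above is just a sequential pass at a fixed block size.  For
those following along, the same pass in C looks like the sketch below; the
function names are invented here, and the 32k constant matches the
`dd bs=32k` in the command above.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BS (32 * 1024)          /* 32k reads, matching 'dd bs=32k' */

/* Number of read() calls a full sequential pass will issue. */
static long read_count(long filesize)
{
    return (filesize + BS - 1) / BS;
}

/* Sequential pass over an open descriptor, discarding the data, the
 * way 'dd of=/dev/null' does.  While this runs, 'iostat 1' shows the
 * KB/t achieved; ~8K/t means the file is severely fragmented.
 * Returns total bytes read, or -1 on a read error. */
static long sequential_pass(int fd)
{
    static char buf[BS];
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    return n < 0 ? -1 : total;
}
```

As Matt notes above, run it on a cold cache: partially cached blocks defeat
the clustering code, so the KB/t number only means something right after a
reboot (or against a freshly copied file).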
Re: vm_pageout_scan badness
:> I'm going to take this off of hackers and to private email.  My reply
:> will be via private email.
:
:Actually, I was enjoying the discussion, since I was learning something
:in the process of hearing you debug this remotely.
:
:It sure beats the K&R vs. ANSI discussion. :)
:
:Nate

    Heh.  Well, I didn't think there'd be as much interest as there is,
    so I guess I'll throw it back onto the mailing list.

						-Matt
Re: vm_pageout_scan badness
> I'm going to take this off of hackers and to private email.  My reply
> will be via private email.

Actually, I was enjoying the discussion, since I was learning something
in the process of hearing you debug this remotely.

It sure beats the K&R vs. ANSI discussion. :)

Nate
Re: vm_pageout_scan badness
:errr then keep me in the CC
:
:it's interesting
:
:--
: __--_|\   Julian Elischer
:/       \  [EMAIL PROTECTED]

    Sure thing.  Anyone else who wants to be in the Cc, email me.

						-Matt
Re: vm_pageout_scan badness
Matt Dillon wrote:
>
>     I'm going to take this off of hackers and to private email.  My reply
>     will be via private email.
>
>						-Matt

errr then keep me in the CC

it's interesting

--
      __--_|\   Julian Elischer
     /       \  [EMAIL PROTECTED]
    (   OZ    )  World tour 2000
---> X_.---._/   presently in:  Budapest
            v
Re: vm_pageout_scan badness
    I'm going to take this off of hackers and to private email.  My reply
    will be via private email.

						-Matt
Re: vm_pageout_scan badness
> :but at last look, history lookups and writes are accounting for more > :than half (!) of the INN news process time, with available idle time > :being essentially zero. So... > > No idle time? That doesn't sound like blocked I/O to me, it sounds > like the machine has run out of cpu. Um, I knew I'd be unclear somehow. The machine itself (with 2 CPUs) has plenty of idle time -- `top' reports typically 70-80% idle, and INN takes from 20-40% of CPU (being SMP, a process like `perl' locked to one CPU will appear around 98%, unlike a certain other OS that will show this percentage for the system total, rather than for a particular CPU). What I mean is that the INN process timer, which is basically Joe Greco's timer that wraps key functions with start/stop timer calls, showing where INN spends much of its time, is showing little to no idle time (meaning it couldn't take more articles in no matter how hard I push them). Let me show you the timer stats from the time I started things not long ago on this reader machine, where it's taking in backlogs: Dec 3 04:33:47 crotchety innd: ME time 300449 idle 376(4577) all times in milliseconds: elapsed time^^=5min ^^^idle time (numbers in parentheses are number of calls; only significant in calls like artwrite to show how many articles were actually written to spool, hiswrite to show how many unique articles were received over this time period, and hishave to show how many history lookups were done) artwrite 52601(6077) artlink 0(0) hiswrite 40200(7035) hissync 11(14) ^^^ 53 seconds writing articles ^^ 40 seconds updating history sitesend 647(12154) artctrl 2297(308) artcncl 2288(308) hishave 38857(26474) 39 seconds doing history lookups ^^ hisgrep 70(111) artclean 12264(6930) perl 13819(6838) overv 112176(6077) python 0(0) ncread 13818(21287) ncproc 284413(21287) Dec 3 04:38:48 crotchety innd: ME time 301584 idle 406(5926) artwrite 55774(6402) artlink 0(0) hiswrite 25483(7474) hissync 15(15) sitesend 733(12805) artctrl 1257(322) 
artcncl 1245(321) hishave 22114(28196) hisgrep 90(38) artclean 12757(7295)
perl 14696(7191) overv 136855(6402) python 0(0) ncread 14446(23235)
ncproc 284767(23235)

(as time passes and more of the MAP_NOSYNC file is in memory, the time
needed for history writes/lookups drops)

[...]

Dec  3 04:58:49 crotchety innd: ME time 300047 idle 566(6272)
artwrite 59850(6071) artlink 0(0) hiswrite 11630(6894) hissync 33(14)
sitesend 692(12142) artctrl 324(244) artcncl 320(244) hishave 13614(24312)
hisgrep 0(77) artclean 13232(6800) perl 14531(6727) overv 156723(6071)
python 0(0) ncread 15116(23838) ncproc 281745(23838)

Dec  3 05:03:49 crotchety innd: ME time 300018 idle 366(5936)
artwrite 56956(6620) artlink 0(0) hiswrite 8850(7749) hissync 7(15)
sitesend 760(13240) artctrl 255(160) artcncl 255(160) hishave 9944(25198)
hisgrep 0(31) artclean 13441(7753) perl 15605(7620) overv 164223(6620)
python 0(0) ncread 14783(24123) ncproc 282791(24123)

Most of the time is spent on the BerkeleyDB overview now.  This is
probably because some reader is giving repeated commands pounding the
overview database.  That reader's IP now has a different gateway address,
and won't be bothering me for a while.

Now, for a reference, here are the timings on a transit-only machine with
no readers, after it's been running for a while:

Dec  3 05:22:09 news-feed69 innd: ME time 30 idle 91045(91733)
                a reasonable amount of idle time ^^
artwrite 48083(2096) artlink 0(0) hiswrite 1639(2096) hissync 33(11)
sitesend 4291(12510) artctrl 0(0) artcncl 0(0) hishave 1600(30129)
hisgrep 0(0) artclean 25591(2121) perl 79(2096) overv 0(0) python 0(0)
ncread 69798(147925) ncproc 108624(147919)

A total of just over 3 seconds out of every 300 seconds spent on history
activity.
That's reflected by the timestamps on the NOSYNC'ed history database
(index/hash) files you see here:

-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
-rw-rw-r--  1 news  news       8100 Dec  1 01:55 history.hash
-rw-rw-r--  1 news  news       5400 Nov 30 22:49 history.index

However, the timings shown by `top' here show from 10 to 20% idle CPU
time, even though INN itself has capacity to do more work.

The problem is that I'm not seeing this on the reader box.  Or if I do
see it, it doesn't last long.  The timestamps on the above files are
pretty much current, in spite of the files being NOSYNC'ed.

> :As is to be expected, INN increases in size as it does history lookups
> :and updates, and the amount of memory shown as Active tracks this,
> :more or less.  But what's happening to the Free value!  It's going
> :down at as much as 4MB per `top' interval.  Or should I say, what is
> :happening to the Inactive value -- it's constan
Re: vm_pageout_scan badness
:closely the pattern of what happens to the available memory following :a fresh boot... At the moment, this (reader) machine has been up for :half a day, with performance barely able to keep up with a full feed :(but starting to slip as the overnight burst of binaries is starting), :but at last look, history lookups and writes are accounting for more :than half (!) of the INN news process time, with available idle time :being essentially zero. So... No idle time? That doesn't sound like blocked I/O to me, it sounds like the machine has run out of cpu. :Following the boot, things start out with plenty of memory Free, and :something like 4MB Active, which seems reasonable to me. Then I start :things. : :As is to be expected, INN increases in size as it does history lookups :and updates, and the amount of memory shown as Active tracks this, :more or less. But what's happening to the Free value! It's going :down at as much as 4MB per `top' interval. Or should I say, what is :happening to the Inactive value -- it's constantly increasing, and I :observe a rapid migration of all the Free memory to Inactive, until :the value of Inactive peaks out at the time that Free drops to about :996k, beyond which it changes little. None of the swap space has :been touched yet. : :As soon as the value for Free hits bottom and that of Inactive has :reached a max, now the migration happens from Inactive to Active -- :until this point, the value of Active has been roughly what I would :expect to see, given the size of the history hash/index files, and :the BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite :improvement in overview access times. Hmm. An increasing 'inactive' most often occurs when a program is reading a file sequentially. It sounds like most of the inactive pages are probably due to reader requests from the spool. :> Is it possible that history file rewriting is creating an issue? 
Doesn't
:> INN rewrite the history file every once in a while to clear out old
:> garbage?  I'm not up on the latest INN.
:
:In normal operation, no -- the text file is append-only (the text file
:isn't used for lookups with the MD5-based hashing), and expire, which
:I'm running manually, rewrites the hash files -- leading to a mysterious
:lack of space today when I attempted to run both expire and makedbz (a
:variant of makehistory), and apparently some reader processes or some
:daemons still had the old inodes open, until suddenly in one swell foop,
:some 750MB was freed up -- far more than I expected to see, so I should
:probably look into this space usage sometime...
:
:This shouldn't be a problem the way I'm running things now.  I haven't
:run an expire process since the last reboot to observe things closely.

    Whoa.  750MB?  There are only two things that can cause that:

    * A process with hundreds of megabytes of private store exited

    * A large (500+ MB) file is deleted after having previously been
      mmap()'d.  (or the process holding the last open descriptor to
      the file, after deletion, now exits).

    If I remember INN right, there is a situation that can occur here...
    the reader processes open up the history file in order to implement
    certain NNTP commands.  I'm trying to remember which one... I think
    it's one of the search commands.  Fubar... anyone remember which
    NNTP command opens up the history file?

    In any case, I remember at BEST I had to completely disable that
    command when running INN because it caused long-running reader
    processes to keep a descriptor open on now-deleted history files.
    When you do an expire run which replaces the history file, the
    original (now deleted) history file may still be open by those
    reader processes.  This could easily account for your problems.
    This sort of situation occurs most often when there is no timeout,
    or too long a timeout, in the reader processes, and/or if tcp
    keepalives are not turned on, plus when certain NNTP commands (used
    mostly by abusers, by the way, which try to download feeds via
    their reader access) are enabled.

    I would immediately research this... look for reader processes that
    have hung around too long and try killing them, then see if that
    clears out some memory.

    There will also be a serious file fragmentation issue using
    MAP_NOSYNC in the expire process.  You can probably use MAP_NOSYNC
    safely in the INND core, but don't use it to rebuild the history
    file in the expire process.

					-Matt
Re: vm_pageout_scan badness
> :> Personally speaking, I would much rather use MAP_NOSYNC anyway, > even with > :... > :Everything starts out well, where the history disk is beaten at startup > :but as time passes, the time taken to do lookups and writes drops down > :to near-zero levels, and the disk gets quiet. And actually, the transit > :... > :What I notice is that the amount of memory used keeps increasing, until > :it's all used, and the Free amount shown by `top' drops to a meg or so. > :Cache and Buf get a bit, but most of it is Active. Far more than is > :accounted for by the processes. > > This is to be expected, because the dirty MAP_NOSYNC pages will not > be written out until they are forced out, or by msync(). I just discovered the user command `fsync' which has revealed a few things to me, clearing up some mysteries. Also, I've watched more closely the pattern of what happens to the available memory following a fresh boot... At the moment, this (reader) machine has been up for half a day, with performance barely able to keep up with a full feed (but starting to slip as the overnight burst of binaries is starting), but at last look, history lookups and writes are accounting for more than half (!) of the INN news process time, with available idle time being essentially zero. So... > :Now, what happens on the reader machine is that after some time of the > :Active memory increasing, it runs out and starts to swap out processes, > :and the timestamps on the history database files (.index and .hash, this > :is the md5-based history) get updated, rather than remaining at the > :time INN is started. Then the rapid history times skyrocket until it > :takes more than 1/4 of the time. I don't see this on the transit boxen > :even after days of operation. > > Hmm. That doesn't sound right. Free memory should drop to near zero, > but then what should happen is the pageout daemon should come along > and deactivate a big chunk of the 'active' pages... 
so you should > see a situation where you have, say, 200MB worth of active pages > and 200MB worth of inactive pages. After that the pageout daemon > should start paging out the inactive pages and increasing the 'cache'. > The number of 'free' pages will always be near zero, which is to be > expected. But it should not be swapping out any process. Here is what I noticed while watching the `top' values for Active, Inactive, and Free following this last boot (I didn't pay any attention to the other fields to notice any wild fluctuations there, next time maybe), on this machine with 512MB of RAM, if it reveals anything: Following the boot, things start out with plenty of memory Free, and something like 4MB Active, which seems reasonable to me. Then I start things. As is to be expected, INN increases in size as it does history lookups and updates, and the amount of memory shown as Active tracks this, more or less. But what's happening to the Free value! It's going down at as much as 4MB per `top' interval. Or should I say, what is happening to the Inactive value -- it's constantly increasing, and I observe a rapid migration of all the Free memory to Inactive, until the value of Inactive peaks out at the time that Free drops to about 996k, beyond which it changes little. None of the swap space has been touched yet. As soon as the value for Free hits bottom and that of Inactive has reached a max, now the migration happens from Inactive to Active -- until this point, the value of Active has been roughly what I would expect to see, given the size of the history hash/index files, and the BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite improvement in overview access times. Anyway, I don't remember what values exactly I was seeing for Free and Inactive or Active, since I was just watching for general trends, but I seem to recall Active being ~100MB, and Inactive somewhat more. 
(Are you saying above that this Inactive value should be migrating to Cache, which I'm not seeing, rather than to Active, which I do see? If so, then hmmm.) Now memory is drifting at a fairly rapid pace from Inactive (the meaning of which I'm not exactly clear about, although there's some explanation in the `top' man page that hasn't quite clicked into understanding yet), over to the Active field, at something like 2MB or so per `top' interval. Free remains close to 1MB, but Active is constantly growing, although no processes are clearly taking up any of this, apart from INN which only accounts for around 100MB at this time, and isn't increasing at the rate of increase of Active memory. Anyway, the Active field continues to increase as Inactive decreases until finally Inactive bottoms out, down from several hundred MB to a one or two digit MB value (I don't remember exactly), while Active has increased to almost 400MB. This is something like 20 minutes after the reboot, and now the first bit of swap gets hit. However, the value of A
Re: vm_pageout_scan badness
:> Personally speaking, I would much rather use MAP_NOSYNC anyway, even with :> a fixed filesystem syncer. MAP_NOSYNC pages are not restricted by :... : :Yeah, no kidding -- here's what I see it screwing up. First, some :background: : :I've built three news machines, two transit boxen and one reader box, :with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested :NetBSD, more on that later), and a brief detour into 5-current. :.. : :Everything starts out well, where the history disk is beaten at startup :but as time passes, the time taken to do lookups and writes drops down :to near-zero levels, and the disk gets quiet. And actually, the transit :... : :What I notice is that the amount of memory used keeps increasing, until :it's all used, and the Free amount shown by `top' drops to a meg or so. :Cache and Buf get a bit, but most of it is Active. Far more than is :accounted for by the processes. This is to be expected, because the dirty MAP_NOSYNC pages will not be written out until they are forced out, or by msync(). :Now, what happens on the reader machine is that after some time of the :Active memory increasing, it runs out and starts to swap out processes, :and the timestamps on the history database files (.index and .hash, this :is the md5-based history) get updated, rather than remaining at the :time INN is started. Then the rapid history times skyrocket until it :takes more than 1/4 of the time. I don't see this on the transit boxen :even after days of operation. Hmm. That doesn't sound right. Free memory should drop to near zero, but then what should happen is the pageout daemon should come along and deactivate a big chunk of the 'active' pages... so you should see a situation where you have, say, 200MB worth of active pages and 200MB worth of inactive pages. After that the pageout daemon should start paging out the inactive pages and increasing the 'cache'. The number of 'free' pages will always be near zero, which is to be expected. 
But it should not be swapping out any process. The actual amount of 'free' memory in the system is actually 'free+cache' pages. :Now, what happens when I stop INN and everything news-related is that :some memory is freed up, but still, there can be, say, 400MB still :reported as Active. More when I had a full gig in this machine to :... : :Then, when I reboot the machine, it gives the kernel messages about :syncing disks; done, and then suddenly the history drive light goes :on and it starts grinding for five minutes or so, before the actual :reboot happens. Right. This is to be expected. You have a lot of dirty pages in the system due to the use of MAP_NOSYNC that have to be flushed out. :No history activity happens when I shut down INN normally, which should :free the MAP_NOSYNC'ed pages and make them available to be written to :disk before rebooting, maybe. MAP_NOSYNC pages are not flushed when the referencing program exits. They stick around until they are forced out. You can flush them manually by using a mmap()/msync() combination. i.e. an msync() prior to munmap()ing (from INND only) ought to do it. :What I think is happening, based on these observations, is that the :data from the history hash files (less than 100MB) gets read into :memory, but the updates to it are not written over the data to be :replaced -- it's simply appended to, up to the limit of the available :memory. When this limit is reached on the transit machines, then :things stabilize and old pages get recycled (but still, more memory :overall is used than the size of the actual file). It doesn't append... the pages are reused. The set of 'active' pages in the VM system is effectively the set of all files accessed for the entire system, not just MAP_NOSYNC pages. If you are only MAP_NOSYNC'ing 100MB worth of pages, then only 100MB worth of pages will be left unflushed. Is it possible that history file rewriting is creating an issue? 
Doesn't INN rewrite the history file every once in a while to clear out old garbage? I'm not up on the latest INN. :I'm guessing that additional activity of the reader machine causes :jumps in memory usage not seen on the transit machines, that is enough :to force some of the unwritten dirty pages to be written to the :history file, as a few megs of swap get used, which is why it does :not stabilize as `nicely' as the transit machines. This makes sense... the amount of swap that gets used is critical. If we are talking about only a few megabytes, then your system is *not* swapping significantly, it is simply swapping out completely idle pages from things like idle getty's and such. This is a good thing. The disk activity would thus be mostly due to MAP_NOSYNC pages being written out. :Now, something I contemplated -- it seems that Bad Undesirable Things :happen as soon as I start
Re: vm_pageout_scan badness
Long ago, it was written here on 25 Oct 2000 by Matt Dillon:

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
> :...
> :					Terry Lambert
> :					[EMAIL PROTECTED]
>
> This isn't exactly what I was talking about.  The issue in regards to
> the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.
[...]
> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem
> that Alfred has been talking about) is because the filesystem
> syncer is 'broken' in regards to generating unnecessarily long stalls.
>
> Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
> a fixed filesystem syncer.  MAP_NOSYNC pages are not restricted by
> the size of the filesystem buffer cache, so you can have a whole
> lot more dirty pages in the system than you would normally be able to
> have.  This 'feature' has had the unfortunate side effect of screwing
> up *THWACK*

Yeah, no kidding -- here's what I see it screwing up.  First, some
background:

I've built three news machines, two transit boxen and one reader box,
with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested
NetBSD, more on that later), and a brief detour into 5-current.

The two transit boxes have somewhere on the order of ~400MB memory or
less; the amount I've put in the reader box has increased up to a Gig
as I try to figure out what's happening.
I'm using the MAP_NOSYNC on the history database files on all to try to get the NetBSD performance of not hitting history, and I've made a couple other minor tweaks to use mmap where the INN history code probably should, but doesn't. Everything starts out well, where the history disk is beaten at startup but as time passes, the time taken to do lookups and writes drops down to near-zero levels, and the disk gets quiet. And actually, the transit machines stay that way, while the reader machine gives me problems after some time. What I notice is that the amount of memory used keeps increasing, until it's all used, and the Free amount shown by `top' drops to a meg or so. Cache and Buf get a bit, but most of it is Active. Far more than is accounted for by the processes. Now, what happens on the reader machine is that after some time of the Active memory increasing, it runs out and starts to swap out processes, and the timestamps on the history database files (.index and .hash, this is the md5-based history) get updated, rather than remaining at the time INN is started. Then the rapid history times skyrocket until it takes more than 1/4 of the time. I don't see this on the transit boxen even after days of operation. Now, what happens when I stop INN and everything news-related is that some memory is freed up, but still, there can be, say, 400MB still reported as Active. More when I had a full gig in this machine to try to keep it from swapping, all of which got used... Then, when I reboot the machine, it gives the kernel messages about syncing disks; done, and then suddenly the history drive light goes on and it starts grinding for five minutes or so, before the actual reboot happens. No history activity happens when I shut down INN normally, which should free the MAP_NOSYNC'ed pages and make them available to be written to disk before rebooting, maybe. 
I'm also running BerkeleyDB for the reader overview on this machine, and I just discovered that I had applied MAP_NOSYNC to an earlier release, but the library linked in had not had this -- I just fixed that and am running that way now (and see a noticeable improvement) so now when I reboot, I may see both the overview database disk and the history disk get some pre-reboot activity, if what I think is happening really is happening. What I think is happening, based on these observations, is that the data from the history hash files (less than 100MB) gets read into memory, but the updates to it are not written over the data to be replaced -- it's simply appended to, up to the limit of the available memory. When this limit is reached on the transit machines, then things stabilize and old pages get recycled (but still, more memory overall is used than the size of the actual file). I'm guessing that additional activity of the reader machine causes jumps in memory usage not seen on the transit machines, that is enough to force some of the unwritten dirty pages to be written to the history file, as a few megs of swap get used, which is why it does not sta
Re: vm_pageout_scan badness
On Wed, 25 Oct 2000 21:54:42 +0000 (GMT), Terry Lambert
<[EMAIL PROTECTED]> wrote:
>I think the idea of a fixed limit on the FS buffer cache is
>probably wrong in the first place; certainly, there must be
>high and low reserves, but:
>
>|----------------------|  all of memory
>     |-----------------|  FS allowed use
>|---------------|         non-FS allowed use
>|----|                    non-FS reserve
>                 |-----|  FS reserve
>
>...in other words, a reserve-based system, rather than a limit
>based system.

This is what Compaq Tru64 (aka Digital UNIX aka OSF/1) does.  It splits
physical RAM as follows:

|---------------------------|  physical RAM
|--|                           static wired memory
   |------------------------|  managed memory
   |===----|                   dynamic wired memory
       |====-------|           UBC memory
             |--------------|  VM

The default configuration provides:
- up to 80% of RAM can be wired.
- UBC (unified buffer cache) uses a minimum of 10% RAM and can use up
  to 100% RAM.
- The VM subsystem can steal UBC pages if the UBC is using >20% RAM

There's no minimum limit for VM space.  The UBC can't directly steal VM
pages, just pages off the common free list.  The VM manages the free
list by paging and swapping based on target page counts (fixed number
of pages, not % of RAM).  The FS metadata cache is a fixed-size wired
pool.

I can think of benefits with the ability to separately control FS and
non-FS RAM usage.  The Tru64 defaults are definitely a very poor match
with the application we run on it[1], and being able to reduce the RAM
associated with filesystem buffers is an advantage.

[1] Basically a number of processes querying a _very_ large Oracle SGA.

Peter
Re: vm_pageout_scan badness
:On Tue, Oct 24, 2000 at 01:10:19PM -0700, Matt Dillon wrote:
:> Ouch.  The original VM code assumed that pages would not often be
:> ripped out from under the pageadaemon, so it felt free to restart
:> whenever.  I think you are absolutely correct in regards to the
:> clustering code causing nearby-page ripouts.
:>
:> I don't have much time available, but let me take a crack at the
:> problem tonight.
:
:While you are at it, would you care to have a look at PR19672?  It
:seems to be at least remotely relevant. ;-)

    Hmmm.  Blech.  contigmalloc is awful.  I'm not even sure if what it
    is doing is legal!  If it can't find contiguous space it tries to
    flush the entire inactive and active queues.  Every single page!
    Not to mention the insane restarting.  The algorithm is O(N^2) on
    an idle machine, and even worse on machines that might be doing
    something.

    There is no easy fix.  contigmalloc would have to be completely
    rewritten.  We could use the placemarker idea to make the loop
    'retry' the page that blocked rather than restart at the beginning,
    but the fact that contigmalloc tries to flush the entire page queue
    means that it could very trivially get stuck on dead devices (e.g.
    like a dead NFS mount).  Also, if we don't restart, there is less
    of a chance that contigmalloc can find sufficient free space.  When
    it frees pages it does so haphazardly, and when it flushes pages
    out it makes no attempt to free them, so an active process may
    reuse the page instantly.

    Bleh.  I'm afraid I don't have the time to rewrite contigmalloc
    myself, but my brain is available to answer questions if someone
    else wants to have a go at it.

					-Matt

:Cheers,
:%Anton.
:--
: and would be a nice addition
:to HTML specification.
:
VM pager patch (was Re: vm_pageout_scan badness)
Here's a test patch, inclusive of some debugging sysctls:

    vm.always_launder
	set to 1 to give up on trying to avoid pageouts.

    vm.vm_pageout_stats_rescans
	Number of times the main inactive scan in the pageout loop had
	to restart

    vm.vm_pageout_stats_xtralaunder
	Number of times a second pass had to be taken (in normal mode,
	with always_launder set to 0).

This patch:

    * implements a placemarker to try to avoid restarts.

    * does not penalize the pageout daemon for being able to cluster
      writes.

    * adds an additional vnode check that should be there

One last note:  I wrote a quick and dirty program to mmap() a bunch of
big files MAP_NOSYNC and then dirty them in a loop.  I noticed that the
filesystem update daemon 'froze up' the system for about a second every
30 seconds due to the huge number of dirty MAP_NOSYNC pages (about 1GB
worth) sitting around (it has to scan the vm_page_t's even if it
doesn't do anything with them).  This is a separate issue.

If Alfred, and others running heavily loaded systems, are able to test
this patch sufficiently, we can include it (minus the debugging
sysctls) in the release.  If not, I will wait until after the release
is rolled before I commit it or whatever the final patch winds up
looking like.
					-Matt

Index: vm_page.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_page.c,v
retrieving revision 1.147.2.3
diff -u -r1.147.2.3 vm_page.c
--- vm_page.c	2000/08/04 22:31:11	1.147.2.3
+++ vm_page.c	2000/10/26 04:43:22
@@ -1783,6 +1783,12 @@
 		    ("contigmalloc1: page %p is not PQ_INACTIVE", m));
 		next = TAILQ_NEXT(m, pageq);
+
+		/*
+		 * ignore markers
+		 */
+		if (m->flags & PG_MARKER)
+			continue;
+
 		if (vm_page_sleep_busy(m, TRUE, "vpctw0"))
 			goto again1;
 		vm_page_test_dirty(m);
Index: vm_page.h
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_page.h,v
retrieving revision 1.75.2.3
diff -u -r1.75.2.3 vm_page.h
--- vm_page.h	2000/09/16 01:08:03	1.75.2.3
+++ vm_page.h	2000/10/26 04:17:28
@@ -251,6 +251,7 @@
 #define PG_SWAPINPROG	0x0200		/* swap I/O in progress on page */
 #define PG_NOSYNC	0x0400		/* do not collect for syncer */
 #define PG_UNMANAGED	0x0800		/* No PV management for page */
+#define PG_MARKER	0x1000		/* special queue marker page */
 
 /*
  * Misc constants.
Index: vm_pageout.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_pageout.c,v
retrieving revision 1.151.2.4
diff -u -r1.151.2.4 vm_pageout.c
--- vm_pageout.c	2000/08/04 22:31:11	1.151.2.4
+++ vm_pageout.c	2000/10/26 05:07:45
@@ -143,6 +143,9 @@
 static int disable_swap_pageouts=0;
 static int max_page_launder=100;
+static int always_launder=0;
+static int vm_pageout_stats_rescans=0;
+static int vm_pageout_stats_xtralaunder=0;
 #if defined(NO_SWAPPING)
 static int vm_swap_enabled=0;
 static int vm_swap_idle_enabled=0;
@@ -186,6 +189,12 @@
 SYSCTL_INT(_vm, OID_AUTO, max_page_launder,
 	CTLFLAG_RW, &max_page_launder, 0, "Maximum number of pages to clean per pass");
+SYSCTL_INT(_vm, OID_AUTO, always_launder,
+	CTLFLAG_RW, &always_launder, 0, "Always launder on the first pass");
+SYSCTL_INT(_vm, OID_AUTO, vm_pageout_stats_rescans,
+	CTLFLAG_RD, &vm_pageout_stats_rescans, 0, "");
+SYSCTL_INT(_vm, OID_AUTO, vm_pageout_stats_xtralaunder,
+	CTLFLAG_RD, &vm_pageout_stats_xtralaunder, 0, "");
 
 #define VM_PAGEOUT_PAGE_COUNT 16
@@ -613,11 +622,16 @@
 /*
  *	vm_pageout_scan does the dirty work for the pageout daemon.
+ *
+ *	This code is responsible for calculating the page shortage
+ *	and then attempting to clean or free enough pages to hit that
+ *	mark.
  */
 static int
 vm_pageout_scan()
 {
 	vm_page_t m, next;
+	struct vm_page marker;
 	int page_shortage, maxscan, pcount;
 	int addl_page_shortage, addl_page_shortage_init;
 	int maxlaunder;
@@ -651,27 +665,41 @@
 	/*
 	 * Figure out what to do with dirty pages when they are encountered.
 	 * Assume that 1/3 of the pages on the inactive list are clean.  If
 	 * we think we can reach our target, disable laundering (do not
 	 * clean any dirty pages).  If we
Re: vm_pageout_scan badness
> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
>
> This isn't exactly what I was talking about.  The issue in regards to
> the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.

This seems like a broken (non)use of _SYNC parameters, but I definitely
remember now about the FreeBSD breakage in the dirty page sync case not
knowing what pages should be sync'ed or not, in the mmap region sync
case of msync() degrading to fsync().  I guess O_WRITESYNC or msync()
fixing is not an option?

> The current VM system already does a good job in allowing files
> to steal pages from themselves.  The sequential I/O detection
> heuristic depresses the priority of pages as they are read making
> it more likely for them to be reused.  Since sequential I/O tends
> to be the biggest abuser of file cache, the current FreeBSD
> algorithms work well in real-life situations.  We also have a few
> other optimizations to reuse pages in there that I had added a year
> or so ago (or fixed up, in the case of the sequential detection
> heuristic).

The biggest abuser that I have seen of this is actually not sequential.
It is a linker that mmap()'s the object files, and then seeks all over
creation to do the link, forcing all other pages out of core.  I think
the assumption that this is a sequential access problem, instead of a
more general problem, is a bad one (FWIW, building per-vnode working
set quotas fixed the problem with the linker being antagonistic).
> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing
> the problem that Alfred has been talking about) is because the
> filesystem syncer is 'broken' in regards to generating
> unnecessarily long stalls.

It doesn't stall when it should?  8-) 8-).  I think this is a case of
needing to eventually pay the piper for the music being played.  If
the pages are truly anonymous, then they don't need sync'ed; if they
aren't, then they do need sync'ed.  It sounds to me that if they are
seeing long stalls, it's the msync() bug with not being able to tell
what's dirty and what's clean...

> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with a fixed filesystem syncer.  MAP_NOSYNC pages are not
> restricted by the size of the filesystem buffer cache,

I see this as a bug in the non-MAP_NOSYNC case in FreeBSD's use of
vnodes as synonyms for vm_object_t's.

I really doubt, though, that they are exceeding the maximum file size
with a mapping; if not, then the issue is tuning.  The limits on the
size of the FS buffer cache are arbitrary; it should be possible to
relax them.

Again, I think the biggest problem here is historical, and it derives
from the ability to dissociate a vnode with pages still hung off it
from the backing inode (a cache bust).  I suspect that if they
increased the size of the ihash cache, they would see much better
characteristics.

My personal preference would be to not dissociate valid but clean
pages from the reference object, until absolutely necessary.  An easy
fix for this would be to allow the FS to own the vnodes, not have a
fixed-size pool, and have a struct like:

	struct ufs_vnode {
		struct vnode;
		struct ufs_in_core_inode;
	};

And pass that around as if it were just a vnode, giving it back to the
VFS that owned it, instead of using a system reclaim method, in order
to reclaim it.  Then if an ihash reclaim was wanted, it would have to
free up the vnode resources to get it.
Using high and low watermarks instead of a fixed pool would complete the picture (the use of a fixed per-FS ihash pool in combination with a high/low watermarked per-system vnode pool is part of what causes the problem in the first place; an analytical mechanic or electronics buff would call this a classic case of "impedance mismatch").

> so you can have a whole lot more dirty pages in the system than you would normally be able to have.

E.g. they are working around an arbitrary, and wrong-for-them, administrative limit, instead of changing it. Bletch.

> This 'feature' has had the unfortunate side effect of screwing up the pageout daemon's algorithms, but that's fixable.

I think the idea of a fixed limit on the FS buffer cache is probably wrong in the first place; certainly, there must be high and low reserves, but:

	|--| all of memory |-| FS allowed use |---
Re: vm_pageout_scan badness
:
:Consider that a file with a huge number of pages outstanding
:should probably be stealing pages from its own LRU list, and
:not the system, to satisfy new requests. This is particularly
:true of files that are demanding resources on a resource-bound
:system.
:...
:	Terry Lambert
:	[EMAIL PROTECTED]

This isn't exactly what I was talking about. The issue in regards to the filesystem syncer is that it fsync()'s an entire file. If you have a big file (e.g. a USENET news history file) the filesystem syncer can come along and exclusively lock it for *seconds* while it is fsync()ing it, stalling all activity on the file every 30 seconds.

The current VM system already does a good job in allowing files to steal pages from themselves. The sequential I/O detection heuristic depresses the priority of pages as they are read, making it more likely for them to be reused. Since sequential I/O tends to be the biggest abuser of file cache, the current FreeBSD algorithms work well in real-life situations. We also have a few other optimizations to reuse pages in there that I had added a year or so ago (or fixed up, in the case of the sequential detection heuristic).

One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem that Alfred has been talking about) is because the filesystem syncer is 'broken' in regards to generating unnecessarily long stalls.

Personally speaking, I would much rather use MAP_NOSYNC anyway, even with a fixed filesystem syncer. MAP_NOSYNC pages are not restricted by the size of the filesystem buffer cache, so you can have a whole lot more dirty pages in the system than you would normally be able to have. This 'feature' has had the unfortunate side effect of screwing up the pageout daemon's algorithms, but that's fixable.

					-Matt
Re: vm_pageout_scan badness
On Tue, Oct 24, 2000 at 01:10:19PM -0700, Matt Dillon wrote:

> Ouch. The original VM code assumed that pages would not often be ripped out from under the pagedaemon, so it felt free to restart whenever. I think you are absolutely correct in regards to the clustering code causing nearby-page ripouts.
>
> I don't have much time available, but let me take a crack at the problem tonight.

While you are at it, would you care to have a look at PR19672? It seems to be at least remotely relevant. ;-)

> I don't think we want to add another workaround to code that already has too many of them. The solution may be to create a dummy placemarker vm_page_t and to insert it into the pagelist just after the current page, after we've locked it and decided we have to do something significant to it. We would then be able to pick the scan up where we left off using the placemarker.
>
> This would allow us to get rid of the restart code entirely, or at least devolve it back into its original design (i.e. something that would not happen very often). Since we already have cache locality of reference for the list node, the placemarker idea ought to be quite fast.
>
> I'll take a crack at implementing the openbsd (or was it netbsd?) partial fsync() code as well, to prevent the update daemon from locking up large files that have lots of dirty pages for long periods of time.

Cheers,
%Anton.
--
and would be a nice addition to HTML specification.
Re: vm_pageout_scan badness
> > I'll take a crack at implementing the openbsd (or was it netbsd?) partial
> > fsync() code as well, to prevent the update daemon from locking up large
> > files that have lots of dirty pages for long periods of time.
>
> Making the partial fsync would help some people but probably not these folks.

I think this would be better handled as a per-file working set quota, which could not be exceeded unless changed by root.

Consider that a file with a huge number of pages outstanding should probably be stealing pages from its own LRU list, and not the system, to satisfy new requests. This is particularly true of files that are demanding resources on a resource-bound system.

> The people getting hit by this are Yahoo! boxes, they have giant areas of NOSYNC mmap'd data, the issue is that for them the first scan through the loop always sees dirty data that needs to be written out. I think they also need a _lot_ more than 32 pages cleaned per pass because all of their pages need laundering.

First principles? What are they doing, such that this situation arises in the first place? Having a clue about the problem they are trying to resolve, which causes this problem as a side effect, would help clarify both whether there is a better solution for them, and what FreeBSD should potentially do instead, when/if the situation arose.

> It might be wise to switch to a 'launder mode' if this sort of usage pattern is detected and figure some better figure to use than 32, I was hoping you'd have some suggestions for a heuristic to detect this along the lines of what you have implemented in bufdaemon.

This is kind of evil. You could do low and high watermarking, as you suggest, but without any idea of the queue retention time to expect, and how bursty the situation is, there's no way to pick an appropriate algorithm.
	Terry Lambert
	[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present or previous employers.
Re: vm_pageout_scan badness
:Ok, now I feel pretty lost, how is there a relationship between
:max_page_launder and async writes? If increasing max_page_launder
:increases the amount of async writes, isn't that a good thing?

The async writes are competing against the rest of the system for disk resources. While it is ok for an async write to stall, the fact that it will cause other processes' read() or page-in (which is nominally synchronous) requests to stall can result in seriously degraded operation for those processes. Piling on the number of async writes running in parallel is not going to improve the performance of the pageout daemon, but it will degrade the performance of I/O issued by other processes in the system.

The only two reasons the pageout daemon is not doing synchronous writes are: (1) because it can't afford to stall on a slow device (or NFS, etc.), and (2) so it can parallelize I/O across different devices. But since the pageout daemon isn't really all that smart and doesn't track what it does, the whole algorithm devolves into issuing a certain number of asynchronous I/Os all at once, governed by max_page_launder.

					-Matt
Re: vm_pageout_scan badness
* Matt Dillon <[EMAIL PROTECTED]> [001024 15:32] wrote:
>
> :The people getting hit by this are Yahoo! boxes, they have giant areas
> :of NOSYNC mmap'd data, the issue is that for them the first scan through
> :the loop always sees dirty data that needs to be written out. I think
> :they also need a _lot_ more than 32 pages cleaned per pass because all
> :of their pages need laundering.
> :
> :Perhaps if you detected how often the routine was being called you
> :could slowly raise max_page_launder to compensate and lower it
> :after some time without a shortage. Perhaps adding a quarter of
> :'should_have_laundered' to maxlaunder for a short interval.
> :
> :It might be wise to switch to a 'launder mode' if this sort of
> :usage pattern is detected and figure some better figure to use than
> :32, I was hoping you'd have some suggestions for a heuristic to
> :detect this along the lines of what you have implemented in bufdaemon.
>
> We definitely don't want to increase max_page_launder too much... the problem is that there is a relationship between it and the number of simultaneous async writes that can be queued in one go, and that can interfere with normal I/O. But perhaps we should decouple it from the I/O count and have it count clusters instead of pages. i.e. this line:

Ok, now I feel pretty lost: how is there a relationship between max_page_launder and async writes? If increasing max_page_launder increases the amount of async writes, isn't that a good thing?

>	written = vm_pageout_clean(m);
>	if (vp)
>		vput(vp);
>	maxlaunder -= written;
>
> Can turn into:
>
>	if (vm_pageout_clean(m))
>		--maxlaunder;
>	if (vp)
>		vput(vp);
>
> In regards to speeding up paging, perhaps we can implement a heuristic similar to what buf_daemon() does. We could wake the pageout daemon up more often. I'll experiment with it a bit. We certainly have enough statistical information to come up with something good.
That looks like it would help by ignoring the clustered data, which probably got written out pretty quickly, and reducing the negative cost/gain to a single page.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: vm_pageout_scan badness
:The people getting hit by this are Yahoo! boxes, they have giant areas
:of NOSYNC mmap'd data, the issue is that for them the first scan through
:the loop always sees dirty data that needs to be written out. I think
:they also need a _lot_ more than 32 pages cleaned per pass because all
:of their pages need laundering.
:
:Perhaps if you detected how often the routine was being called you
:could slowly raise max_page_launder to compensate and lower it
:after some time without a shortage. Perhaps adding a quarter of
:'should_have_laundered' to maxlaunder for a short interval.
:
:It might be wise to switch to a 'launder mode' if this sort of
:usage pattern is detected and figure some better figure to use than
:32, I was hoping you'd have some suggestions for a heuristic to
:detect this along the lines of what you have implemented in bufdaemon.

We definitely don't want to increase max_page_launder too much... the problem is that there is a relationship between it and the number of simultaneous async writes that can be queued in one go, and that can interfere with normal I/O. But perhaps we should decouple it from the I/O count and have it count clusters instead of pages. i.e. this:

	written = vm_pageout_clean(m);
	if (vp)
		vput(vp);
	maxlaunder -= written;

Can turn into:

	if (vm_pageout_clean(m))
		--maxlaunder;
	if (vp)
		vput(vp);

In regards to speeding up paging, perhaps we can implement a heuristic similar to what buf_daemon() does. We could wake the pageout daemon up more often. I'll experiment with it a bit. We certainly have enough statistical information to come up with something good.

					-Matt

:-Alfred
:
Re: vm_pageout_scan badness
* Matt Dillon <[EMAIL PROTECTED]> [001024 13:11] wrote:
> Ouch. The original VM code assumed that pages would not often be ripped out from under the pagedaemon, so it felt free to restart whenever. I think you are absolutely correct in regards to the clustering code causing nearby-page ripouts.

Yes, it would make sense to me that if you did a sequential write to a file, after some time those pages would likely end up in order on the inactive queue, and when they are cluster-written, 'next' would be on a different queue because it was written along with the preceding page.

> I don't have much time available, but let me take a crack at the problem tonight. I don't think we want to add another workaround to code that already has too many of them. The solution may be to create a dummy placemarker vm_page_t and to insert it into the pagelist just after the current page, after we've locked it and decided we have to do something significant to it. We would then be able to pick the scan up where we left off using the placemarker.
>
> This would allow us to get rid of the restart code entirely, or at least devolve it back into its original design (i.e. something that would not happen very often). Since we already have cache locality of reference for the list node, the placemarker idea ought to be quite fast.
>
> I'll take a crack at implementing the openbsd (or was it netbsd?) partial fsync() code as well, to prevent the update daemon from locking up large files that have lots of dirty pages for long periods of time.

Making the partial fsync would help some people but probably not these folks.

The people getting hit by this are Yahoo! boxes: they have giant areas of NOSYNC mmap'd data, and the issue is that for them the first scan through the loop always sees dirty data that needs to be written out. I think they also need a _lot_ more than 32 pages cleaned per pass because all of their pages need laundering.
Perhaps if you detected how often the routine was being called you could slowly raise max_page_launder to compensate, and lower it after some time without a shortage. Perhaps adding a quarter of 'should_have_laundered' to maxlaunder for a short interval.

It might be wise to switch to a 'launder mode' if this sort of usage pattern is detected and figure out some better number to use than 32; I was hoping you'd have some suggestions for a heuristic to detect this along the lines of what you have implemented in bufdaemon.

What you could also do is count the number of pages that could/should have been laundered during the first pass, and check whether it exceeds a certain threshold, discounting the pages that were freed via:

	if (m->object->ref_count == 0) {

and:

	if (m->valid == 0) {

and:

	} else if (m->dirty == 0) {

Basically, if maxlaunder is equal to zero and we miss all those tests, you might want to bump up a counter, and if it exceeds a threshold, immediately start rescanning and double(?) maxlaunder.

-Alfred
Re: vm_pageout_scan badness
Ouch. The original VM code assumed that pages would not often be ripped out from under the pagedaemon, so it felt free to restart whenever. I think you are absolutely correct in regards to the clustering code causing nearby-page ripouts.

I don't have much time available, but let me take a crack at the problem tonight. I don't think we want to add another workaround to code that already has too many of them. The solution may be to create a dummy placemarker vm_page_t and to insert it into the pagelist just after the current page, after we've locked it and decided we have to do something significant to it. We would then be able to pick the scan up where we left off using the placemarker.

This would allow us to get rid of the restart code entirely, or at least devolve it back into its original design (i.e. something that would not happen very often). Since we already have cache locality of reference for the list node, the placemarker idea ought to be quite fast.

I'll take a crack at implementing the openbsd (or was it netbsd?) partial fsync() code as well, to prevent the update daemon from locking up large files that have lots of dirty pages for long periods of time.

					-Matt

:
:Matt, I'm not sure if Paul mailed you yet so I thought I'd take the
:initiative of bugging you about some reported problems (lockups)
:when dealing with machines that have substantial MAP_NOSYNC'd
:data along with a page shortage.
:
:What seems to happen is that vm_pageout_scan (src/sys/vm/vm_pageout.c
:line 618) keeps rescanning the inactive queue.
:
:My guess is that it just doesn't expect someone to have hosed themselves
:by having so many pages that need laundering (maxlaunder/launder_loop),
:along with the fact that the comment and code here are doing the wrong
:thing for the situation:
:
:	/*
:	 * Figure out what to do with dirty pages when they are encountered.
:	 * Assume that 1/3 of the pages on the inactive list are clean. If
:	 * we think we can reach our target, disable laundering (do not
:	 * clean any dirty pages). If we miss the target we will loop back
:	 * up and do a laundering run.
:	 */
:
:	if (cnt.v_inactive_count / 3 > page_shortage) {
:		maxlaunder = 0;
:		launder_loop = 0;
:	} else {
:		maxlaunder =
:		    (cnt.v_inactive_target > max_page_launder) ?
:		    max_page_launder : cnt.v_inactive_target;
:		launder_loop = 1;
:	}
:
:The problem is that there's a chance that nearly all the pages on
:the inactive queue need laundering, and the loop that starts with
:the label 'rescan0:' is never able to clean enough pages before
:stumbling across a block that has moved to another queue. This
:forces a jump back to the 'rescan0' label with effectively nothing
:accomplished.
:
:Here's a patch that may help; it's untested, but shows what I'm
:hoping to achieve, which is to force a maximum on the number of times
:rescan0 will be jumped to while not laundering.
:...
:
:I'm pretty sure that there's yet another problem here: when paging
:out a vnode's pages the output routine attempts to cluster them.
:This could easily make 'next' point to a page that is cleaned and
:put on the FREE queue; when the loop then tests it for
:'m->queue != PQ_INACTIVE' it forces 'rescan0' to happen.
:
:I think one could fix this by keeping a pointer to the previous
:page; then the 'goto rescan0;' test might become something like
:this:
:...
:
:Of course we need to set 'prev' properly, but I need to get back
:to my database stuff right now. :)
:
:Also... I wish there was a good heuristic to make max_page_launder
:a bit more adaptive; you've done some wonders with bufdaemon so
:I'm wondering if you had some ideas about that.
:
:--
:-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
:"I have the heart of a child; I keep it in a jar on my desk."
vm_pageout_scan badness
Matt, I'm not sure if Paul mailed you yet, so I thought I'd take the initiative of bugging you about some reported problems (lockups) when dealing with machines that have substantial MAP_NOSYNC'd data along with a page shortage.

What seems to happen is that vm_pageout_scan (src/sys/vm/vm_pageout.c line 618) keeps rescanning the inactive queue.

My guess is that it just doesn't expect someone to have hosed themselves by having so many pages that need laundering (maxlaunder/launder_loop), along with the fact that the comment and code here are doing the wrong thing for the situation:

	/*
	 * Figure out what to do with dirty pages when they are encountered.
	 * Assume that 1/3 of the pages on the inactive list are clean. If
	 * we think we can reach our target, disable laundering (do not
	 * clean any dirty pages). If we miss the target we will loop back
	 * up and do a laundering run.
	 */

	if (cnt.v_inactive_count / 3 > page_shortage) {
		maxlaunder = 0;
		launder_loop = 0;
	} else {
		maxlaunder =
		    (cnt.v_inactive_target > max_page_launder) ?
		    max_page_launder : cnt.v_inactive_target;
		launder_loop = 1;
	}

The problem is that there's a chance that nearly all the pages on the inactive queue need laundering, and the loop that starts with the label 'rescan0:' is never able to clean enough pages before stumbling across a block that has moved to another queue. This forces a jump back to the 'rescan0' label with effectively nothing accomplished.

Here's a patch that may help; it's untested, but shows what I'm hoping to achieve, which is to force a maximum on the number of times rescan0 will be jumped to while not laundering.
Index: vm_pageout.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/vm_pageout.c,v
retrieving revision 1.151.2.4
diff -u -u -r1.151.2.4 vm_pageout.c
--- vm_pageout.c	2000/08/04 22:31:11	1.151.2.4
+++ vm_pageout.c	2000/10/24 07:31:39
@@ -618,7 +618,7 @@
 vm_pageout_scan()
 {
 	vm_page_t m, next;
-	int page_shortage, maxscan, pcount;
+	int page_shortage, maxscan, maxtotscan, pcount;
 	int addl_page_shortage, addl_page_shortage_init;
 	int maxlaunder;
 	int launder_loop = 0;
@@ -672,13 +672,23 @@
 	 * we have scanned the entire inactive queue.
 	 */
 
+rescantot:
+	/* make sure we don't hit rescan0 too many times */
+	maxtotscan = cnt.v_inactive_count;
 rescan0:
 	addl_page_shortage = addl_page_shortage_init;
 	maxscan = cnt.v_inactive_count;
+	if (maxtotscan < 1) {
+		maxlaunder =
+		    (cnt.v_inactive_target > max_page_launder) ?
+		    max_page_launder : cnt.v_inactive_target;
+	}
 	for (m = TAILQ_FIRST(&vm_page_queues[PQ_INACTIVE].pl);
 	     m != NULL && maxscan-- > 0 && page_shortage > 0;
 	     m = next) {
 
+		--maxtotscan;
+
 		cnt.v_pdpages++;
 
 		if (m->queue != PQ_INACTIVE) {
@@ -930,7 +940,7 @@
 		maxlaunder =
 		    (cnt.v_inactive_target > max_page_launder) ?
 		    max_page_launder : cnt.v_inactive_target;
-		goto rescan0;
+		goto rescantot;
 	}
 
 	/*

(still talking about vm_pageout_scan()):

I'm pretty sure that there's yet another problem here: when paging out a vnode's pages the output routine attempts to cluster them. This could easily make 'next' point to a page that is cleaned and put on the FREE queue; when the loop then tests it for 'm->queue != PQ_INACTIVE' it forces 'rescan0' to happen.
I think one could fix this by keeping a pointer to the previous page; then the 'goto rescan0;' test might become something like this:

		/*
		 * We keep a back reference just in case the vm_pageout_clean()
		 * happens to clean the page after the one we just cleaned
		 * via clustering; this would make next point to something not
		 * on the INACTIVE queue.  As a stop-gap we keep a pointer
		 * to the previous page and attempt to use it as a fallback
		 * starting point before actually starting at the head of the
		 * INACTIVE queue again.
		 */
		if (m->queue != PQ_INACTIVE) {
			if (prev != NULL && prev->queue == PQ_INACTIVE) {
				m = TAILQ_NEXT(prev, pageq);
				if (m == NULL || m->queue != PQ_INACTIVE)
					goto rescan0;
			} else {
				goto rescan0;
			}
		}

Of course we need to set 'prev' properly, but I need to get back to my database stuff right now. :)

Also... I wish there was a good heuristic to make max_page_launder a bit more adaptive; you've done some wonders with bufdaemon so