Tony Finch wrote: [ ... Terry describes non-blocking I/O on page-not-present on SVR4, and how it behaves better than BSD ... ]
> How does it deal with the situation that the machine's
> working set has exceeded memory?  If the web server is dealing
> with lots of concurrent connections it may have to block in
> poll() waiting for (say) a thousand pages to be brought in,
> then in the process of dealing with them after the poll()
> returns some of the pages may get re-used for something else,
> making read() on a "readable" fd return EWOULDBLOCK again.

It doesn't deal with this at all.  SVR4 doesn't handle overcommit of the working set at all well.  I've pointed this out many times since 1992: the SVR4 linker maps all the involved object files into its address space, and then seeks all over heck (effectively) to do the relocation and the symbol table fixup.  In the process, it thrashes out all the pages for the X server (among other things, but that's the most noticeable), and you lose control of the system until the link is done.

FreeBSD has a slightly less difficult time with processes which are pigs in this way, mostly because it has a unified VM and buffer cache, so the contention between VM and buffer cache allocations is no longer there.  However, a "correctly" written program, designed to exercise the code path for the degenerate case, will... well, exercise the code path for the degenerate case.

SVR4 dealt with this issue by throwing CPU at it: it has modular scheduler classes, and one of the classes it provides is called "fixed", where a certain percentage of the CPU is dedicated to a task, whether it needs it or not.  Thus the X server runs in the "fixed" class, it thrashes the pages it needs back in during the time allotted to it, and the net effect is that when you move the mouse, the cursor wiggles, just like it's supposed to.  This approach works for SVR4 _because_ its VM and buffer cache are not unified, and _cannot_ work for FreeBSD, because they are: FreeBSD can't attribute the memory demand back to the demander.
The correct fix for this problem is to set a high watermark for the amount of available system memory, and a high watermark per vnode.  When you hit the high watermark for the system, you are in a resource starvation situation; knowing this, if you are then asked to page a page in on a vnode, you check its page count, and, if it is over the second (per vnode) high watermark, then instead of taking an LRU page from the system, you steal the page from the page list on the vnode instead.

The net effect of this approach is that in starvation situations (and _only_ in starvation situations!), you limit the per vnode working set size.

Obviously, the VMS approach of limiting the per process working set size (via a working set quota) would be better, if you could enforce it, and if you delayed enforcement until starvation set in.  But doing this per process is not possible with a unified VM and buffer cache, unless all file I/O occurs via mmap() rather than kernel read/write calls (not possible to do, because of struct fileops, since not all vnodes are created equal in FreeBSD; sockets are a particularly problematic area).

You could make this approach even more complex, in an attempt to ensure "fairness", by raising the quota on a per reference basis, but that's exploitable.

If we are talking web traffic here, then the enforcement of working set size should probably take into account how close the quota is to the file size: if the quota is 800k, and the file is 801k, then it probably makes sense to give in to the process and load the extra page, to avoid thrashing.  This is probably calculable as a percentage of the remaining system resources, once the system is over the high watermark, but below total starvation.

Realize that the normal approach to this problem is to simply trust LRU and the locality of reference model.  I don't think that anything you can think of (short of packing in more RAM) can possibly prevent at least _some_ elbow in the performance at starvation.
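To make the decision path concrete, here is a rough C sketch of the reclaim policy described above.  Everything in it is hypothetical, not an actual FreeBSD kernel interface: the names (`system_starved`, `choose_victim`), the 90% system watermark, and the `slack_pages` term, which stands in for the "800k quota, 801k file" give-in case.

```c
#include <assert.h>
#include <stdbool.h>

enum victim { STEAL_GLOBAL_LRU, STEAL_FROM_VNODE };

/* System-wide high watermark: starvation begins when 90% or more of
 * physical pages are in use.  Integer math only, no floating point. */
bool
system_starved(unsigned long pages_in_use, unsigned long pages_total)
{
	return (pages_in_use * 10 >= pages_total * 9);
}

/*
 * Decide where to reclaim a page from when a vnode needs one paged in.
 * Only under system starvation do we consider limiting the vnode's
 * working set; "slack_pages" models the case where a file is only
 * barely over its quota, and giving in avoids thrashing.
 */
enum victim
choose_victim(unsigned long pages_in_use, unsigned long pages_total,
    unsigned long vnode_pages, unsigned long vnode_quota,
    unsigned long file_pages, unsigned long slack_pages)
{
	if (!system_starved(pages_in_use, pages_total))
		return (STEAL_GLOBAL_LRU);	/* normal case: trust global LRU */
	if (vnode_pages <= vnode_quota)
		return (STEAL_GLOBAL_LRU);	/* this vnode is under quota */
	if (file_pages <= vnode_quota + slack_pages)
		return (STEAL_GLOBAL_LRU);	/* file barely over quota: give in */
	return (STEAL_FROM_VNODE);		/* limit this vnode's working set */
}
```

Note that the per vnode limit only ever takes effect behind the first check, which is the "(and _only_ in starvation situations!)" property.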
> But on the other hand the OS can't lock the pages in memory
> until they are read, since passing the fd to poll() isn't
> a promise to read, since this may lead to DOS attacks, or
> alternatively processes being unexpectedly killed for hitting
> RLIMIT_MEMLOCK.

Yes.  I _am_ assuming that the access model used by your application is relatively uniform.  If everything doesn't go through the same access path, all bets are off.  This is going to be true of any asymmetric application under resource starvation conditions, though, so I view this as a problem of you shooting yourself in the foot: it's your foot, and you can do what you want with it, including shooting it.

This should mean that it's robust in the face of a DOS attack (given that the code path is uniform), but not robust in the face of being badly implemented.  I'd have to say that there, too, you have no defense, but since it's your own fault, "as ye sow, so shall ye reap".  8-).

> I can't see a good way of avoiding these semantic problems
> without changing to a completely different kernel API like KSE.

No.  A minor change in the working set management algorithm, away from a simple global LRU, and then only in the starvation case, can make a significant positive difference.  Personally, I'd use a simple watermark of 90% using integer math, and then enforce a per vnode LRU policy, instead of the global LRU table, for any vnode over its quota.  There's another advantage to this, in that it avoids the global LRU lock, once it starts enforcing on a per vnode basis.
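A minimal sketch of the per vnode LRU structure the last paragraph is gesturing at, with all names hypothetical: each vnode keeps its own resident-page list, MRU at the head, and a vnode over quota reclaims from its own tail, so no global LRU lock need be taken on that path (locking itself is not modeled here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-vnode LRU: doubly linked list of resident pages,
 * most recently used at the head, reclaim victim taken from the tail. */
struct page {
	int		id;
	struct page	*prev, *next;
};

struct vnode_lru {
	struct page	*head;		/* MRU end */
	struct page	*tail;		/* LRU end: reclaim victim */
	unsigned long	count;		/* compared against the vnode quota */
};

/* A freshly faulted-in (or referenced) page goes to the MRU end. */
void
lru_insert_mru(struct vnode_lru *l, struct page *p)
{
	p->prev = NULL;
	p->next = l->head;
	if (l->head != NULL)
		l->head->prev = p;
	else
		l->tail = p;
	l->head = p;
	l->count++;
}

/* Steal the least recently used page from this vnode's own list. */
struct page *
lru_steal(struct vnode_lru *l)
{
	struct page *p = l->tail;

	if (p == NULL)
		return (NULL);
	l->tail = p->prev;
	if (l->tail != NULL)
		l->tail->next = NULL;
	else
		l->head = NULL;
	l->count--;
	return (p);
}
```

The point of the sketch is the locking consequence: the only state touched is the vnode's own list, so contention in the starvation path is per vnode rather than on one global table.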
This would add three sysctls:

1)	The global watermark level
2)	The per vnode weighting algorithm selection, for deriving the
	per vnode watermark
3)	Enable/disable

It would be relatively simple to make the per vnode watermark dynamic, based on the number of vnodes in use in the system, or the number of "eligible vnodes", if you want to make a special exception for executables, since they must be in core in order to create demand in the first place (but by that token, the current global LRU policy is broken, since it takes no notice of executables as special entities).

-- Terry