Re: the new VMt
Hello,

On Wed, Sep 27, 2000 at 01:55:52PM +0100, Hugh Dickins wrote:
> On Wed, 27 Sep 2000, Andrey Savochkin wrote:
> >
> > It's a waste of resources to reserve memory+swap for the case that every
> > running process decides to modify libc code (and, thus, should receive its
> > private copy of the pages). A real waste!
>
> A real waste indeed, but a bad example: libc code is mapped read-only,
> so nobody would recommend reserving memory+swap for private mods to it.
> Of course, a process might choose to mprotect it writable at some time,
> that would be when to refuse if overcommitted.

Returning an error from an mprotect() call for private mappings?
It wouldn't be what people expect...

The other example where overcommit makes sense is fork() (not vfork)
followed by an immediate exec in one of the threads.

Best regards
Andrey

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Wed, 27 Sep 2000, Andrey Savochkin wrote:
>
> It's a waste of resources to reserve memory+swap for the case that every
> running process decides to modify libc code (and, thus, should receive its
> private copy of the pages). A real waste!

A real waste indeed, but a bad example: libc code is mapped read-only,
so nobody would recommend reserving memory+swap for private mods to it.
Of course, a process might choose to mprotect it writable at some time,
that would be when to refuse if overcommitted.

Hugh
Re: the new VMt
Horst von Brand wrote:
> I'd call emacs consistently not being able to start an ls on a 16Mb machine
> much worse than a surprise...
>
> Hint: Think about how emacs would go about doing that...

vfork ;-)

-- Jamie
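A minimal sketch of the vfork hint, assuming a POSIX system: a large parent starts a small helper such as ls without asking the kernel to account for a second copy of its own address space. The name `spawn_and_wait` is hypothetical and chosen only for illustration; it is not taken from emacs or from any message in this thread.

```c
#define _DEFAULT_SOURCE   /* exposes vfork() in glibc under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with arguments argv and return its
 * exit status, or -1 on error.  vfork() lends the parent's address space
 * to the child until the child calls exec or _exit, so nothing has to be
 * reserved for a private copy of the parent's pages.
 */
int spawn_and_wait(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;              /* no child could be created */
    if (pid == 0) {
        execvp(argv[0], argv);  /* only exec or _exit are safe in the child */
        _exit(127);             /* reached only if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With plain fork() under strict no-overcommit accounting, the same call would need room for a full copy of the parent's writable pages, even though the child replaces them immediately in exec.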
Re: the new VMt
On Tue, Sep 26, 2000 at 11:45:02AM -0600, Erik Andersen wrote:
[snip]
> "Overcommit" to me is the same thing as Mark Hemment stated earlier in this
> thread -- the "fact that the system has over committed its memory resources.
> ie. it has sold too many tickets for the number of seats in the plane, and all
> the passengers have turned up." Basically any case where too many tickets
> have been sold (applied to the entire system, and all subsystems).
[snip]
> If the Beancounter patch lets the kernel count "passengers", classify them
> (with user hinting) so the pilot and flight attendants (init, X, or whatever)
> always stay on the plane, and has some sane predictable mechanism for booting
> non-privileged passengers, then I am all for it.

That's exactly what I'm doing.

> How does one provide the kernel with hints as to which processes are sacred?
> Where does one find this beancounter patch? How much weight does it add to
> the kernel?

ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html

The current version has some drawbacks, and one of them is performance.
Memory accounting is implemented as a kernel thread which walks the page
tables of processes (similar to kswapd), and it appears to consume 1-5% of
CPU (depending on the number of processes). I consider that unacceptable, and
have started reimplementing the process memory accounting from the beginning.

Best regards
Andrey
Re: the new VMt
Hello,

On Tue, Sep 26, 2000 at 01:10:30PM +0100, Mark Hemment wrote:
>
> On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> > So you have run out of physical memory --- what do you do about it?
>
> Why let the system get into the state where it is necessary to kill a
> process?
> Per-user/task resource counters should prevent unprivileged users from
> soaking up too many resources. That is the DoS protection.
>
[snip]
> It is possible to do true, system wide, resource counting of physical
> memory and swap space, and to deny a fork() or mmap() which would cause
> over committing of memory resources if everyone cashed in their
> requirements.
[snip]

People use overcommitting not because they are fans of the idea.
Overcommitting simply is the _efficient_ way of resource sharing.

It's a waste of resources to reserve memory+swap for the case that every
running process decides to modify libc code (and, thus, should receive its
private copy of the pages). A real waste!

I always agree to take the risk of some applications being killed in such
a case of all processes turning crazy.

The approach I believe in is:
 - ensure that accidental or intentional madness of applications of one user
   may cause only limited damage to other users; and
 - introduce a way to tell the kernel that some applications should be saved
   longer than others when troubles begin, and ways to set up some guaranteed
   amounts for important processes.

Certainly, a lot of processes may consume more than their guarantee until bad
things start to happen. Then the rules of user protection and killing order
apply.

That's how I develop the resource control in the beancounter patch
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html#s7

Best regards
Andrey
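One way to picture "guarantees plus a killing order" is the sketch below. It is an illustration under stated assumptions, not the beancounter patch's actual algorithm: the struct `task_account`, its fields, and `pick_victim` are all hypothetical names, and the tie-breaking rule (largest excess over the guarantee) is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical per-task accounting record (not from the beancounter patch). */
struct task_account {
    unsigned long used_pages;        /* pages currently charged to the task */
    unsigned long guaranteed_pages;  /* amount the task has been promised */
    int importance;                  /* higher means "save longer" */
};

/*
 * Return the index of the task to kill first, or -1 if every task is
 * still within its guarantee.  Tasks inside their guarantee are never
 * chosen; among the rest, the least important goes first, and ties are
 * broken by the largest excess over the guarantee.
 */
int pick_victim(const struct task_account *t, size_t n)
{
    int victim = -1;
    unsigned long victim_over = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].used_pages <= t[i].guaranteed_pages)
            continue;  /* protected by its guarantee */

        unsigned long over = t[i].used_pages - t[i].guaranteed_pages;
        if (victim < 0 ||
            t[i].importance < t[victim].importance ||
            (t[i].importance == t[victim].importance && over > victim_over)) {
            victim = (int)i;
            victim_over = over;
        }
    }
    return victim;
}
```

The point of the design is that the outcome is predictable: a process that stays within its guarantee can rely on not being the one that is killed.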
Re: the new VMt
In message <[EMAIL PROTECTED]> you write:
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a multi-
> page allocation and that the MM layer should not try forever to handle it.
>
> In fact, that's independent of whether it is a multi-page allocation or
> not. It might be something like __GFP_SOFT - you could use it with single
> pages too.

That'd be a lovely interface, now wouldn't it? *yecch*

Please consider at least:

	/* Never fails. */
	#define trivial_kmalloc(s) \
		((void)((s) > PAGE_SIZE ? bad_size_##s : __kmalloc((s), GFP_KERNEL)))

	/* Can fail */
	#define kmalloc(s, pri) __kmalloc((s), (pri)|__GFP_SOFT)

Rusty.
--
Hacking time.
Re: the new VMt
Erik Andersen <[EMAIL PROTECTED]> said:

[...]

> Another approach would be to let user space turn off overcommit.
> That way, user space can be assured there will be no surprises...

I'd call emacs consistently not being able to start an ls on a 16Mb machine
much worse than a surprise...

Hint: Think about how emacs would go about doing that...

Also, to ensure there is /no/ overcommit /anywhere/ amounts to a rigorous
audit of the whole kernel, and of each single patchlet that goes in. You are
certainly welcome to do the job...
--
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616
Re: the new VMt
Hello,

> > Another approach would be to let user space turn off overcommit.
>
> No. Overcommit only applies to pageable memory. Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.
>

In addition to beancounter, do you think pageable page tables are
something we want to tackle in 2.5.x? 4MB page mappings on x86 could
be cool too, as an option...

--
Eric Lowe
FibreChannel Software Engineer, Systran Corporation
[EMAIL PROTECTED]
Re: the new VMt
On Tue Sep 26, 2000 at 06:08:20PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:
>
> > Another approach would be to let user space turn off overcommit.
>
> No. Overcommit only applies to pageable memory. Beancounter is
> really needed for non-pageable resources such as page tables and
> mlock()ed pages.

I think we do agree here, though we are having problems with semantics.
"Overcommit" to me is the same thing as Mark Hemment stated earlier in this
thread -- the "fact that the system has over committed its memory resources.
ie. it has sold too many tickets for the number of seats in the plane, and all
the passengers have turned up." Basically any case where too many tickets
have been sold (applied to the entire system, and all subsystems).

To extend the airplane metaphor a bit past credibility... When an airline
sells too many tickets, it bribes people to get off the plane. The kernel,
in the same spot, tends to fall over, or starts kicking off pilots and
flight attendants.

If the Beancounter patch lets the kernel count "passengers", classify them
(with user hinting) so the pilot and flight attendants (init, X, or whatever)
always stay on the plane, and has some sane predictable mechanism for booting
non-privileged passengers, then I am all for it.

How does one provide the kernel with hints as to which processes are sacred?
Where does one find this beancounter patch? How much weight does it add to
the kernel?

 -Erik

--
Erik B. Andersen email: [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--
Re: the new VMt
Hi,

On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:

> Another approach would be to let user space turn off overcommit.

No. Overcommit only applies to pageable memory. Beancounter is
really needed for non-pageable resources such as page tables and
mlock()ed pages.

Cheers,
 Stephen
Re: the new VMt
On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:
>
> > Operating systems cannot make more memory appear by magic.
> > The question is really about the best strategy for dealing with low memory. In my
> > opinion, the OS should not try to out-think physical limitations. Instead, the OS
> > should take as little space as possible and provide the ability for user level
> > clever management of space. In a truly embedded system, there can easily be a user level
> > root process that watches memory usage and prevents DOS attacks -- if the OS provides
> > settable enforced quotas etc.
>
> Agreed, absolutely. The beancounter is one approach to those quotas,
> and has the advantage of allowing per-user as well as per-process
> quotas.

Another approach would be to let user space turn off overcommit.
That way, user space can be assured there will be no surprises...

 -Erik

--
Erik B. Andersen email: [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--
Re: the new VMt
Hi,

On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:

> Operating systems cannot make more memory appear by magic.
> The question is really about the best strategy for dealing with low memory. In my
> opinion, the OS should not try to out-think physical limitations. Instead, the OS
> should take as little space as possible and provide the ability for user level
> clever management of space. In a truly embedded system, there can easily be a user level
> root process that watches memory usage and prevents DOS attacks -- if the OS provides
> settable enforced quotas etc.

Agreed, absolutely. The beancounter is one approach to those quotas,
and has the advantage of allowing per-user as well as per-process
quotas.

--Stephen
Re: the new VMt
On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote:
> On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:
> >
> > > all of the pending requests just as long as they are serialised, is
> > > this a problem?
> >
> > I think you are solving the wrong problem. On a small memory machine, the kernel,
> > utilities, and applications should be configured to use little memory.
> > BusyBox is better than BeanCount.
> >
>
> Granted that smaller apps can help -- for a particular workload. But while I
> am very partial to BusyBox (in fact I am about to cut a new release) I can
> assure you that OOM is easily possible even when your user space is tiny. I do
> it all the time. There are mallocs in busybox and when under memory pressure,
> the kernel still tends to fall over...

Operating systems cannot make more memory appear by magic.
The question is really about the best strategy for dealing with low memory. In my
opinion, the OS should not try to out-think physical limitations. Instead, the OS
should take as little space as possible and provide the ability for user level
clever management of space. In a truly embedded system, there can easily be a user level
root process that watches memory usage and prevents DOS attacks -- if the OS provides
settable enforced quotas etc.

--
-
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> > > >
> > > > I'm not too sure of what you have in mind, but if it is
> > > > "process creates vast virtual space to generate many page table
> > > > entries -- using mmap"
> > > > the answer is, virtual address space quotas and mmap should kill
> > > > the process on low mem for page tables.
> > >
> > > No. Page tables are not freed after munmap (and for good reason). The
> > > counting of page table "beans" is critical.
> >
> > I've seen the assertion before, reasons would be interesting.
>
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.

	if (over_allocated_page_tables(task->uid))
		return ENOMEM;

makes sense in "fork".

I guess the argument here is not about whether accounting is good, it's about
where the accounting should be done. To me the alternatives of

	if (preallocate_pages(page_table_size_for_this_process()) == -1)
		return error;
	then actually allocate, making sure to adjust counts if some
	other error turns up

with something taking care of how the pre-allocation works while we are
sleeping waiting for possibly unrelated resources, or

	just kmalloc, with kmalloc magically juggling resources in some
	safe way

seem less clear.

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from doing low-level page allocation at too
high a level? That is, instead of select doing

	get_free_page

maybe it should do

	get_per_process_page(myprocess)

or even

	get_per_process_file_use_page(myprocess)

Then we could have a config-optional per-process pinned page accounting with
the possibility of doing something sensible in a user-space daemon when memory
is low.

>
> --Stephen

--
-
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:
> Beancounter is a framework for user-level accounting. _What_ you
> account is up to the callers. Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.

Per-user, system-wide and per-process quotas are one thing; a generic
pre-allocate-and-then-allocate scheme seems to me to be an error-prone way of
getting there. In particular, I think it is dangerous to have a pre-count that
is only approximately tethered to the thing it is counting -- in the memory
allocation we were discussing, you need to make sure that the pre-allocations
are for memory that is really going to be allocated soon and that the count is
later correlated with free in some way.

So, to me, a quota bounded

	allocate_page_table(process_id)

makes much more sense than pre-allocate counting, or, even worse, a "smart"
kmalloc that never fails. If the problem is unaccounted-for page tables, then
account for page tables and return a -EYOURPROCESSISOUTOFCONTROL so that the
calling kernel code can take the responsible action.

--
-
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
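The quota-bounded allocator described above can be sketched in user space as follows. This is only an illustration, assuming a per-user counter and using calloc() as a stand-in for the kernel's page allocator; `struct user_quota`, `PT_PAGE_SIZE`, the function signatures, and the choice of -EDQUOT as the "out of control" error are hypothetical and not taken from any kernel source.

```c
#include <errno.h>
#include <stdlib.h>

#define PT_PAGE_SIZE 4096UL   /* assumed page size for the illustration */

/* Hypothetical per-user accounting record. */
struct user_quota {
    unsigned long pt_pages;      /* page-table pages currently charged */
    unsigned long pt_pages_max;  /* limit for this user */
};

/*
 * Allocate one page-table page on behalf of a user.  The quota check,
 * the charge and the allocation happen in one call, so the count can
 * never drift away from what was really allocated.
 */
int allocate_page_table(struct user_quota *q, void **page)
{
    void *p;

    if (q->pt_pages >= q->pt_pages_max)
        return -EDQUOT;          /* over quota: the caller decides what to do */

    p = calloc(1, PT_PAGE_SIZE); /* stand-in for the real page allocator */
    if (!p)
        return -ENOMEM;

    q->pt_pages++;
    *page = p;
    return 0;
}

/* Free a page-table page and give the charge back in the same step. */
void free_page_table(struct user_quota *q, void *page)
{
    free(page);
    q->pt_pages--;
}
```

Because charge and allocation are tied together, there is no separate pre-count to keep in step with later frees, which is the failure mode the message warns about.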
Re: the new VMt
Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> So you have run out of physical memory --- what do you do about it?

Why let the system get into the state where it is necessary to kill a
process?
Per-user/task resource counters should prevent unprivileged users from
soaking up too many resources. That is the DoS protection.

So an OOM is possibly:
 1) A privileged, legally resource hungry, app(s) has taken all the
    memory. Could be too important to simply kill (it should exit
    gracefully).
 2) Simply too many tasks*(memory-requirements-of-each-task).

Ignoring allocations done by the kernel, the situation comes down to the
fact that the system has over committed its memory resources. ie. it has
sold too many tickets for the number of seats in the plane, and all the
passengers have turned up.
(note, I use the term "memory" and not "physical memory", I'm including
swap space).

Why not protect the system from over committing its memory resources?

It is possible to do true, system wide, resource counting of physical
memory and swap space, and to deny a fork() or mmap() which would cause
over committing of memory resources if everyone cashed in their
requirements.

Named pages (those which came from a file) are the simplest to handle.
If dirty, they already have allocated backing store, so we know there is
somewhere to put them when memory is low.
How many named pages need to be held in physical memory at any one
instant for the system to function? Only a few, although if you reach
that state, the system will be thrashing itself to death.

Anonymous and copied (those faulted from a write to a writable, i.e.
PROT_WRITE, MAP_PRIVATE mapping) pages can be stored in either physical
memory or on swap. To avoid getting into the OOM situation, when these
mappings are created the system needs to check that it has (and will have,
in the future) space for every page that _could_ be allocated for the
mapping - ie. work out the worst case (including page-tables).
This space could be on swap or in physical memory. It is the accounting
which needs to be done, not the actual allocation (and not even the
decision of where to store the page when allocated - that is made much
later, when it needs to be). If a machine has 2GB of RAM, a 1MB swap, and
1GB of dirty anon or copied pages, that is fine.
I'm stressing this point, as the scheme of reserving space for an (as
yet) unallocated page is sometimes referred to as "eager swap allocation"
(or some such similar term). This is confusing. People then start to
believe they need backing store for each anon/copied page. You don't.
You simply need somewhere to store it, and that could be a physical page.
It is all in the accounting. :)

Allocations made by the kernel, for the kernel, are (obviously) pinned
memory. To ensure kernel allocations do not completely exhaust physical
memory (or cause physical memory to be over committed if the worst case
occurs), they need to be limited.
How to limit?
As a first guess (and this is only a guess):
 1) don't let kernel allocations exceed 25% of physical memory (tunable)
 2) don't let kernel allocations succeed if they would cause over
    commitment.
Both conditions would need to pass before an allocation could succeed.
This does need much more thought. Should some tuning be per subsystem?
I don't know.
Perhaps 1) isn't needed. I'm not sure.

Because of 2), the total physical memory accounted for anon/copied
pages needs to have a high watermark. Otherwise, in the accounting, the
system could allow too much physical memory to be reserved for these
types of pages (there doesn't need to be space on swap for each
anon/copied page, just space somewhere - a watermark would prevent too
much of this being physical memory). Note, this doesn't mean start
swapping earlier - remember, this is accounting of anon/copied pages to
avoid over commitment.
For named pages, the page cache needs to have a reserved number of
physical pages (ie. how small is it allowed to get, before pruning stops).
Again, these reserved pages are in the accounting.

mlock()ed pages also need accounting, to prevent over commitment of
physical memory. All fun.

The disadvantages:

1) Extra code to do the accounting.
   This shouldn't be too heavy.

2) Writable mmap(MAP_ANON)/mmap(MAP_PRIVATE) calls can fail more readily.
   Programs which expect to memory map areas (which would create
   anon/copied pages when written to) will see an increased failure
   rate in mmap(). This can be very annoying, especially when you know
   the mapping will be used sparsely.
   One solution is to add a new mmap() flag, which tells the kernel
   to let this mmap() exceed the actual resources.
   With such a flag, the mmap() will be allowed, but the task should
   expect to be killed if memory is exhausted. (It could be
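The core of the accounting described in this message (reserve the worst case when a mapping is created, store nothing yet) can be sketched as below. It is a simplified illustration only: `struct commit_state`, `commit_reserve` and `commit_release` are hypothetical names, and the sketch ignores the kernel-allocation limit, the watermarks and the page-cache reserve that the message also calls for.

```c
#include <stdbool.h>

/* Hypothetical system-wide commit accounting, all values in pages. */
struct commit_state {
    unsigned long ram_pages;        /* physical memory */
    unsigned long swap_pages;       /* swap space */
    unsigned long committed_pages;  /* worst-case anon/copied pages promised */
};

/*
 * Reserve the worst case for a new mapping.  Returning false means the
 * mmap() or fork() should be refused.  Nothing is allocated here and no
 * decision is made about RAM versus swap; only the count is checked.
 */
bool commit_reserve(struct commit_state *s, unsigned long pages)
{
    unsigned long capacity = s->ram_pages + s->swap_pages;

    /* committed_pages never exceeds capacity, so this cannot underflow */
    if (pages > capacity - s->committed_pages)
        return false;
    s->committed_pages += pages;
    return true;
}

/* Give back a reservation when the mapping goes away. */
void commit_release(struct commit_state *s, unsigned long pages)
{
    s->committed_pages -= pages;
}
```

With 4KB pages, the message's example of 2GB of RAM, 1MB of swap and 1GB of dirty anonymous pages corresponds to reserving 262144 pages against a capacity of 524544, which succeeds even though swap alone could never hold them.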
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> > >
> > > I'm not too sure of what you have in mind, but if it is
> > > "process creates vast virtual space to generate many page table
> > > entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill
> > > the process on low mem for page tables.
> >
> > No. Page tables are not freed after munmap (and for good reason). The
> > counting of page table "beans" is critical.
>
> I've seen the assertion before, reasons would be interesting.

Reason 1: under DoS attack, you want to target not the process using
the most resources, but the *user* using the most resources (else a
fork-bomb style attack can work around your OOM-killer algorithms).

Reason 2: if you've got tasks stuck in low-level page allocation
routines, then you can't immediately kill -9 them, so reactive OOM
killing always has vulnerabilities --- to be robust in preventing
resource exhaustion you want limits on the use of those resources
before they are exhausted --- the necessary accounting being part of
what we refer to as "beancounter".

--Stephen
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 03:07:44PM -0600, [EMAIL PROTECTED] wrote:
> On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote:
> > > I'm not too sure of what you have in mind, but if it is
> > > "process creates vast virtual space to generate many page table
> > > entries -- using mmap"
> > > the answer is, virtual address space quotas and mmap should kill
> > > the process on low mem for page tables.
> >
> > Those quotas being exactly what beancounter is
>
> But that is a function specific counter, not a counter in the
> alloc code.

Beancounter is a framework for user-level accounting. _What_ you
account is up to the callers. Maybe this has been a miscommunication,
but beancounter is all about allowing callers to account for stuff
before allocation, not about having the page allocation functions
themselves enforce quotas.

--Stephen
Re: the new VMt
> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> On 26 Sep 2000, Jes Sorensen wrote:
>> 9.5KB blocks is common for people running Gigabit Ethernet with
>> Jumbo frames at least.

Ingo> yep, although this is more of a Linux limitation, the cards
Ingo> themselves are happy to DMA fragmented buffers as well. (sans
Ingo> some small penalty per new fragment.)

Hence the reason I have been pushing for the kiobufifying of the skbs
;-) It's even more important for HIPPI with the 65280 bytes MTU.

Jes
Re: the new VMt
On 26 Sep 2000, Jes Sorensen wrote:

> 9.5KB blocks is common for people running Gigabit Ethernet with Jumbo
> frames at least.

yep, although this is more of a Linux limitation, the cards themselves are
happy to DMA fragmented buffers as well. (sans some small penalty per new
fragment.)

	Ingo
Re: the new VMt
> "Ingo" == Ingo Molnar <[EMAIL PROTECTED]> writes:

Ingo> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
>>
>> You're right. That's why it's a waste to have so many order in the
>> buddy allocator. [...]

Ingo> yep, i agree. I'm not sure what the biggest allocation is, some
Ingo> drivers might use megabytes or contiguous RAM?

9.5KB blocks is common for people running Gigabit Ethernet with Jumbo
frames at least.

Jes
Re: the new VMt
Hello, Another approach would be to let user space turn off overcommit. No. Overcommit only applies to pageable memory. Beancounter is really needed for non-pageable resources such as page tables and mlock()ed pages. In addition to beancounter, do you think pageable page tables are something we want to tackle in 2.5.x? 4MB page mappings on x86 could be cool too, as an option... -- Eric Lowe FibreChannel Software Engineer, Systran Corporation [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
Erik Andersen [EMAIL PROTECTED] said:

[...]

> Another approach would be to let user space turn off overcommit. That
> way, user space can be assured there will be no surprises...

I'd call emacs consistently not being able to start an ls on a 16MB machine much worse than a surprise...

Hint: Think about how emacs would go about doing that...

Also, ensuring there is /no/ overcommit /anywhere/ amounts to a rigorous audit of the whole kernel, and of each single patchlet that goes in. You are certainly welcome to do the job...
--
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616
Re: the new VMt
Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:

> So you have run out of physical memory --- what do you do about it?

Why let the system get into the state where it is necessary to kill a process?

Per-user/task resource counters should prevent unprivileged users from soaking up too many resources. That is the DoS protection.

So an OOM is possibly:

1) A privileged, legally resource hungry, app(s) has taken all the memory. Could be too important to simply kill (it should exit gracefully).

2) Simply too many tasks*(memory-requirements-of-each-task).

Ignoring allocations done by the kernel, the situation comes down to the fact that the system has over committed its memory resources, ie. it has sold too many tickets for the number of seats in the plane, and all the passengers have turned up. (Note, I use the term "memory" and not "physical memory"; I'm including swap space.)

Why not protect the system from over committing its memory resources?

It is possible to do true, system wide, resource counting of physical memory and swap space, and to deny a fork() or mmap() which would cause over committing of memory resources if everyone cashed in their requirements.

Named pages (those which came from a file) are the simplest to handle. If dirty, they already have allocated backing store, so we know there is somewhere to put them when memory is low. How many named pages need to be held in physical memory at any one instant for the system to function? Only a few, although if you reach that state, the system will be thrashing itself to death.

Anonymous and copied (those faulted from a write to a MAP_PRIVATE mapping with PROT_WRITE) pages can be stored in either physical memory or on swap. To avoid getting into the OOM situation, when these mappings are created the system needs to check that it has (and will have, in the future) space for every page that _could_ be allocated for the mapping - ie. work out the worst case (including page-tables).
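A minimal sketch of that worst-case check, in C, assuming a single system-wide counter kept in pages. The names commit_ledger, commit_reserve and commit_release are hypothetical and are not taken from any kernel tree. Only the bookkeeping changes on a reservation; no page is allocated and no decision is made about where a page will eventually live.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical system-wide ledger; every field is counted in pages. */
struct commit_ledger {
    size_t ram_pages;   /* physical pages usable for anon/copied data */
    size_t swap_pages;  /* pages of swap space */
    size_t committed;   /* worst-case pages already promised to mappings */
};

/*
 * Reserve the worst case for a new mapping: every data page that could be
 * dirtied plus the page tables needed to map them. Returns false when the
 * promise could not be honoured, in which case the caller would fail the
 * mmap() or fork() instead of overcommitting.
 * Invariant assumed: committed <= ram_pages + swap_pages.
 */
static bool commit_reserve(struct commit_ledger *l, size_t data_pages,
                           size_t pagetable_pages)
{
    size_t capacity = l->ram_pages + l->swap_pages;
    size_t want = data_pages + pagetable_pages;

    if (want > capacity - l->committed)
        return false;
    l->committed += want;
    return true;
}

/* Give back a reservation when the mapping goes away. */
static void commit_release(struct commit_ledger *l, size_t pages)
{
    assert(pages <= l->committed);
    l->committed -= pages;
}
```

With 100 pages of RAM and 50 of swap, a 125-page reservation succeeds, a further 30-page one is refused, and a further 25-page one is accepted because it exactly fills the remaining capacity.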
This space could be on swap or in physical memory. It is the accounting which needs to be done, not the actual allocation (and not even the decision of where to store the page when allocated - that is made much later, when it needs to be). If a machine has 2GB of RAM, a 1MB swap, and 1GB of dirty anon or copied pages, that is fine.

I'm stressing this point, as the scheme of reserving space for an (as yet) unallocated page is sometimes referred to as "eager swap allocation" (or some such similar term). This is confusing. People then start to believe they need backing store for each anon/copied page. You don't. You simply need somewhere to store it, and that could be a physical page. It is all in the accounting. :)

Allocations made by the kernel, for the kernel, are (obviously) pinned memory. To ensure kernel allocations do not completely exhaust physical memory (or cause physical memory to be over committed if the worst case occurs), they need to be limited.

How to limit? As a first guess (and this is only a guess):

1) don't let kernel allocations exceed 25% of physical memory (tunable)

2) don't let kernel allocations succeed if they would cause over commitment.

Both conditions would need to pass before an allocation could succeed. This does need much more thought. Should some tuning be per subsystem? I don't know. Perhaps 1) isn't needed. I'm not sure.

Because of 2), the total physical memory accounted for anon/copied pages needs to have a high watermark. Otherwise, in the accounting, the system could allow too much physical memory to be reserved for these types of pages (there doesn't need to be space on swap for each anon/copied page, just space somewhere - a watermark would prevent too much of this being physical memory). Note, this doesn't mean start swapping earlier - remember, this is accounting of anon/copied pages to avoid over commitment.

For named pages, the page cache needs to have a reserved number of physical pages (ie.
how small is it allowed to get, before pruning stops). Again, these reserved pages are in the accounting.

mlock()ed pages need to have accounting also, to prevent over commitment of physical memory.

All fun. The disadvantages:

1) Extra code to do the accounting. This shouldn't be too heavy.

2) mmap(MAP_ANON) and writable mmap(MAP_PRIVATE) can fail more readily. Programs which expect to memory map areas (which would create anon/copied pages when written to) will see an increased failure rate in mmap(). This can be very annoying, especially when you know the mapping will be used sparsely.

One solution is to add a new mmap() flag, which tells the kernel to let this mmap() exceed the actual resources. With such a flag, the mmap() will be allowed, but the task should expect to be killed if memory is exhausted. (It could be
Re: the new VMt
On Tue, Sep 26, 2000 at 10:54:23AM +0100, Stephen C. Tweedie wrote:

> Beancounter is a framework for user-level accounting. _What_ you
> account is up to the callers. Maybe this has been a miscommunication,
> but beancounter is all about allowing callers to account for stuff
> before allocation, not about having the page allocation functions
> themselves enforce quotas.

Per-user and system-wide and per-process quotas are one thing; a pre-allocate-and-then-allocate generic scheme seems to me to be an error-prone way of getting there. In particular, I think it is dangerous to have a pre-count that is approximately tethered to the thing it is counting -- in the memory allocation we were discussing, you need to make sure that the pre-allocations are for memory that is really going to be allocated soon and that it is later correlated with free in some way.

So, to me, a quota bounded allocate_page_table(process_id) makes much more sense than pre-allocate counting, or, even worse, a "smart" kmalloc that never fails. If the problem is unaccounted-for page tables, then account for page tables and return a -EYOURPROCESSISOUTOFCONTROL so that calling kernel code can take the responsible action.

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
On Tue, Sep 26, 2000 at 11:07:36AM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 03:12:50PM -0600, [EMAIL PROTECTED] wrote:
> > > > I'm not too sure of what you have in mind, but if it is
> > > > "process creates vast virtual space to generate many page table
> > > > entries -- using mmap"
> > > > the answer is, virtual address space quotas and mmap should kill
> > > > the process on low mem for page tables.
> > >
> > > No. Page tables are not freed after munmap (and for good reason). The
> > > counting of page table "beans" is critical.
> >
> > I've seen the assertion before, reasons would be interesting.
>
> Reason 1: under DoS attack, you want to target not the process using
> the most resources, but the *user* using the most resources (else a
> fork-bomb style attack can work around your OOM-killer algorithms).

Ok.
        if (over_allocated_page_tables(task->uid))
                return ENOMEM;
makes sense in "fork". I guess the argument here is not about whether accounting is good, it's about where the accounting should be done. To me the alternatives of

        if (preallocate_pages(page_table_size_for_this_process()) == -1)
                return error
        then actually allocate, making sure to adjust counts if some
        other error turns up, and with something taking care of how the
        pre-allocation works while we are sleeping waiting for possibly
        unrelated resources
or
        just kmalloc, with kmalloc magically juggling resources in some
        safe way

seem less clear.

> Reason 2: if you've got tasks stuck in low-level page allocation
> routines, then you can't immediately kill -9 them, so reactive OOM
> killing always has vulnerabilities --- to be robust in preventing
> resource exhaustion you want limits on the use of those resources
> before they are exhausted --- the necessary accounting being part of
> what we refer to as "beancounter".

Doesn't the problem really come from low level page allocation at too high a level? That is, instead of select doing get_free_page, maybe it should do
        get_per_process_page(myprocess)
or even
        get_per_process_file_use_page(myprocess)
Then we could have a config-optional per-process pinned page accounting with the possibility of doing something sensible in a user-space daemon when memory is low.

> --Stephen

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
On Mon, Sep 25, 2000 at 05:14:11PM -0600, Erik Andersen wrote:
> On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote:
> > > all of the pending requests just as long as they are serialised, is
> > > this a problem?
> >
> > I think you are solving the wrong problem. On a small memory machine, the kernel,
> > utilities, and applications should be configured to use little memory.
> > BusyBox is better than BeanCount.
>
> Granted that smaller apps can help -- for a particular workload. But
> while I am very partial to BusyBox (in fact I am about to cut a new
> release) I can assure you that OOM is easily possible even when your
> user space is tiny. I do it all the time. There are mallocs in busybox
> and when under memory pressure, the kernel still tends to fall over...

Operating systems cannot make more memory appear by magic. The question is really about the best strategy for dealing with low memory. In my opinion, the OS should not try to out-think physical limitations. Instead, the OS should take as little space as possible and provide the ability for user level clever management of space. In a truly embedded system, there can easily be a user level root process that watches memory usage and prevents DOS attacks -- if the OS provides settable enforced quotas etc.

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
Hi,

On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:

> Operating systems cannot make more memory appear by magic.
> The question is really about the best strategy for dealing with low memory. In my
> opinion, the OS should not try to out-think physical limitations. Instead, the OS
> should take as little space as possible and provide the ability for user level
> clever management of space. In a truly embedded system, there can easily be a user
> level root process that watches memory usage and prevents DOS attacks -- if the
> OS provides settable enforced quotas etc.

Agreed, absolutely. The beancounter is one approach to those quotas, and has the advantage of allowing per-user as well as per-process quotas.

--Stephen
Re: the new VMt
Hi,

On Tue, Sep 26, 2000 at 11:02:48AM -0600, Erik Andersen wrote:

> Another approach would be to let user space turn off overcommit.

No. Overcommit only applies to pageable memory. Beancounter is really needed for non-pageable resources such as page tables and mlock()ed pages.

Cheers,
 Stephen
Re: the new VMt
On Tue Sep 26, 2000 at 05:04:06PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Sep 26, 2000 at 09:17:44AM -0600, [EMAIL PROTECTED] wrote:
> > Operating systems cannot make more memory appear by magic.
> > The question is really about the best strategy for dealing with low memory. In my
> > opinion, the OS should not try to out-think physical limitations. Instead, the OS
> > should take as little space as possible and provide the ability for user level
> > clever management of space. In a truly embedded system, there can easily be a user
> > level root process that watches memory usage and prevents DOS attacks -- if the
> > OS provides settable enforced quotas etc.
>
> Agreed, absolutely. The beancounter is one approach to those quotas,
> and has the advantage of allowing per-user as well as per-process
> quotas.

Another approach would be to let user space turn off overcommit. That way, user space can be assured there will be no surprises...

 -Erik

--
Erik B. Andersen email: [EMAIL PROTECTED]
--This message was written using 73% post-consumer electrons--
Re: the new VMt
On Mon Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote: > > > all of the pending requests just as long as they are serialised, is > > this a problem? > > I think you are solving the wrong problem. On a small memory machine, the kernel, > utilities, and applications should be configured to use little memory. > BusyBox is better than BeanCount. > Granted that smaller apps can help -- for a particular workload. But while I am very partial to BusyBox (in fact I am about to cut a new release) I can assure you that OOM is easily possible even when your user space is tiny. I do it all the time. There are mallocs in busybox and when under memory pressure, the kernel still tends to fall over... -Erik -- Erik B. Andersen email: [EMAIL PROTECTED] --This message was written using 73% post-consumer electrons-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Mon, Sep 25, 2000 at 04:47:21PM -0400, Benjamin C.R. LaHaise wrote: > On Mon, 25 Sep 2000 [EMAIL PROTECTED] wrote: > > > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > > > my prediction is that if you show me an example of > > > > DoS vulnerability, I can show you fix that does not require bean counting. > > > > Am I wrong? > > > > > > I think so. Page tables are a good example > > > > I'm not too sure of what you have in mind, but if it is > > "process creates vast virtual space to generate many page table > > entries -- using mmap" > > the answer is, virtual address space quotas and mmap should kill > > the process on low mem for page tables. > > No. Page tables are not freed after munmap (and for good reason). The > counting of page table "beans" is critical. I've seen the assertion before, reasons would be interesting. -- - Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Mon, Sep 25, 2000 at 09:46:35PM +0100, Alan Cox wrote: > > I'm not too sure of what you have in mind, but if it is > > "process creates vast virtual space to generate many page table > > entries -- using mmap" > > the answer is, virtual address space quotas and mmap should kill > > the process on low mem for page tables. > > Those quotas being exactly what beancounter is But that is a function specific counter, not a counter in the alloc code. -- - Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
> I'm not too sure of what you have in mind, but if it is
> "process creates vast virtual space to generate many page table
> entries -- using mmap"
> the answer is, virtual address space quotas and mmap should kill
> the process on low mem for page tables.

Those quotas being exactly what beancounter is
Re: the new VMt
On Mon, 25 Sep 2000 [EMAIL PROTECTED] wrote: > On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > > my prediction is that if you show me an example of > > > DoS vulnerability, I can show you fix that does not require bean counting. > > > Am I wrong? > > > > I think so. Page tables are a good example > > I'm not too sure of what you have in mind, but if it is > "process creates vast virtual space to generate many page table > entries -- using mmap" > the answer is, virtual address space quotas and mmap should kill > the process on low mem for page tables. No. Page tables are not freed after munmap (and for good reason). The counting of page table "beans" is critical. -ben - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Mon, Sep 25, 2000 at 09:23:48PM +0100, Alan Cox wrote: > > my prediction is that if you show me an example of > > DoS vulnerability, I can show you fix that does not require bean counting. > > Am I wrong? > > I think so. Page tables are a good example I'm not too sure of what you have in mind, but if it is "process creates vast virtual space to generate many page table entries -- using mmap" the answer is, virtual address space quotas and mmap should kill the process on low mem for page tables. > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ -- - Victor Yodaiken Finite State Machine Labs: The RTLinux Company. www.fsmlabs.com www.rtlinux.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
Hi, On Mon, Sep 25, 2000 at 02:04:19PM -0600, [EMAIL PROTECTED] wrote: > > Right, but if the alternative is spurious ENOMEM when we can satisfy > > An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the > OS to do impossible tricks. Yes, but the ENOMEM _is_ spurious if you actually meant EAGAIN, and if the OS was perfectly capable of doing the retry itself. > > all of the pending requests just as long as they are serialised, is > > this a problem? > > I think you are solving the wrong problem. On a small memory machine, the kernel, > utilities, and applications should be configured to use little memory. > BusyBox is better than BeanCount. Any box is a small memory machine if you get the wrong workload on it, and the DoS attacks which are possible without beancounting let any user bring even a large system to its knees right now. If solving that problem also means that small memory machines do the right thing on their own rather than requiring specific manual configuration, then it sounds like a good aim. > > However, you just can't escape from the fact that on low memory > > machinnes, we *need* beancounter-style accounting of pinned pages or > > we'll be in Deep Trouble (TM). We already have nasty DoS situations > > What we need is simple kernel code that does not hold resources > into a possible deadlock situation. > On general principles, I don't see any substitute for clean code in the kernel and > my prediction is that if you show me an example of > DoS vulnerability, I can show you fix that does not require bean counting. > Am I wrong? If you have a user forking multiple processes and exhausting some resource, then at some point you have to do something about it. Let's say it's page tables, just for argument's sake, because those are currently non-swappable, but even if you make those swappable there are plenty of other resources it might be (eg. data shoved down unix domain sockets if you want another example). 
So you have run out of physical memory --- what do you do about it? The important observation here is that in a multi-user environment, simply denying further allocations isn't good enough --- unless you revoke those existing allocations you have DoS. And you can't fairly revoke existing allocations without knowing WHICH user has exhausted the memory (which requires beancounter-style resource tracking), AND having mechanisms in place to revoke all of the possible resources which might be involved (eg unix domain socket datagrams). kill -9 might solve that latter problem but it doesn't help in identifying who to kill. --Stephen > > > > > > -- > - > Victor Yodaiken > Finite State Machine Labs: The RTLinux Company. > www.fsmlabs.com www.rtlinux.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
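To illustrate why per-user tracking matters for picking a target, here is a small sketch in C. The task_usage record and both helper functions are hypothetical names invented for the example, and the flat array stands in for whatever per-task accounting a real kernel would keep. It contrasts "largest single task" with "user with the largest total", which is the distinction a fork-bomb exploits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record: owning uid and pinned pages charged to it. */
struct task_usage {
    unsigned uid;
    size_t pages;
};

/* uid owning the single largest task (what a per-process policy would pick). */
static unsigned largest_task_uid(const struct task_usage *t, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (t[i].pages > t[best].pages)
            best = i;
    return t[best].uid;
}

/* uid whose tasks together hold the most pages (a per-user policy).
 * Quadratic on purpose: it keeps the sketch free of auxiliary tables. */
static unsigned heaviest_user_uid(const struct task_usage *t, size_t n)
{
    unsigned worst_uid = t[0].uid;
    size_t worst_total = 0;

    for (size_t i = 0; i < n; i++) {
        size_t total = 0;

        for (size_t j = 0; j < n; j++)
            if (t[j].uid == t[i].uid)
                total += t[j].pages;
        if (total > worst_total) {
            worst_total = total;
            worst_uid = t[i].uid;
        }
    }
    return worst_uid;
}
```

With one 40-page task owned by uid 500 and four 15-page tasks owned by uid 1000, the per-process view blames uid 500 while the per-user view correctly identifies uid 1000, which holds 60 pages in total.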
Re: the new VMt
> my prediction is that if you show me an example of
> DoS vulnerability, I can show you fix that does not require bean counting.
> Am I wrong?

I think so. Page tables are a good example
Re: the new VMt
On Mon, Sep 25, 2000 at 08:25:49PM +0100, Stephen C. Tweedie wrote: > Hi, > > On Mon, Sep 25, 2000 at 12:34:56PM -0600, [EMAIL PROTECTED] wrote: > > > > > Process 1,2 and 3 all start allocating 20 pages > > > > now 57 pages are locked up in non-swapable kernel space and the system >deadlocks OOM. > > > > > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets > > > told "yes", and goes allocating them, blocking as necessary until it > > > > So you have a "pre-allocation allocator"? Leads to interesting and hard to detect > > bugs with old code that does not pre-allocate or with code that incorrectly >pre-allocates > > or that blocks on something unrelated > > Right, but if the alternative is spurious ENOMEM when we can satisfy An ENOMEM is not spurious if there is not enough memory. UNIX does not ask the OS to do impossible tricks. > all of the pending requests just as long as they are serialised, is > this a problem? I think you are solving the wrong problem. On a small memory machine, the kernel, utilities, and applications should be configured to use little memory. BusyBox is better than BeanCount. > However, you just can't escape from the fact that on low memory > machinnes, we *need* beancounter-style accounting of pinned pages or > we'll be in Deep Trouble (TM). We already have nasty DoS situations What we need is simple kernel code that does not hold resources into a possible deadlock situation. > which are embarassingly easy to reproduce. If we need such > beancounter protection, AND such protection can prevent the situation > you describe, then do we need to go looking for another way of > achieving the same protection? On general principles, I don't see any substitute for clean code in the kernel and my prediction is that if you show me an example of DoS vulnerability, I can show you fix that does not require bean counting. Am I wrong? -- - Victor Yodaiken Finite State Machine Labs: The RTLinux Company. 
www.fsmlabs.com www.rtlinux.com
Re: the new VMt [4MB+ blocks]
[EMAIL PROTECTED] said: > Sometimes allocating such monster memory blocks could be supported, > but it should not be expected to be *fast*. E.g. if doing it in > "reliable" way needs possibly moving currently allocated pages > away from memory to create such a hole(s), so be it. [EMAIL PROTECTED] said: > Anybody here who can describe those M$ API calls ? > Are they kernel/DDK-only, or userspace ones, or both ? NT does indeed support allocating contiguous buffers of memory, which is useful when the hardware in question doesn't do scatter-gather. I have on occasion been compelled to use these routines. (Paradoxically, the requirements in my case came from broken NT mmap support and not from the hardware. Blech!) Anyhow, these routines are indeed slow. And judging by the amount of disk noise I hear when they are called, they do try to kick out pages to make an allocation work. However, even so the M$ calls will eventually fail due to lack of large enough holes, so fragmentation takes its toll. So, they are both slow and unreliable under NT. But drivers that use them tend to be loaded once at boot time, and that's it. -- Steve Williams"The woods are lovely, dark and deep. [EMAIL PROTECTED] But I have promises to keep, [EMAIL PROTECTED]and lines to code before I sleep, http://www.picturel.com And lines to code before I sleep." - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 12:34:56PM -0600, [EMAIL PROTECTED] wrote:

> > > Process 1,2 and 3 all start allocating 20 pages
> > > now 57 pages are locked up in non-swapable kernel space and the
> > > system deadlocks OOM.
> >
> > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets
> > told "yes", and goes allocating them, blocking as necessary until it
>
> So you have a "pre-allocation allocator"? Leads to interesting and hard to detect
> bugs with old code that does not pre-allocate or with code that incorrectly
> pre-allocates or that blocks on something unrelated

Right, but if the alternative is spurious ENOMEM when we can satisfy all of the pending requests just as long as they are serialised, is this a problem?

If you want, wrap it in a "get_free_pagev" call which returns a vector of pointers to free pages, doing whatever accounting is needed. You don't have to push all of it to the callers.

However, you just can't escape from the fact that on low memory machines, we *need* beancounter-style accounting of pinned pages or we'll be in Deep Trouble (TM). We already have nasty DoS situations which are embarrassingly easy to reproduce. If we need such beancounter protection, AND such protection can prevent the situation you describe, then do we need to go looking for another way of achieving the same protection?

--Stephen
Re: the new VMt
Hi, On Mon, Sep 25, 2000 at 08:09:31PM +0100, Alan Cox wrote: > > > Indeed. But we wont fail the kmalloc with a NULL return > > > > Isn't that the preferred behaviour, though? If we are completely out > > of VM on a no-swap machine, we should be killing one of the existing > > processes rather than preventing any progress and keeping all of the > > old tasks alive but deadlocked. > > Unless Im missing something we wont kill any task in that condition - even > a SIGKILL will make no odds as everyone is asleep in kmalloc Right. Eeek. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
> > Indeed. But we wont fail the kmalloc with a NULL return
>
> Isn't that the preferred behaviour, though? If we are completely out
> of VM on a no-swap machine, we should be killing one of the existing
> processes rather than preventing any progress and keeping all of the
> old tasks alive but deadlocked.

Unless I'm missing something we won't kill any task in that condition - even a SIGKILL will make no odds as everyone is asleep in kmalloc
Re: the new VMt [4MB+ blocks]
[Chopped the recipient list radically] On Mon, Sep 25, 2000 at 06:06:11PM +0100, Alan Cox wrote: > > > > Stupidity has no limits... > > > Unfortunately its frequently wired into the hardware to save a few cents on > > > scatter gather logic. > > > > Since when hardware folks became exempt from the rule above? 128K is > > almost tolerable, there were requests for 64 _mega_bytes... > > Most cheap ass PCI hardware is built on the basis you can do linear 4Mb > allocations. There is a reason for this. You can do that 4Mb allocation on > NT or Windows 9x Sure, but intel processors have this neat 4 MB "super-page" feature in the MMU... (as we all well know) Sometimes allocating such monster memory blocks could be supported, but it should not be expected to be *fast*. E.g. if doing it in "reliable" way needs possibly moving currently allocated pages away from memory to create such a hole(s), so be it.. Anybody here who can describe those M$ API calls ? Are they kernel/DDK-only, or userspace ones, or both ? /Matti Aarnio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Mon, Sep 25, 2000 at 07:24:53PM +0100, Stephen C. Tweedie wrote: > Hi, > > On Mon, Sep 25, 2000 at 12:13:15PM -0600, [EMAIL PROTECTED] wrote: > > > > Definitely not. GFP_ATOMIC is reserved for things that really can't > > > swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll > > > have to increase the number of atomic-allocatable pages. > > > > Process 1,2 and 3 all start allocating 20 pages > > process 1 stalls after allocating 19 > > some memory is freed and process 2 runs and stall after allocating 19 > > some memory is free and process 3 runs and stalls after allocating 19 > > > > now 57 pages are locked up in non-swapable kernel space and the system >deadlocks OOM. > > Or go the beancounter route: process 1 asks "can I pin 20 pages", gets > told "yes", and goes allocating them, blocking as necessary until it So you have a "pre-allocation allocator"? Leads to interesting and hard to detect bugs with old code that does not pre-allocate or with code that incorrectly pre-allocates or that blocks on something unrelated preallocte 20 pages get first ask for an inode -- block waiting for an inode or preallocate 20 pages if(checkuserpath())return -ENOWAY; /* stranding my pre-allocate */ else get them pages What's nice about these is they don't cause errors on test and seem more difficult to spot than looking for cases where allocated memory gets stranded. Doesn't the alloc_vec method seem simpler to you? > gets them. Process 2 asks "can *I* pin 20 pages" and the answer is > either "not right now", in which case it waits for process 1 to > release its reservation, or "no, you've exceeded your user quota" in Or for someone else to free more pages ... > which case it fails with ENOMEM. (That latter case can protect us > against a lot of DoS attacks from local users.) I like ENOMEM anyways. 
>
> The same accounting really needs to be done for page tables, as that
> represents one of the biggest sources of unaccounted, unswappable
> pages which user processes can cause to be created right now.

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 12:13:15PM -0600, [EMAIL PROTECTED] wrote:

> > Definitely not. GFP_ATOMIC is reserved for things that really can't
> > swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll
> > have to increase the number of atomic-allocatable pages.
>
> Process 1, 2 and 3 all start allocating 20 pages
> process 1 stalls after allocating 19
> some memory is freed and process 2 runs and stalls after allocating 19
> some memory is freed and process 3 runs and stalls after allocating 19
>
> now 57 pages are locked up in non-swappable kernel space and the system
> deadlocks OOM.

Or go the beancounter route: process 1 asks "can I pin 20 pages", gets told "yes", and goes allocating them, blocking as necessary until it gets them. Process 2 asks "can *I* pin 20 pages" and the answer is either "not right now", in which case it waits for process 1 to release its reservation, or "no, you've exceeded your user quota", in which case it fails with ENOMEM. (That latter case can protect us against a lot of DoS attacks from local users.)

The same accounting really needs to be done for page tables, as that represents one of the biggest sources of unaccounted, unswappable pages which user processes can cause to be created right now.

--Stephen
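The reservation protocol described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct beancounter`, `struct pin_pool`, `reserve_pages()` and `release_pages()` are hypothetical names invented here, not kernel interfaces, and `-EAGAIN` stands in for the "not right now" answer, where a real implementation would presumably sleep until a reservation is released.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user accounting record (not a kernel structure). */
struct beancounter {
    size_t quota;     /* most pages this user may have reserved at once */
    size_t reserved;  /* pages this user currently has reserved */
};

/* Hypothetical system-wide pool of pages that may be pinned. */
struct pin_pool {
    size_t total;     /* pages that may be pinned in total */
    size_t reserved;  /* pages already promised to some user */
};

/*
 * Ask "can I pin n pages?".
 *   0        "yes": the pages are now reserved for the caller
 *   -ENOMEM  "no, you've exceeded your user quota"
 *   -EAGAIN  "not right now": the caller should wait for a release
 */
int reserve_pages(struct pin_pool *pool, struct beancounter *bc, size_t n)
{
    assert(pool->reserved <= pool->total);
    assert(bc->reserved <= bc->quota);

    if (n > bc->quota - bc->reserved)
        return -ENOMEM;
    if (n > pool->total - pool->reserved)
        return -EAGAIN;

    bc->reserved += n;
    pool->reserved += n;
    return 0;
}

/* Give a reservation back so that waiters can make progress. */
void release_pages(struct pin_pool *pool, struct beancounter *bc, size_t n)
{
    assert(n <= bc->reserved && n <= pool->reserved);
    bc->reserved -= n;
    pool->reserved -= n;
}
```

The point of the design is that the yes/no decision happens once, before any page is pinned, so a process never ends up holding 19 of the 20 pages it needs.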
Re: the new VMt
> there is no swap. If there is truly nothing kswapd can do to recover
> here, then we are truly OOM. Otherwise, kswapd should be able to free

Indeed. But we won't fail the kmalloc with a NULL return.
Re: the new VMt
On Mon, Sep 25, 2000 at 08:04:54PM +0200, Jamie Lokier wrote:
> [EMAIL PROTECTED] wrote:
> > > [EMAIL PROTECTED] wrote:
> > > >     walk = out;
> > > >     while(nfds > 0) {
> > > >             poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > > >             if (!tmp) {
> > >
> > > Shouldn't this be GFP_USER? (Which would also conveniently fix the
> > > problem Victor's pointing out...)
> >
> > It should probably be GFP_ATOMIC, if I understand the mm right.
>
> Definitely not. GFP_ATOMIC is reserved for things that really can't
> swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll
> have to increase the number of atomic-allocatable pages.

Process 1, 2 and 3 all start allocating 20 pages
process 1 stalls after allocating 19
some memory is freed and process 2 runs and stalls after allocating 19
some memory is freed and process 3 runs and stalls after allocating 19

now 57 pages are locked up in non-swappable kernel space and the system deadlocks OOM.

> > The algorithm for requesting a collection of resources and freeing all
> > of them on failure is simple, fast, and robust.
>
> Allocation is just as fast with GFP_KERNEL/USER, just less likely to

It's not speed, it's deadlock avoidance.

> fail and less likely to break something else that really needs
> GFP_ATOMIC allocations.

My point here is simply that error returns in memory allocation allow higher-level kernel operations to safely marshal a collection of resources, following a safe algorithm that is optimized for the case when there is no memory shortage and that only goes to the slow case when the system is stalling due to memory shortages anyway.

> -- Jamie

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
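A rough userspace sketch of the "request a collection of resources and free all of them on failure" algorithm, replaying the three-process scenario above. Everything here is an assumption made for illustration: `alloc_vec()` borrows the name mentioned elsewhere in the thread, but this is not that code, and `pool_free`, `page_get()` and `page_put()` merely simulate a machine with a fixed number of free pages on top of `malloc()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

/* Hypothetical count of free pages in the simulated machine. */
static size_t pool_free = 57;

/* Simulated page allocator: returns NULL instead of blocking. */
static void *page_get(void)
{
    void *page;

    if (pool_free == 0)
        return NULL;
    page = malloc(PAGE_BYTES);
    if (page != NULL)
        pool_free--;
    return page;
}

static void page_put(void *page)
{
    free(page);
    pool_free++;
}

/*
 * Fill pages[0..n-1] or nothing at all. On failure every page taken
 * so far is returned to the pool, so a caller never sits on a partial
 * set while waiting for the rest.
 */
int alloc_vec(void **pages, size_t n)
{
    size_t i;

    assert(pages != NULL);
    for (i = 0; i < n; i++) {
        pages[i] = page_get();
        if (pages[i] == NULL) {
            while (i > 0) {
                i--;
                page_put(pages[i]);
                pages[i] = NULL;
            }
            return -ENOMEM;
        }
    }
    return 0;
}

/* Release a vector obtained from a successful alloc_vec(). */
void free_vec(void **pages, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        page_put(pages[i]);
        pages[i] = NULL;
    }
}
```

With 57 free pages, two callers asking for 20 pages each succeed and the third gets `-ENOMEM` while leaving 17 pages free for others, rather than three callers each holding 19 pages and waiting forever.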
Re: the new VMt
On Mon, Sep 25, 2000 at 11:51:39AM -0600, [EMAIL PROTECTED] wrote:
> It should probably be GFP_ATOMIC, if I understand the mm right.

poll_wait is called from the f_op->poll callback from select just before a sleep, and since it's allowed to sleep too it should be GFP_KERNEL (not ATOMIC). Using GFP_ATOMIC where GFP_KERNEL can be used is a bug, and it can lead to failed allocations even while there's a huge amount of freeable/recyclable cache.

The reason it is GFP_KERNEL rather than GFP_USER is that the memory isn't allocated in userspace. On a solid VM the only difference between GFP_USER and GFP_KERNEL shows up when the machine runs truly out of memory. In 2.4.x GFP_KERNEL should probably be changed not to short the PF_MEMALLOC atomic queue when memory balancing fails (then they would be equal).

> The algorithm for requesting a collection of resources and freeing all of them
> on failure is simple, fast, and robust.

Yes, I tend to like that style too because it's obviously safe and it obviously can't deadlock during oom.

Andrea
Re: the new VMt
[EMAIL PROTECTED] wrote:
> > [EMAIL PROTECTED] wrote:
> > >     walk = out;
> > >     while(nfds > 0) {
> > >             poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> > >             if (!tmp) {
> >
> > Shouldn't this be GFP_USER? (Which would also conveniently fix the
> > problem Victor's pointing out...)
>
> It should probably be GFP_ATOMIC, if I understand the mm right.

Definitely not. GFP_ATOMIC is reserved for things that really can't swap or schedule right now. Use GFP_ATOMIC indiscriminately and you'll have to increase the number of atomic-allocatable pages.

> The algorithm for requesting a collection of resources and freeing all
> of them on failure is simple, fast, and robust.

Allocation is just as fast with GFP_KERNEL/USER, just less likely to fail and less likely to break something else that really needs GFP_ATOMIC allocations.

-- Jamie
Re: the new VMt
On Mon, Sep 25, 2000 at 07:18:29PM +0200, Jamie Lokier wrote:
> [EMAIL PROTECTED] wrote:
> >     walk = out;
> >     while(nfds > 0) {
> >             poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
> >             if (!tmp) {
>
> Shouldn't this be GFP_USER? (Which would also conveniently fix the
> problem Victor's pointing out...)

It should probably be GFP_ATOMIC, if I understand the mm right.

The algorithm for requesting a collection of resources and freeing all of them on failure is simple, fast, and robust.

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
On Mon, 25 Sep 2000, Oliver Xymoron wrote:
> Sure about that? It's been a while, but I seem to recall NT enforcing a
> scatter-gather framework on all drivers because it only gave them virtual
> allocations. For the cheaper cards, the s-g was done by software issuing
> single span requests to the card.

The Matrox framegrabber guys use some API under NT to allocate megabytes upon megabytes of contiguous memory for DMA.

Jeff
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 05:51:49PM +0100, Alan Cox wrote:
> > > 2 active processes, no swap
> > >
> > > #1              #2
> > > kmalloc 32K     kmalloc 16K
> > > OK              OK
> > > kmalloc 16K     kmalloc 32K
> > > block           block
> >
> > ... and we get two wakeup_kswapd()s. kswapd has PF_MEMALLOC and so is
> > able to eat memory which processes #1 and #2 are not allowed to touch.
>
> 'no swap'

kswapd is perfectly capable of evicting clean pages and triggering any necessary writeback of dirty filesystem data at this point, even if there is no swap. If there is truly nothing kswapd can do to recover here, then we are truly OOM. Otherwise, kswapd should be able to free the required memory, provided that the PF_MEMALLOC flag allows it to eat into a reserved set of free pages which nobody else can allocate once the number of physical free pages drops below a certain threshold.

--Stephen
Re: the new VMt
On Mon, 25 Sep 2000, Alan Cox wrote:
> > > > Stupidity has no limits...
> > >
> > > Unfortunately its frequently wired into the hardware to save a few cents on
> > > scatter gather logic.
> >
> > Since when hardware folks became exempt from the rule above? 128K is
> > almost tolerable, there were requests for 64 _mega_bytes...
>
> Most cheap ass PCI hardware is built on the basis you can do linear 4Mb
> allocations. There is a reason for this. You can do that 4Mb allocation on
> NT or Windows 9x

Sure about that? It's been a while, but I seem to recall NT enforcing a scatter-gather framework on all drivers because it only gave them virtual allocations. For the cheaper cards, the s-g was done by software issuing single span requests to the card.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
Re: the new VMt
Hi,

On Mon, Sep 25, 2000 at 06:05:00PM +0200, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
>
> How can you make progress if there isn't swap available and all the
> freeable page/buffer cache has already been freed? The deadlock happens
> in OOM condition (not when we can make progress).

Agreed --- this assumes that all pinned, nonswappable pages are subject to resource limiting to prevent them from exhausting the whole of memory. For things like page tables, that means we need beancounter in place for us to be 100% safe. For the no-swap case, that requires an OOM killer.

The problem of avoiding filling memory with pinned pages is orthogonal to the problem of managing the unpinned memory. Both are obviously required for a stable system.

Cheers,
Stephen
Re: the new VMt
[EMAIL PROTECTED] wrote:
>     walk = out;
>     while(nfds > 0) {
>             poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
>             if (!tmp) {

Shouldn't this be GFP_USER? (Which would also conveniently fix the problem Victor's pointing out...)

-- Jamie
Re: the new VMt
On Mon, Sep 25, 2000 at 02:10:07PM -0300, Rik van Riel wrote:
> Not really. We could fix this by making the page freeing
> functions smarter and only free the pages we need.

That's what I proposed in the first place, in fact. To free a large chunk of memory you may have to throw away lots of cache. We're not freeing contiguous cache as we do in 2.2.x.

Andrea
Re: the new VMt
On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
>
> > Unless Im missing something here think about this case
> >
> > 2 active processes, no swap
> >
> > #1              #2
> > kmalloc 32K     kmalloc 16K
> > OK              OK
> > kmalloc 16K     kmalloc 32K
> > block           block
>
> ... and we get two wakeup_kswapd()s. kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.
> Progress is made, clean pages are discarded and dirty ones queued for
> write, memory becomes free again and the world is a better place.
>
> Or so goes the theory, at least.

from fs/select.c

	walk = out;
	while(nfds > 0) {
		poll_table *tmp = (poll_table *) __get_free_page(GFP_KERNEL);
		if (!tmp) {
			while(out != NULL) {
				tmp = out->next;
				free_page((unsigned long)out);
				out = tmp;
			}
			return NULL;
		}
		tmp->nr = 0;
		tmp->entry = (struct poll_table_entry *)(tmp + 1);
		tmp->next = NULL;
		walk->next = tmp;
		walk = tmp;
		nfds -= __MAX_POLL_TABLE_ENTRIES;
	}

> --Stephen

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Re: the new VMt
> > > Stupidity has no limits...
> >
> > Unfortunately its frequently wired into the hardware to save a few cents on
> > scatter gather logic.
>
> Since when hardware folks became exempt from the rule above? 128K is
> almost tolerable, there were requests for 64 _mega_bytes...

Most cheap ass PCI hardware is built on the basis you can do linear 4Mb allocations. There is a reason for this. You can do that 4Mb allocation on NT or Windows 9x.
Re: the new VMt
On Mon, Sep 25, 2000 at 07:03:46PM +0200, Ingo Molnar wrote:
> [..] __GFP_SOFT solves this all very nicely [..]

s/very nicely/throwing away lots of useful cache for no one good reason/

Andrea
Re: the new VMt
On Mon, Sep 25, 2000 at 09:49:46AM -0700, Linus Torvalds wrote:
> [..] I
> don't think the balancing has to take the order of the allocation into
> account [..]

Why do you prefer to throw away most of the cache (potentially at fork time) instead of freeing only the few contiguous bits that we need?

Andrea
Re: the new VMt
On Mon, 25 Sep 2000, Alan Cox wrote:
> > > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > > might use megabytes or contiguous RAM?
> >
> > Stupidity has no limits...
>
> Unfortunately its frequently wired into the hardware to save a few cents on
> scatter gather logic.

Since when hardware folks became exempt from the rule above? 128K is almost tolerable, there were requests for 64 _mega_bytes...
Re: the new VMt
> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
>
> Stupidity has no limits...

Unfortunately it's frequently wired into the hardware to save a few cents on scatter gather logic.

We need 128K blocks for sound DMA buffers, and for most sound cards they need to be linear (but not the newer ones, thankfully). Some video capture hardware needs 4Mb, but that needs to use bootmem (in 2.2 they use bigmem hacks).
Re: the new VMt
> > kmalloc 16K     kmalloc 32K
> > block           block
>
> 2) set PF_MEMALLOC on the task you're killing for OOM,
>    that way this task will either get the memory or
>    fail (note that PF_MEMALLOC tasks don't wait)

Nobody is out of memory at this point. Everyone is in kernel space blocking for someone else. There is also no further allocation after this deadlock point to cause a kill.
Re: the new VMt
> Frankly, how often do we allocate multi-order pages? I've just made quick
> statistics wrt. how allocation orders are distributed on a more or less
> typical system:

Often enough that failures here crashed older 2.2 kernels, because the tcp code ended up looping trying to get memory and the slab allocator couldn't get a new multipage block.

Alan
Re: the new VMt
On Mon, 25 Sep 2000, Linus Torvalds wrote:
> Yes, I'm inclined to agree. Or at least not disagree. I'm more arguing
> that the order itself may not be the most interesting thing, and that
> I don't think the balancing has to take the order of the allocation
> into account - because it should be equivalent to just tell that it's
> a soft allocation (whether though the current !__GFP_HIGH or through a
> new __GFP_SOFT with slightly different logic).

yep, and there is another problem with pure order-based distinction: if i do kmalloc(5k), and write the code on Alpha and expect it to never fail, shouldn't i expect this to never fail on x86 as well? Along with the fork() failure. __GFP_SOFT solves this all very nicely - the *allocator* decides what allocation policy to follow. Great!

Ingo
Re: the new VMt
> > 2 active processes, no swap
> >
> > #1              #2
> > kmalloc 32K     kmalloc 16K
> > OK              OK
> > kmalloc 16K     kmalloc 32K
> > block           block
>
> ... and we get two wakeup_kswapd()s. kswapd has PF_MEMALLOC and so is
> able to eat memory which processes #1 and #2 are not allowed to touch.

'no swap'
Re: the new VMt
On Mon, 25 Sep 2000, Linus Torvalds wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> >
> > But I'd much prefer to pass not only the classzone from allocator
> > to memory balancing, but _also_ the order of the allocation,
> > and then shrink_mmap will know it isn't worth freeing anything
> > that isn't contiguous on the order of the allocation that we need.
>
> I suspect that the proper way to do this is to just make another gfp_flag,
> which is basically another hint to the mm layer that we're doing a
> multi-page allocation and that the MM layer should not try forever to
> handle it.
>
> In fact, that's independent of whether it is a multi-page
> allocation or not. It might be something like __GFP_SOFT - you
> could use it with single pages too.
>
> Thinking about it, we do have it already. It's called
> !__GFP_HIGH, and it is used by all the GFP_USER allocations.

Hmm, I think these two are orthogonal.

__GFP_HIGH means that we are allowed to eat deeper into the free list (maybe needed to avoid a deadlock freeing pages).

__GFP_SOFT would mean "don't bother waiting for free pages", which is something very different...

(I wouldn't want a user process to get killed simply because kswapd is waiting for IO to finish on a swapout, in that case we really do want to sleep for a while)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
Re: the new VMt
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> But I'd much prefer to pass not only the classzone from allocator
> to memory balancing, but _also_ the order of the allocation,
> and then shrink_mmap will know it isn't worth freeing anything
> that isn't contiguous on the order of the allocation that we need.

I suspect that the proper way to do this is to just make another gfp_flag, which is basically another hint to the mm layer that we're doing a multi-page allocation and that the MM layer should not try forever to handle it.

In fact, that's independent of whether it is a multi-page allocation or not. It might be something like __GFP_SOFT - you could use it with single pages too.

Thinking about it, we do have it already. It's called !__GFP_HIGH, and it is used by all the GFP_USER allocations.

Linus
Re: the new VMt
On Mon, 25 Sep 2000, Alexander Viro wrote:
> On Mon, 25 Sep 2000, Ingo Molnar wrote:
> > yep, i agree. I'm not sure what the biggest allocation is, some drivers
> > might use megabytes or contiguous RAM?
>
> Stupidity has no limits...

Blame the hardware designers... and give me my big allocations. :)

Sound drivers (not mine though) do stuff like

	order = 20; /* just a made-up high number */
	while ((order-- > 0) && (mem == NULL)) {
		mem = __get_free_pages (GFP_KERNEL, order);
	}
	/* use sound buffer 'mem' */

Older or modern, less-than-cool framegrabbers need tons of contiguous memory too...

Jeff
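The fallback loop quoted above (start at a high order and step down until an allocation succeeds) can be tried out in userspace with a simulated allocator. This is a sketch only: `get_pages_sim()`, `largest_free_order` and `alloc_largest()` are hypothetical names introduced here, and `malloc()` stands in for `__get_free_pages()`, which it does not otherwise resemble.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096UL

/* Hypothetical knob: the largest order the simulated allocator can satisfy. */
static int largest_free_order = 5;

/* Simulated stand-in for an order-based page allocator. */
static void *get_pages_sim(int order)
{
    if (order > largest_free_order)
        return NULL;
    return malloc(PAGE_BYTES << order);
}

/*
 * Try start_order, then each smaller order down to 0, and return the
 * first buffer obtained. *got_order receives the order that succeeded,
 * or -1 if even a single page could not be allocated.
 */
void *alloc_largest(int start_order, int *got_order)
{
    int order;

    assert(start_order >= 0 && got_order != NULL);
    for (order = start_order; order >= 0; order--) {
        void *mem = get_pages_sim(order);

        if (mem != NULL) {
            *got_order = order;
            return mem;
        }
    }
    *got_order = -1;
    return NULL;
}
```

Calling `alloc_largest(19, &got)` mirrors the quoted loop, whose first attempt is order 19 because of the post-decrement. The caller has to record which order it actually got, since the buffer size depends on it.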
Re: the new VMt
On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andi Kleen wrote:
>
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
>
> yep, but these cases are not affected, i think in the order != 0
> case we should return NULL if a certain number of iterations did
> not yield any free page.

Indeed. You're right here.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
Re: the new VMt
On Mon, Sep 25, 2000 at 06:18:17PM +0200, Andi Kleen wrote:
> On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > > Another thing I would worry about are ports with multiple user page
> > > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > > but may likely need a 16K kernel stack due to the 64bit stack bloat.
> >
> > yep, but these cases are not affected, i think in the order != 0 case we
> > should return NULL if a certain number of iterations did not yield any
> > free page.
>
> Ok, that would just break fork()

Not sure if I have the whole context (I've not yet received Ingo's email that you're replying to).

Currently we do a memory balancing pass independently of the order of the allocation. Thus we don't do any iteration, and the memory balancing is completely order blind (unfortunately it's also zone blind, while at least in 2.2.x the memory balancing knew which zone it had to allocate memory from).

If Ingo suggested more iterations of memory balancing for those cases, that should only make things better with respect to fragmentation.

But I'd much prefer to pass not only the classzone from allocator to memory balancing, but _also_ the order of the allocation, and then shrink_mmap will know it isn't worth freeing anything that isn't contiguous on the order of the allocation that we need. classzone hasn't reached this point yet.

Andrea
Re: the new VMt
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> > Progress is made, clean pages are discarded and dirty ones queued for
>
> How can you make progress if there isn't swap available and all the
> freeable page/buffer cache has already been freed? The deadlock happens
> in OOM condition (not when we can make progress).

This is exactly why integrating the OOM killer is on my TODO list.

The important difference between the new VM and the old one is that we can't fail while we are not OOM, whereas the old allocator could break down even when we still had enough swap free.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
Re: the new VMt
On Mon, Sep 25, 2000 at 06:22:42PM +0200, Ingo Molnar wrote:
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes or contiguous RAM?

I'm not sure (we should grep all the drivers to be sure...) but I bet the old 2.2.0 MAX_ORDER #define will work for everything.

The fact is that over a certain order there's no hope anyway at runtime, and the only big allocations done through the init sequence are for the hashtable.

Andrea
Re: the new VMt
On Mon, 25 Sep 2000, Alan Cox wrote:
> > > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > > swapper cannot make space you die.
> >
> > if one can get everything jammed waiting for GFP_KERNEL, and not being
> > able to deallocate anything, thats a VM or resource-limit bug. This
> > situation is just 1% RAM away from the 'root cannot log in', situation.
>
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
> #1              #2
> kmalloc 32K     kmalloc 16K
> OK              OK
> kmalloc 16K     kmalloc 32K
> block           block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in
> some cases with care, but when we have no pages left something
> has to give

The trick here is to:
1) keep some reserved pages around for PF_MEMALLOC tasks
   (we need this anyway)
2) set PF_MEMALLOC on the task you're killing for OOM,
   that way this task will either get the memory or
   fail (note that PF_MEMALLOC tasks don't wait)

This way the OOM-killed task will be able to exit quickly and the rest of the system will not get killed as a side effect.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
Re: the new VMt
On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > Another thing I would worry about are ports with multiple user page
> > sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> > but may likely need a 16K kernel stack due to the 64bit stack bloat.
>
> yep, but these cases are not affected, i think in the order != 0 case we
> should return NULL if a certain number of iterations did not yield any
> free page.

Ok, that would just break fork()

-Andi
Re: the new VMt
On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> >
> > You're right. That's why it's a waste to have so many order in the
> > buddy allocator. [...]
>
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes or contiguous RAM?

Stupidity has no limits...
Re: the new VMt
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
>
> You're right. That's why it's a waste to have so many order in the
> buddy allocator. [...]

yep, i agree. I'm not sure what the biggest allocation is, some drivers might use megabytes or contiguous RAM?

Ingo
Re: the new VMt
On Mon, 25 Sep 2000, Andi Kleen wrote:
> An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
> in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable
> (=GFP_ATOMIC) 16K allocations.

the discussion does not affect GFP_ATOMIC - GFP_ATOMIC allocators *must* be prepared to handle occasional oom situations gracefully.

> Another thing I would worry about are ports with multiple user page
> sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages
> but may likely need a 16K kernel stack due to the 64bit stack bloat.

yep, but these cases are not affected, i think in the order != 0 case we should return NULL if a certain number of iterations did not yield any free page.

Ingo
Re: the new VMt
On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote: > Frankly, how often do we allocate multi-order pages? I've just made quick > statistics wrt. how allocation orders are distributed on a more or less > typical system: > > (ALLOC ORDER) > 0: 167081 > 1: 850 > 2: 16 > 3: 25 > 4: 0 > 5: 1 > 6: 0 > 7: 2 > 8: 13 > 9: 5 > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb > task-structure. The rest is 0.05%. An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable (=GFP_ATOMIC) 16K allocations. Another thing I would worry about are ports with multiple user page sizes in 2.5. Another ugly case is the x86-64 port which has 4K pages but may likely need a 16K kernel stack due to the 64bit stack bloat. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: the new VMt
On Mon, Sep 25, 2000 at 06:02:18PM +0200, Ingo Molnar wrote:
> Frankly, how often do we allocate multi-order pages? I've just made quick

The deadlock Alan pointed out can also happen with single-page allocations if we put a loop in GFP_KERNEL in 2.4.x-current.

> ie. 99.45% of all allocations are single-page! 0.50% is the 8kb

You're right. That's why it's a waste to have so many orders in the buddy allocator. Even more now that the hashtables should be allocated with the bootmem allocator! :) Chuck has seen the slowdown from increasing the highest order allocation in his bench. But of course in 2.2.x we can't avoid that.

Andrea
Re: the new VMt
On Mon, Sep 25, 2000 at 04:42:49PM +0100, Stephen C. Tweedie wrote:
> Progress is made, clean pages are discarded and dirty ones queued for

How can you make progress if there isn't swap available and all the freeable page/buffer cache has already been freed? The deadlock happens in OOM condition (not when we can make progress).

Andrea
Re: the new VMt
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> Ingo's point is that the underlined line won't ever happen in the
> first place

please dont misinterpret my point ...

Frankly, how often do we allocate multi-order pages? I've just made quick statistics wrt. how allocation orders are distributed on a more or less typical system:

(ALLOC ORDER)
0: 167081
1: 850
2: 16
3: 25
4: 0
5: 1
6: 0
7: 2
8: 13
9: 5

ie. 99.45% of all allocations are single-page! 0.50% is the 8kb task-structure. The rest is 0.05%.

i'm not talking about 4MB contiguous physical allocations having to succeed on a 8MB box. I'm talking about 99% of the simple allocation points not having to worry about a NULL pointer. (not checking for NULL is one of the most common allocation-related bugs that bite low-RAM systems.)

Ingo
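The percentages quoted above can be rechecked from the raw counts. What follows is a minimal sketch in Python, not anything posted in the thread. The 4K page size is an assumption (i386), and the names `ORDER_COUNTS` and `order_shares` are made up for illustration. The figures in the message are rounded, so small differences in the last digit are to be expected.

```python
# Allocation counts per buddy order, copied from the statistics above.
ORDER_COUNTS = {0: 167081, 1: 850, 2: 16, 3: 25, 4: 0,
                5: 1, 6: 0, 7: 2, 8: 13, 9: 5}

PAGE_SIZE_KB = 4  # assumption: 4K pages, as on i386


def order_shares(counts):
    """Return {order: percentage of all allocations} for the given counts."""
    total = sum(counts.values())
    return {order: 100.0 * n / total for order, n in counts.items()}


if __name__ == "__main__":
    shares = order_shares(ORDER_COUNTS)
    for order, pct in sorted(shares.items()):
        # An order-n allocation covers 2**n contiguous pages.
        size_kb = PAGE_SIZE_KB << order
        print(f"order {order}: {size_kb:5d} KB  {pct:7.3f}%")
```

Order 1 is the 8K allocation that the message attributes to the task structure; everything from order 2 upward is the small remainder.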
Re: the new VMt
On Mon, Sep 25, 2000 at 04:16:56PM +0100, Alan Cox wrote:
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
> #1              #2
> kmalloc 32K     kmalloc 16K
> OK              OK
> kmalloc 16K     kmalloc 32K
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> block           block

Yep, you're not missing anything. That was my complaint: GFP_KERNEL not failing will obviously deadlock the kernel all over the place.

Ingo's point is that the underlined line won't ever happen in the first place because of the resource accounting that will tell the upper layers that they can't try to allocate anything, so they won't enter kmalloc at all. But he's obviously not talking about 2.4.x. (And I'm not sure if that's the right way to go in the general case, but certainly it's the right way to go for special cases like skbs with gigabit ethernet.)

In 2.4.x GFP_KERNEL not failing is a deadlock as you said.

Andrea
Re: the new VMt
> > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > swapper cannot make space you die.
>
> if one can get everything jammed waiting for GFP_KERNEL, and not being
> able to deallocate anything, thats a VM or resource-limit bug. This
> situation is just 1% RAM away from the 'root cannot log in', situation.

Unless Im missing something here think about this case

2 active processes, no swap

#1              #2
kmalloc 32K     kmalloc 16K
OK              OK
kmalloc 16K     kmalloc 32K
block           block

so GFP_KERNEL has to be able to fail - it can wait for I/O in some cases with care, but when we have no pages left something has to give
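The two-process case above can be played through in a toy user-space model. This is only a sketch under stated assumptions, not kernel code: it assumes exactly 48K of free memory, no swap and nothing reclaimable, and it assumes that a task whose allocation fails backs out and frees what it already holds. The names `Pool`, `run` and `REQUESTS` are hypothetical.

```python
class Pool:
    """Toy model of free kernel memory in KB (no swap, nothing reclaimable)."""

    def __init__(self, free_kb):
        self.free_kb = free_kb

    def alloc(self, size_kb):
        """Take the memory and return True if it fits, else return False."""
        if size_kb <= self.free_kb:
            self.free_kb -= size_kb
            return True
        return False

    def free(self, size_kb):
        self.free_kb += size_kb


def run(requests, free_kb, may_fail):
    """Step every task through its allocation list, one request per round.

    requests maps a task name to its list of sizes in KB. With may_fail set,
    a request that does not fit makes the task back out and free what it
    holds (assumption). Without it the task sleeps and keeps what it holds.
    Returns {task name: "done" | "failed" | "blocked"}.
    """
    pool = Pool(free_kb)
    held = {name: 0 for name in requests}
    todo = {name: list(sizes) for name, sizes in requests.items()}
    outcome = {}
    progress = True
    while todo and progress:
        progress = False
        for name in list(todo):
            size = todo[name][0]
            if pool.alloc(size):
                held[name] += size
                todo[name].pop(0)
                progress = True
                if not todo[name]:
                    # All requests satisfied: the task exits and frees all.
                    pool.free(held[name])
                    outcome[name] = "done"
                    del todo[name]
            elif may_fail:
                # The allocation returns an error and the task backs out.
                pool.free(held[name])
                outcome[name] = "failed"
                del todo[name]
                progress = True
    for name in todo:
        # Nobody could make progress: these tasks sleep forever.
        outcome[name] = "blocked"
    return outcome


REQUESTS = {"#1": [32, 16], "#2": [16, 32]}

if __name__ == "__main__":
    print("never fails:", run(REQUESTS, 48, may_fail=False))
    print("may fail:   ", run(REQUESTS, 48, may_fail=True))
```

In this model the never-failing variant leaves both tasks blocked, each holding memory the other needs, while the failing variant lets #1 back out so that #2 can finish.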
Re: the new VMt
On Mon, 25 Sep 2000, Andi Kleen wrote:
> An important exception in 2.2/2.4 is NFS with bigger rsize (will be fixed
> in 2.5, but 2.4 does it this way). For an 8K r/wsize you need reliable
> (=GFP_ATOMIC) 16K allocations.

the discussion does not affect GFP_ATOMIC - GFP_ATOMIC allocators *must* be prepared to handle occasional oom situations gracefully.

> Another thing I would worry about are ports with multiple user page sizes
> in 2.5. Another ugly case is the x86-64 port which has 4K pages but may
> likely need a 16K kernel stack due to the 64bit stack bloat.

yep, but these cases are not affected, i think in the order != 0 case we should return NULL if a certain number of iterations did not yield any free page.

Ingo
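The rule suggested above (loop for order 0, give up after some number of iterations for higher orders) could be sketched as follows. This is a user-space illustration in Python, not the kernel's allocator: `try_alloc` and `reclaim` are hypothetical callbacks, and `MAX_TRIES` is a made-up number since the message names no limit.

```python
MAX_TRIES = 8  # hypothetical retry limit; the message names no number


def alloc_pages(order, try_alloc, reclaim, max_tries=MAX_TRIES):
    """Toy allocation loop.

    try_alloc(order) returns a page handle or None, and reclaim() tries to
    free some memory. Order-0 requests retry until they succeed. Requests
    with order != 0 return None after max_tries failed attempts.
    """
    tries = 0
    while True:
        page = try_alloc(order)
        if page is not None:
            return page
        tries += 1
        if order != 0 and tries >= max_tries:
            return None  # the caller must handle the failure
        reclaim()


def flaky_allocator(failures):
    """Return a try_alloc stand-in that fails `failures` times, then succeeds."""
    left = [failures]

    def try_alloc(order):
        if left[0] > 0:
            left[0] -= 1
            return None
        return ("page", order)

    return try_alloc


if __name__ == "__main__":
    print(alloc_pages(0, flaky_allocator(20), lambda: None))
    print(alloc_pages(1, flaky_allocator(20), lambda: None))
```

Note that the order-0 path in this sketch never returns None, so it only terminates if memory eventually becomes available, which is exactly the case the rest of the thread argues about.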
Re: the new VMt
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
>
> You're right. That's why it's a waste to have so many orders in the
> buddy allocator. [...]

yep, i agree. I'm not sure what the biggest allocation is, some drivers might use megabytes of contiguous RAM?

Ingo
Re: the new VMt
On Mon, 25 Sep 2000, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > > ie. 99.45% of all allocations are single-page! 0.50% is the 8kb
> >
> > You're right. That's why it's a waste to have so many orders in the
> > buddy allocator. [...]
>
> yep, i agree. I'm not sure what the biggest allocation is, some drivers
> might use megabytes of contiguous RAM?

Stupidity has no limits...
Re: the new VMt
On Mon, Sep 25, 2000 at 06:19:07PM +0200, Ingo Molnar wrote:
> > Another thing I would worry about are ports with multiple user page sizes
> > in 2.5. Another ugly case is the x86-64 port which has 4K pages but may
> > likely need a 16K kernel stack due to the 64bit stack bloat.
>
> yep, but these cases are not affected, i think in the order != 0 case we
> should return NULL if a certain number of iterations did not yield any
> free page.

Ok, that would just break fork()

-Andi
Re: the new VMt
On Mon, 25 Sep 2000, Alan Cox wrote:
> > > GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
> > > everything jammed in kernel space waiting on GFP_KERNEL and if the
> > > swapper cannot make space you die.
> >
> > if one can get everything jammed waiting for GFP_KERNEL, and not being
> > able to deallocate anything, thats a VM or resource-limit bug. This
> > situation is just 1% RAM away from the 'root cannot log in', situation.
>
> Unless Im missing something here think about this case
>
> 2 active processes, no swap
>
> #1              #2
> kmalloc 32K     kmalloc 16K
> OK              OK
> kmalloc 16K     kmalloc 32K
> block           block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in some
> cases with care, but when we have no pages left something has to give

The trick here is to:

1) keep some reserved pages around for PF_MEMALLOC tasks (we need this anyway)
2) set PF_MEMALLOC on the task you're killing for OOM, that way this task will either get the memory or fail (note that PF_MEMALLOC tasks don't wait)

This way the OOM-killed task will be able to exit quickly and the rest of the system will not get killed as a side effect.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/
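The two steps above can be illustrated with a small user-space model. It is a sketch under assumptions, not the kernel implementation: the reserve size, the class names `Task` and `PagePool`, and the `oom_kill` helper are all hypothetical, and the `memalloc` attribute merely stands in for the PF_MEMALLOC flag.

```python
RESERVED_PAGES = 4  # hypothetical size of the emergency reserve


class Task:
    def __init__(self, name):
        self.name = name
        self.memalloc = False  # stands in for the PF_MEMALLOC flag


class PagePool:
    def __init__(self, free_pages, reserved=RESERVED_PAGES):
        self.free_pages = free_pages
        self.reserved = reserved

    def alloc(self, task, pages):
        """Return "ok", "fail" or "wait" for a request of `pages` pages."""
        # Ordinary tasks may not dip into the reserve; flagged tasks may.
        floor = 0 if task.memalloc else self.reserved
        if self.free_pages - pages >= floor:
            self.free_pages -= pages
            return "ok"
        # Flagged tasks never sleep: they get the memory or an error.
        return "fail" if task.memalloc else "wait"


def oom_kill(task):
    """Mark the chosen victim so that it can use the reserve while exiting."""
    task.memalloc = True


if __name__ == "__main__":
    pool = PagePool(free_pages=4)
    victim, other = Task("victim"), Task("other")
    print(pool.alloc(victim, 1))  # only the reserve is left
    oom_kill(victim)
    print(pool.alloc(victim, 3))  # the victim may now use the reserve
    print(pool.alloc(other, 1))   # everyone else still has to wait
```

The point of the design is that the victim's exit path never sleeps in the allocator, so it releases its memory quickly without other tasks being able to drain the reserve first.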