Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus if you add a pci32 zone, it'll be needed for it too on
x86-64, like for the normal zone on x86, since ptes will go in highmem
while pci32 allocations will not. So while floppy might be fixed, this
issue would remain for the brand-new pci32 zone needed by some devices
(i.e. nvidia, so not such an unlikely corner case).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone all slab cache will be a memory pin (like ptes for
highmem, but not that many people run with 3G of ram in ptes, and I
guess the ones doing it aren't normally using a mainline kernel in the
first place, so they're likely not running into it either). Slab cache
pinning the normal zone has a higher probability of being reproduced on
l-k in random usage.
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem from
> highmem-capable allocations may actually *help* things. (Tries to remember
> why) By reducing inode/dentry eviction rates? I asked Martin Bligh if he
> could test that on a big NUMA box but iirc the results were inconclusive.

This is correct: by guaranteeing more memory to be freeable in lowmem
(ptes aren't freeable without a sigkill, for example), the icache/dcache
will at least have a margin where it can grow independently of highmem
allocations.

> Maybe it just won't make much difference. Hard to say.

I don't know myself if it makes a performance difference; all old
benchmarks have been run with this applied. This was applied for
correctness (i.e. to avoid sigkills or lockups), it wasn't applied for
performance. But I don't see how it could hurt performance (especially
given the current code already does the check at runtime, which is
practically the only fast-path cost ;).

> > The sysctl name had to change to lowmem_reserve_ratio because its
> > semantics are completely different now.
>
> That reminds me. Documentation/filesystems/proc.txt ;)

Woops, forgot about it ;)

> I'll cook something up for that.

Thanks. If you prefer, I can write it too to relieve you from this load;
it's up to you. If you want to fix it yourself go ahead of course ;)
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:46 -0800, Andrew Morton wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > It does turn on lowmem protection by default. We never reached
> > an agreement about doing this though, but Andrea has shown that
> > it fixes trivial OOM cases.
> >
> > I think it should be turned on by default. I can't recall what
> > your reservations were...?
>
> Just that it throws away a bunch of potentially usable memory. In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.
>
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small. Why change the setting on
> all machines for the benefit of the tiny few? Seems weird. Especially
> when this problem could be solved with a few-line initscript. Ho hum.

That is true, but it should not reserve a great deal of memory on small
memory machines. ZONE_NORMAL reservation may not even be too noticeable,
as you'll usually have ZONE_NORMAL allocations during the course of
normal running.

Although it is true that there haven't been many problems attributed to
this, one example I can remember: when we fixed the __alloc_pages
watermark code, we fixed a bug that was reserving much more ZONE_DMA
than it was supposed to. This caused all those page allocation failure
problems. So we raised the atomic reserve, but that didn't bring
ZONE_DMA reservation back to its previous levels.

"So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

  2.6.8     | 465 dma, 117 norm, 582 tot = 2328K
  2.6.10-rc |   2 dma, 146 norm, 148 tot =  592K
  patch     |  12 dma, 500 norm, 512 tot = 2048K"

So we were still seeing GFP_DMA allocation failures in the sound code.
You recently had to make that NOWARN to shut it up.

OK this is a fairly lame example... but the current code is more or
less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
on machines that need explicit ZONE_DMA allocations.
Re: OOM fixes 2/5
Andrew Morton <[EMAIL PROTECTED]> writes:
> Just that it throws away a bunch of potentially usable memory. In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.

We ran into a big problem with this on x86-64. The SUSE installer would
load the floppy driver during installation. The floppy driver would try
to allocate some pages with GFP_DMA, and on a small memory x86-64 system
(256-512MB) the OOM killer would always start to kill things trying to
free some DMA pages. This was quite a show stopper because you
effectively couldn't install.

So at least for GFP_DMA it seems to be definitely needed.

-Andi
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small. Why change the setting on
> all machines for the benefit of the tiny few? Seems weird. Especially
> when this problem could be solved with a few-line initscript. Ho hum.

It's up to you. IMHO you're making a mistake, but I don't mind as long
as our customers aren't at risk of early oom kills (or worse, kernel
crashes) with some db load (especially without swap the risk is huge for
all users, since all anonymous memory will be pinned like ptes, but with
~3G of pagetables they're at risk even with swap).

At least you *must* admit that without my patch applied as I posted,
there's a >0 probability of running out of normal zone, which will lead
to an oom-kill or a deadlock even though 10G of highmem might still be
freeable (like with clean cache). And my patch obviously cannot make it
impossible to run out of normal zone, since there's only 800m of normal
zone and one can open more files than what fits in normal zone, but at
least it gives the user the security that a certain workload can run
reliably. Without this patch there's no guarantee at all that any
workload will run when >1G of ptes is allocated.

The fix below is needed as well, and you won't find reports of people
reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I
know you were working on it (you said not in the weekend IIRC), but I've
upgraded to the latest bk, so I had to fix it up quickly or I would have
had to run the racy code on my smp systems to test new kernels.
From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fixup smp race introduced in 2.6.11-rc1

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/memory.c.~1~	2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c	2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_
 	spin_lock(&mapping->i_mmap_lock);
 
+	/* serialize i_size write against truncate_count write */
+	smp_wmb();
 	/* Protect against page faults, and endless unmapping loops */
 	mapping->truncate_count++;
+	/*
+	 * For archs where spin_lock has inclusive semantics like ia64
+	 * this smp_mb() will prevent to read pagetable contents
+	 * before the truncate_count increment is visible to
+	 * other cpus.
+	 */
+	smp_mb();
 	if (unlikely(is_restart_addr(mapping->truncate_count))) {
 		if (mapping->truncate_count == 0)
 			reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		sequence = mapping->truncate_count;
+		smp_rmb(); /* serializes i_size against truncate_count */
 	}
 retry:
 	cond_resched();
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+	/*
+	 * No smp_rmb is needed here as long as there's a full
+	 * spin_lock/unlock sequence inside the ->nopage callback
+	 * (for the pagecache lookup) that acts as an implicit
+	 * smp_mb() and prevents the i_size read to happen
+	 * after the next truncate_count read.
+	 */
 	/* no page was available -- either SIGBUS or OOM */
 	if (new_page == NOPAGE_SIGBUS)
Re: OOM fixes 2/5
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> Anyway if you leave it off by default I don't mind, with my new code
> forward ported straight from 2.4 mainline, it's possible for the first
> time to set it from userspace without having to embed knowledge on the
> kernel min_kbytes settings at boot time.

Last time we discussed this you pointed out that reserving more lowmem
from highmem-capable allocations may actually *help* things. (Tries to
remember why) By reducing inode/dentry eviction rates? I asked Martin
Bligh if he could test that on a big NUMA box, but iirc the results were
inconclusive.

Maybe it just won't make much difference. Hard to say.

> The sysctl name had to change to lowmem_reserve_ratio because its
> semantics are completely different now.

That reminds me. Documentation/filesystems/proc.txt ;)

I'll cook something up for that.
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote:
> I think it should be turned on by default. I can't recall what

I think so too, since the number of people who can be bitten by this is
certainly higher than the number of people who know the VM internals and
for what kind of workloads they need to enable this by hand to avoid
risking lockups (notably on boxes without swap, or with heavy pagetable
allocations all the time, which is not uncommon with db usage). This is
needed on x86-64 too, to prevent pagetables from locking up the dma
zone. And it's needed on x86 as well for the dma zone on <1G boxes.

Anyway if you leave it off by default I don't mind; with my new code
forward ported straight from 2.4 mainline, it's possible for the first
time to set it from userspace without having to embed knowledge of the
kernel min_kbytes settings at boot time. So if you want it off by
default, it simply means we'll guarantee it on our distro with userland.
Setting a sysctl at boot time is no big deal for us (of course we leave
it enabled by default in kernel space for older distros where userland
isn't yet aware of it).

So it's pretty much up to you; as long as we can easily fix it up in
userland it's fine with me, and I already tried a dozen times to push
mainline in what I believe to be the right direction (like I already did
in 2.4 mainline, since that same code is enabled by default in 2.4).

The sysctl name had to change to lowmem_reserve_ratio because its
semantics are completely different now.
Re: OOM fixes 2/5
Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> > >
> > > This is the forward port to 2.6 of the lowmem_reserved algorithm I
> > > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> > > like google (especially without swap) on x86 with >1G of ram, but it's
> > > needed in all sort of workloads with lots of ram on x86, it's also
> > > needed on x86-64 for dma allocations. This brings 2.6 in sync with
> > > latest 2.4.2x.
> >
> > But this patch doesn't change anything at all in the page allocation path
> > apart from renaming lots of things, does it?
> >
> > AFAICT all it does is to change the default values in the protection map.
> > It does it via a simplification, which is nice, but I can't see how it
> > fixes anything.
> >
> > Confused.
>
> It does turn on lowmem protection by default. We never reached
> an agreement about doing this though, but Andrea has shown that
> it fixes trivial OOM cases.
>
> I think it should be turned on by default. I can't recall what
> your reservations were...?

Just that it throws away a bunch of potentially usable memory. In three
years I've seen zero reports of any problems which would have been
solved by increasing the protection ratio.

Thus empirically, it appears that the number of machines which need a
non-zero protection ratio is exceedingly small. Why change the setting
on all machines for the benefit of the tiny few? Seems weird. Especially
when this problem could be solved with a few-line initscript. Ho hum.
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> > This is the forward port to 2.6 of the lowmem_reserved algorithm I
> > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> > like google (especially without swap) on x86 with >1G of ram, but it's
> > needed in all sort of workloads with lots of ram on x86, it's also
> > needed on x86-64 for dma allocations. This brings 2.6 in sync with
> > latest 2.4.2x.
>
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?
>
> AFAICT all it does is to change the default values in the protection map.
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.
>
> Confused.

It does turn on lowmem protection by default. We never reached an
agreement about doing this though, but Andrea has shown that it fixes
trivial OOM cases.

I think it should be turned on by default. I can't recall what your
reservations were...?
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:20:56PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> > This is the forward port to 2.6 of the lowmem_reserved algorithm I
> > invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> > like google (especially without swap) on x86 with >1G of ram, but it's
> > needed in all sort of workloads with lots of ram on x86, it's also
> > needed on x86-64 for dma allocations. This brings 2.6 in sync with
> > latest 2.4.2x.
>
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?

Not in the allocation path, but it rewrites the setting algorithm, so
for somebody watching it from userspace it's a completely different
thing, usable for the first time ever in 2.6. Otherwise userspace would
be required to have knowledge of the kernel internals to be able to set
it to a sane value. Plus the new init code is much cleaner too.

> AFAICT all it does is to change the default values in the protection map.
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.

Having this patch applied is a major fix. See again the google fix
thread in 2.4.1x; 2.6 is vulnerable to it again. This patch makes the
feature usable and enables the feature as well, which is definitely a
fix as far as an end user is concerned (google was the user in this
case).
Re: OOM fixes 2/5
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> This is the forward port to 2.6 of the lowmem_reserved algorithm I
> invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> like google (especially without swap) on x86 with >1G of ram, but it's
> needed in all sort of workloads with lots of ram on x86, it's also
> needed on x86-64 for dma allocations. This brings 2.6 in sync with
> latest 2.4.2x.

But this patch doesn't change anything at all in the page allocation
path apart from renaming lots of things, does it?

AFAICT all it does is to change the default values in the protection
map. It does it via a simplification, which is nice, but I can't see how
it fixes anything.

Confused.
Re: OOM fixes 2/5
Andrea Arcangeli [EMAIL PROTECTED] wrote: This is the forward port to 2.6 of the lowmem_reserved algorithm I invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads like google (especially without swap) on x86 with 1G of ram, but it's needed in all sort of workloads with lots of ram on x86, it's also needed on x86-64 for dma allocations. This brings 2.6 in sync with latest 2.4.2x. But this patch doesn't change anything at all in the page allocation path apart from renaming lots of things, does it? AFAICT all it does is to change the default values in the protection map. It does it via a simplification, which is nice, but I can't see how it fixes anything. Confused. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:20:56PM -0800, Andrew Morton wrote: Andrea Arcangeli [EMAIL PROTECTED] wrote: This is the forward port to 2.6 of the lowmem_reserved algorithm I invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads like google (especially without swap) on x86 with 1G of ram, but it's needed in all sort of workloads with lots of ram on x86, it's also needed on x86-64 for dma allocations. This brings 2.6 in sync with latest 2.4.2x. But this patch doesn't change anything at all in the page allocation path apart from renaming lots of things, does it? In the allocation path not, but it rewrites the setting algorithm, so from somebody watching it from userspace it's a completely different thing, usable for the first time ever in 2.6. Otherwise userspace would be required to have knowledge about the kernel internals to be able to set it to a sane value. Plus the new init code is much cleaner too. AFAICT all it does is to change the default values in the protection map. It does it via a simplification, which is nice, but I can't see how it fixes anything. Having this patch applied is a major fix. See again the google fix thread in 2.4.1x. 2.6 is vulnerable to it again. This patch makes the feature usable and enables the feature as well, which is definitely a fix as far as an end user is concerned (google was the user in this case). - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote: Andrea Arcangeli [EMAIL PROTECTED] wrote: This is the forward port to 2.6 of the lowmem_reserved algorithm I invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads like google (especially without swap) on x86 with 1G of ram, but it's needed in all sort of workloads with lots of ram on x86, it's also needed on x86-64 for dma allocations. This brings 2.6 in sync with latest 2.4.2x. But this patch doesn't change anything at all in the page allocation path apart from renaming lots of things, does it? AFAICT all it does is to change the default values in the protection map. It does it via a simplification, which is nice, but I can't see how it fixes anything. Confused. It does turn on lowmem protection by default. We never reached an agreement about doing this though, but Andrea has shown that it fixes trivial OOM cases. I think it should be turned on by default. I can't recall what your reservations were...? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
Nick Piggin [EMAIL PROTECTED] wrote: On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote: Andrea Arcangeli [EMAIL PROTECTED] wrote: This is the forward port to 2.6 of the lowmem_reserved algorithm I invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads like google (especially without swap) on x86 with 1G of ram, but it's needed in all sort of workloads with lots of ram on x86, it's also needed on x86-64 for dma allocations. This brings 2.6 in sync with latest 2.4.2x. But this patch doesn't change anything at all in the page allocation path apart from renaming lots of things, does it? AFAICT all it does is to change the default values in the protection map. It does it via a simplification, which is nice, but I can't see how it fixes anything. Confused. It does turn on lowmem protection by default. We never reached an agreement about doing this though, but Andrea has shown that it fixes trivial OOM cases. I think it should be turned on by default. I can't recall what your reservations were...? Just that it throws away a bunch of potentially usable memory. In three years I've seen zero reports of any problems which would have been solved by increasing the protection ratio. Thus empirically, it appears that the number of machines which need a non-zero protection ratio is exceedingly small. Why change the setting on all machines for the benefit of the tiny few? Seems weird. Especially when this problem could be solved with a few-line initscript. Ho hum. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote: I think it should be turned on by default. I can't recall what I think it too, since the number of people that can be bitten by this is certainly higher than the number of people who knows the VM internals and for what kind of workloads they need to enable this by hand to avoid risking lockups (notably with boxes without swap or with heavy pagetable allocations all the time which is not uncommon with db usage). This is needed on x86-64 too to avoid pagetables to lockup the dma zone. Or anyways it's needed also on x86 for the dma zone on 1G boxes too. Anyway if you leave it off by default I don't mind, with my new code forward ported stright from 2.4 mainline, it's possible for the first time to set it from userspace without having to embed knowledge on the kernel min_kbytes settings at boot time. So if you want it down by default it simply means we'll guarantee it on our distro with userland. Setting a sysctl at boot time is no big deal for us (of course leaving it enabled by default in kernel space is older distro where userland isn't yet aware about it). So it's pretty much up to you, as long as we can easily fixup in userland is fine with me and I already tried a dozen times to push mainline in what I believe to be the right direction (like I already did in 2.4 mainline since that same code is enabled by default in 2.4). The sysctl name had to change to lowmem_reserve_ratio because its semantics are completely different now. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
Andrea Arcangeli [EMAIL PROTECTED] wrote: Anyway if you leave it off by default I don't mind, with my new code forward ported stright from 2.4 mainline, it's possible for the first time to set it from userspace without having to embed knowledge on the kernel min_kbytes settings at boot time. Last time we dicsussed this you pointed out that reserving more lowmem from highmem-capable allocations may actually *help* things. (Tries to remember why) By reducing inode/dentry eviction rates? I asked Martin Bligh if he could test that on a big NUMA box but iirc the results were inconclusive. Maybe it just won't make much difference. Hard to say. The sysctl name had to change to lowmem_reserve_ratio because its semantics are completely different now. That reminds me. Documentation/filesystems/proc.txt ;) I'll cook something up for that. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote: Thus empirically, it appears that the number of machines which need a non-zero protection ratio is exceedingly small. Why change the setting on all machines for the benefit of the tiny few? Seems weird. Especially when this problem could be solved with a few-line initscript. Ho hum. It's up to you, IMHO you're doing a mistake, but I don't mind as long as our customers aren't at risk of early oom kills (or worse kernel crashes) with some db load (especially without swap the risk is huge for all users, since all anonymous memory will be pinned like ptes, but with ~3G of pagetables they're at risk even with swap). At least you *must* admit that without my patch applied as I posted, there's a 0 probabity of running out of normal zone which will lead to an oom-kill or a deadlock despite 10G of highmem might still be freeeable (like with clean cache). And my patch obviously cannot make it impossible to run out of normal zone, since there's only 800m of normal zone and one can open more files than what fits in normal zone, but at least it gives the user the security that a certain workload can run reliably. Without this patch there's no guarantee at all that any workload will run when 1G of ptes is allocated. This below fix as well is needed and you won't find reports of people reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I know you were working on it (you said not in the weekend IIRC), but I've been upgraded to latest bk so I had to fixup quickly or I would have to run the racy code on my smp systems to test new kernels. 
From: Andrea Arcangeli [EMAIL PROTECTED] Subject: fixup smp race introduced in 2.6.11-rc1 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] --- x/mm/memory.c.~1~ 2005-01-21 06:58:14.747335048 +0100 +++ x/mm/memory.c 2005-01-21 07:16:15.318063328 +0100 @@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_ spin_lock(mapping-i_mmap_lock); + /* serialize i_size write against truncate_count write */ + smp_wmb(); /* Protect against page faults, and endless unmapping loops */ mapping-truncate_count++; + /* +* For archs where spin_lock has inclusive semantics like ia64 +* this smp_mb() will prevent to read pagetable contents +* before the truncate_count increment is visible to +* other cpus. +*/ + smp_mb(); if (unlikely(is_restart_addr(mapping-truncate_count))) { if (mapping-truncate_count == 0) reset_vma_truncate_counts(mapping); @@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct if (vma-vm_file) { mapping = vma-vm_file-f_mapping; sequence = mapping-truncate_count; + smp_rmb(); /* serializes i_size against truncate_count */ } retry: cond_resched(); new_page = vma-vm_ops-nopage(vma, address PAGE_MASK, ret); + /* +* No smp_rmb is needed here as long as there's a full +* spin_lock/unlock sequence inside the -nopage callback +* (for the pagecache lookup) that acts as an implicit +* smp_mb() and prevents the i_size read to happen +* after the next truncate_count read. +*/ /* no page was available -- either SIGBUS or OOM */ if (new_page == NOPAGE_SIGBUS) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
Andrew Morton [EMAIL PROTECTED] writes: Just that it throws away a bunch of potentially usable memory. In three years I've seen zero reports of any problems which would have been solved by increasing the protection ratio. We ran into a big problem with this on x86-64. The SUSE installer would load the floppy driver during installation. Floppy driver would try to allocate some pages with GFP_DMA and on a small memory x86-64 system (256-512MB) the OOM killer would always start to kill things trying to free some DMA pages. This was quite a show stopper because you effectively couldn't install. So at least for GFP_DMA it seems to be definitely needed. -Andi - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Thu, 2005-01-20 at 22:46 -0800, Andrew Morton wrote: Nick Piggin [EMAIL PROTECTED] wrote: It does turn on lowmem protection by default. We never reached an agreement about doing this though, but Andrea has shown that it fixes trivial OOM cases. I think it should be turned on by default. I can't recall what your reservations were...? Just that it throws away a bunch of potentially usable memory. In three years I've seen zero reports of any problems which would have been solved by increasing the protection ratio. Thus empirically, it appears that the number of machines which need a non-zero protection ratio is exceedingly small. Why change the setting on all machines for the benefit of the tiny few? Seems weird. Especially when this problem could be solved with a few-line initscript. Ho hum. That is true, but it should not reserve a great deal of memory on small memory machines. ZONE_NORMAL reservation may not even be too noticeable as you'll usually have ZONE_NORMAL allocations during the course of normal running. Although it is true that there haven't been many problems attributed to this, one example I can remember is when we fixed the __alloc_pages watermark code, we fixed a bug that was reserving much more ZONE_DMA than it was supposed to. This cased all those page allocation failure problems. So we raised the atomic reserve, but that didn't bring ZONE_DMA reservation back to its previous levels. So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is: 2.6.8 | 465 dma, 117 norm, 582 tot = 2328K 2.6.10-rc | 2 dma, 146 norm, 148 tot = 592K patch | 12 dma, 500 norm, 512 tot = 2048K So we were still seeing GFP_DMA allocation failures in the sound code. You recently had to make that NOWARN to shut it up. OK this is a fairly lame example... but the current code is more or less just lucky that ZONE_DMA doesn't usually fill up with pinned mem on machines that need explicit ZONE_DMA allocations. Find local movie times and trailers on Yahoo! Movies. 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM fixes 2/5
On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem
> from highmem-capable allocations may actually *help* things. (Tries to
> remember why) By reducing inode/dentry eviction rates? I asked Martin
> Bligh if he could test that on a big NUMA box but iirc the results were
> inconclusive.

This is correct: by guaranteeing more memory to be freeable in lowmem (ptes aren't freeable without a sigkill, for example), the icache/dcache will at least have a margin where it can grow independently from highmem allocations.

> Maybe it just won't make much difference. Hard to say.

I don't know myself if it makes a performance difference; all the old benchmarks have been run with this applied. This was applied for correctness (i.e. to avoid sigkills or lockups), not for performance. But I don't see how it could hurt performance (especially given the current code already does the check at runtime, which is practically the only fast-path cost ;).

> > The sysctl name had to change to lowmem_reserve_ratio because its
> > semantics are completely different now.
>
> That reminds me. Documentation/filesystems/proc.txt ;)

Whoops, forgot about it ;)

> I'll cook something up for that.

Thanks. If you prefer, I can write it too to relieve you of this load; it's up to you. If you want to fix it yourself, go ahead of course ;)
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone, all slab cache will be a memory pin (like ptes for highmem, but not that many people run with 3G of ram in ptes, and I guess the ones doing it aren't normally using a mainline kernel in the first place, so they're likely not running into it either). Slab cache pinning the normal zone, by contrast, is more likely to be reproduced on l-k in random usages.
Re: OOM fixes 2/5
On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus, if you add a pci32 zone, it'll be needed for it too on x86-64, like for the normal zone on x86, since ptes will go in highmem while pci32 allocations will not. So while floppy might be fixed, this issue would remain for the brand new pci32 zone needed by some devices (i.e. nvidia, so not such an unlikely corner case).