Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus, if you add a pci32 zone, it'll be needed for that too on
x86-64, like for the normal zone on x86, since ptes will go in highmem
while pci32 allocations will not. So while the floppy might be fixed,
the issue would remain for a brand new pci32 zone needed by some devices
(e.g. nvidia, so not such an unlikely corner case).


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone all slab cache will be a memory pin (like ptes for
highmem, though not that many people run with 3G of ram in ptes, and I
guess the ones doing so aren't normally using a mainline kernel in the
first place, so they're likely not running into it either). Slab cache
pinning the normal zone, on the other hand, has a better chance of being
reproduced on l-k in random usages.


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem from
> highmem-capable allocations may actually *help* things.  (Tries to remember
> why) By reducing inode/dentry eviction rates?  I asked Martin Bligh if he
> could test that on a big NUMA box but iirc the results were inconclusive.

This is correct: by guaranteeing more memory to be freeable in lowmem
(ptes, for example, aren't freeable without a sigkill), the icache/dcache
will at least have a margin where they can grow independently of highmem
allocations.

> Maybe it just won't make much difference.  Hard to say.

I don't know myself whether it makes a performance difference; all the
old benchmarks have been run with this applied. It was applied for
correctness (i.e. to avoid sigkills or lockups), not for performance.
But I don't see how it could hurt performance (especially given that the
current code already does the check at runtime, which is practically the
only fast-path cost ;).
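
(As a reading aid: the "fast-path cost" mentioned above is one extra addition
and comparison per candidate zone. The stand-alone, user-space C sketch below
is modeled on the patched zone_watermark_ok() but is not the kernel code; the
fake_* names and all numbers are made up for illustration.)

#include <stdbool.h>
#include <stdio.h>

struct fake_zone {
	long free_pages;
	long watermark;			/* e.g. pages_low */
	long lowmem_reserve[3];		/* indexed by the allocating classzone */
};

/* One extra addition and one extra comparison compared with the old check. */
static bool fake_watermark_ok(const struct fake_zone *z, int order,
			      int classzone_idx)
{
	long free = z->free_pages - (1L << order) + 1;

	return free > z->watermark + z->lowmem_reserve[classzone_idx];
}

int main(void)
{
	/* ZONE_NORMAL on a rough 1G box with lowmem_reserve_ratio = 256 32 */
	struct fake_zone normal = {
		.free_pages = 2000,
		.watermark = 500,
		.lowmem_reserve = { 0, 0, 1792 },	/* 224M/32 held back from HIGHMEM allocs */
	};

	printf("GFP_KERNEL  order-0 may use ZONE_NORMAL: %d\n",
	       fake_watermark_ok(&normal, 0, 1));
	printf("GFP_HIGHMEM order-0 may use ZONE_NORMAL: %d\n",
	       fake_watermark_ok(&normal, 0, 2));
	return 0;
}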

> >  The sysctl name had to change to lowmem_reserve_ratio because its
> >  semantics are completely different now.
> 
> That reminds me.  Documentation/filesystems/proc.txt ;)

Whoops, I'd forgotten about it ;)

> I'll cook something up for that.

Thanks. If you prefer, I can write it instead to relieve you of this load;
it's up to you. If you want to fix it yourself, go ahead of course ;)


Re: OOM fixes 2/5

2005-01-20 Thread Nick Piggin
On Thu, 2005-01-20 at 22:46 -0800, Andrew Morton wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:

> > It does turn on lowmem protection by default. We never reached
> > an agreement about doing this though, but Andrea has shown that
> > it fixes trivial OOM cases.
> > 
> > I think it should be turned on by default. I can't recall what
> > your reservations were...?
> > 
> 
> Just that it throws away a bunch of potentially usable memory.  In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.
> 
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small.  Why change the setting on
> all machines for the benefit of the tiny few?  Seems weird.  Especially
> when this problem could be solved with a few-line initscript.  Ho hum.


That is true, but it should not reserve a great deal of memory on
small-memory machines. The ZONE_NORMAL reservation may not even be very
noticeable, as you'll usually have ZONE_NORMAL allocations during the
course of normal running.

Although it is true that there haven't been many problems attributed
to this, one example I can remember: when we fixed the __alloc_pages
watermark code, we fixed a bug that had been reserving much more ZONE_DMA
than it was supposed to. That caused all those page allocation failure
problems. So we raised the atomic reserve, but that didn't bring the
ZONE_DMA reservation back to its previous levels.

"So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

2.6.8      | 465 dma, 117 norm, 582 tot = 2328K
2.6.10-rc  |   2 dma, 146 norm, 148 tot =  592K
patch      |  12 dma, 500 norm, 512 tot = 2048K"

So we were still seeing GFP_DMA allocation failures in the sound code.
You recently had to make that NOWARN to shut it up.

OK this is a fairly lame example... but the current code is more or
less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
on machines that need explicit ZONE_DMA allocations.





Re: OOM fixes 2/5

2005-01-20 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:

> Just that it throws away a bunch of potentially usable memory.  In three
> years I've seen zero reports of any problems which would have been solved
> by increasing the protection ratio.

We ran into a big problem with this on x86-64. The SUSE installer
would load the floppy driver during installation. The floppy driver would
try to allocate some pages with GFP_DMA, and on a small-memory x86-64
system (256-512MB) the OOM killer would always start killing things in an
attempt to free some DMA pages. This was quite a show stopper
because you effectively couldn't install.

So at least for GFP_DMA it seems to be definitely needed.

-Andi


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small.  Why change the setting on
> all machines for the benefit of the tiny few?  Seems weird.  Especially
> when this problem could be solved with a few-line initscript.  Ho hum.

It's up to you; IMHO you're making a mistake, but I don't mind as long as our
customers aren't at risk of early oom kills (or worse, kernel crashes)
with some db loads (especially without swap the risk is huge for all
users, since all anonymous memory will be pinned like ptes, but with ~3G
of pagetables they're at risk even with swap).  At least you *must*
admit that without my patch applied as I posted it, there's a >0 probability
of running out of normal zone, which will lead to an oom-kill or a
deadlock even though 10G of highmem might still be freeable (like with
clean cache). And my patch obviously cannot make it impossible to run
out of normal zone, since there's only 800m of normal zone and one can
open more files than fit in the normal zone, but at least it gives the
user the assurance that a certain workload can run reliably. Without this
patch there's no guarantee at all that any workload will run when >1G of
ptes is allocated.

The fix below is needed as well, and you won't find reports of people
reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I
know you were working on it (you said not over the weekend, IIRC), but I've
upgraded to the latest bk, so I had to fix it up quickly or I would have
had to run the racy code on my smp systems to test new kernels.

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fixup smp race introduced in 2.6.11-rc1

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/memory.c.~1~   2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c   2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_
 
	spin_lock(&mapping->i_mmap_lock);
 
+   /* serialize i_size write against truncate_count write */
+   smp_wmb(); 
/* Protect against page faults, and endless unmapping loops */
mapping->truncate_count++;
+   /*
+* For archs where spin_lock has inclusive semantics like ia64
+* this smp_mb() will prevent to read pagetable contents
+* before the truncate_count increment is visible to
+* other cpus.
+*/
+   smp_mb();
if (unlikely(is_restart_addr(mapping->truncate_count))) {
if (mapping->truncate_count == 0)
reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct 
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
sequence = mapping->truncate_count;
+   smp_rmb(); /* serializes i_size against truncate_count */
}
 retry:
cond_resched();
	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+   /*
+* No smp_rmb is needed here as long as there's a full
+* spin_lock/unlock sequence inside the ->nopage callback
+* (for the pagecache lookup) that acts as an implicit
+* smp_mb() and prevents the i_size read to happen
+* after the next truncate_count read.
+*/
 
/* no page was available -- either SIGBUS or OOM */
if (new_page == NOPAGE_SIGBUS)
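
(As a reading aid: a stand-alone user-space C11 sketch of the ordering the
barriers above enforce between the truncate side and the fault side. The
fake_* names are made up, and the C11 fences only approximate the kernel's
smp_wmb()/smp_mb()/smp_rmb(); this is not the kernel code.)

#include <stdatomic.h>
#include <stdio.h>

static atomic_long fake_i_size;		/* stands in for inode->i_size */
static atomic_uint fake_truncate_count;	/* stands in for mapping->truncate_count */

/* Truncate side: publish the new size strictly before bumping the counter. */
static void fake_truncate(long new_size)
{
	atomic_store_explicit(&fake_i_size, new_size, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* plays the role of smp_wmb() */
	atomic_fetch_add_explicit(&fake_truncate_count, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* plays the role of smp_mb() */
	/* ... the pages covered by the truncation would be unmapped here ... */
}

/* Fault side: sample the counter first, only then look at the size. */
static void fake_fault(void)
{
	unsigned int seq = atomic_load_explicit(&fake_truncate_count,
						memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* plays the role of smp_rmb() */
	long size = atomic_load_explicit(&fake_i_size, memory_order_relaxed);

	printf("truncate_count=%u, i_size=%ld\n", seq, size);
	/* The real do_no_page() retries if truncate_count changed meanwhile. */
}

int main(void)
{
	fake_truncate(4096);
	fake_fault();
	return 0;
}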



Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> Anyway if you leave it off by default I don't mind, with my new code
>  forward ported straight from 2.4 mainline, it's possible for the first
>  time to set it from userspace without having to embed knowledge on the
>  kernel min_kbytes settings at boot time.

Last time we discussed this you pointed out that reserving more lowmem from
highmem-capable allocations may actually *help* things.  (Tries to remember
why) By reducing inode/dentry eviction rates?  I asked Martin Bligh if he
could test that on a big NUMA box but iirc the results were inconclusive.

Maybe it just won't make much difference.  Hard to say.

>  The sysctl name had to change to lowmem_reserve_ratio because its
>  semantics are completely different now.

That reminds me.  Documentation/filesystems/proc.txt ;)

I'll cook something up for that.


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote:
> I think it should be turned on by default. I can't recall what

I think so too, since the number of people that can be bitten by this is
certainly higher than the number of people who know the VM internals and
for what kind of workloads they need to enable this by hand to avoid
risking lockups (notably on boxes without swap, or with heavy pagetable
allocations all the time, which is not uncommon with db usage).

This is needed on x86-64 too, to prevent pagetables from locking up the
dma zone. And anyway it's also needed on x86 for the dma zone on <1G
boxes.

Anyway, if you leave it off by default I don't mind; with my new code,
forward ported straight from 2.4 mainline, it's possible for the first
time to set it from userspace without having to embed knowledge of the
kernel min_kbytes settings at boot time. So if you want it off by
default, it simply means we'll guarantee it on our distro with userland.
Setting a sysctl at boot time is no big deal for us (of course leaving
it enabled by default in kernel space would help older distros where
userland isn't yet aware of it). So it's pretty much up to you; as long
as we can easily fix it up in userland it's fine with me, and I've
already tried a dozen times to push mainline in what I believe to be the
right direction (like I already did in 2.4 mainline, since that same
code is enabled by default in 2.4).

The sysctl name had to change to lowmem_reserve_ratio because its
semantics are completely different now.


Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton
Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> > >
> > >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> > >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> > >  like google (especially without swap) on x86 with >1G of ram, but it's
> > >  needed in all sorts of workloads with lots of ram on x86, it's also
> > >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> > >  latest 2.4.2x.
> > 
> > But this patch doesn't change anything at all in the page allocation path
> > apart from renaming lots of things, does it?
> > 
> > AFAICT all it does is to change the default values in the protection map. 
> > It does it via a simplification, which is nice, but I can't see how it
> > fixes anything.
> > 
> > Confused.
> 
> 
> It does turn on lowmem protection by default. We never reached
> an agreement about doing this though, but Andrea has shown that
> it fixes trivial OOM cases.
> 
> I think it should be turned on by default. I can't recall what
> your reservations were...?
> 

Just that it throws away a bunch of potentially usable memory.  In three
years I've seen zero reports of any problems which would have been solved
by increasing the protection ratio.

Thus empirically, it appears that the number of machines which need a
non-zero protection ratio is exceedingly small.  Why change the setting on
all machines for the benefit of the tiny few?  Seems weird.  Especially
when this problem could be solved with a few-line initscript.  Ho hum.


Re: OOM fixes 2/5

2005-01-20 Thread Nick Piggin
On Thu, 2005-01-20 at 22:20 -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> >  like google (especially without swap) on x86 with >1G of ram, but it's
> >  needed in all sorts of workloads with lots of ram on x86, it's also
> >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> >  latest 2.4.2x.
> 
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?
> 
> AFAICT all it does is to change the default values in the protection map. 
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.
> 
> Confused.


It does turn on lowmem protection by default. We never reached
an agreement about doing this though, but Andrea has shown that
it fixes trivial OOM cases.

I think it should be turned on by default. I can't recall what
your reservations were...?






Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 10:20:56PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> >  This is the forward port to 2.6 of the lowmem_reserved algorithm I
> >  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
> >  like google (especially without swap) on x86 with >1G of ram, but it's
> >  needed in all sorts of workloads with lots of ram on x86, it's also
> >  needed on x86-64 for dma allocations. This brings 2.6 in sync with
> >  latest 2.4.2x.
> 
> But this patch doesn't change anything at all in the page allocation path
> apart from renaming lots of things, does it?

Not in the allocation path, but it rewrites the setting algorithm, so to
somebody watching it from userspace it's a completely different
thing, usable for the first time ever in 2.6. Otherwise userspace would
be required to have knowledge of the kernel internals to be able to
set it to a sane value. Plus the new init code is much cleaner too.
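
(As a reading aid: a stand-alone user-space C sketch of how a ratio-style
sysctl can be turned into per-zone reserves. It is assumed to mirror the
spirit of the 2.6 setup_per_zone_lowmem_reserve() logic and is not the kernel
code; with the 256/32 defaults and a rough 1G x86 layout it reproduces the
numbers given in the comment of the patch posted in this thread: 784M/256
reserved in ZONE_DMA against NORMAL allocations, 224M/32 in ZONE_NORMAL and
(224M+784M)/256 in ZONE_DMA against HIGHMEM allocations.)

#include <stdio.h>

#define FAKE_NR_ZONES 3	/* DMA, NORMAL, HIGHMEM on 2.6-era x86 */

int main(void)
{
	/* Roughly a 1G x86 box, in 4K pages: 16M DMA, 784M NORMAL, 224M HIGHMEM. */
	unsigned long present[FAKE_NR_ZONES] = { 4096, 200704, 57344 };
	unsigned long reserve[FAKE_NR_ZONES][FAKE_NR_ZONES] = { { 0 } };
	unsigned long ratio[FAKE_NR_ZONES - 1] = { 256, 32 };	/* the sysctl defaults */
	int classzone, idx;

	/* Each lower zone holds back "pages sitting in the zones above it" / ratio. */
	for (classzone = 0; classzone < FAKE_NR_ZONES; classzone++) {
		unsigned long higher_pages = present[classzone];

		for (idx = classzone - 1; idx >= 0; idx--) {
			reserve[idx][classzone] = higher_pages / ratio[idx];
			higher_pages += present[idx];
		}
	}

	for (idx = 0; idx < FAKE_NR_ZONES; idx++)
		for (classzone = 0; classzone < FAKE_NR_ZONES; classzone++)
			if (reserve[idx][classzone])
				printf("zone %d keeps %lu pages (%lu KB) free against classzone-%d allocations\n",
				       idx, reserve[idx][classzone],
				       reserve[idx][classzone] * 4, classzone);
	return 0;
}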

> AFAICT all it does is to change the default values in the protection map. 
> It does it via a simplification, which is nice, but I can't see how it
> fixes anything.

Having this patch applied is a major fix. See again the google fix
thread in 2.4.1x; 2.6 is vulnerable to it again. This patch makes the
feature usable and enables it as well, which is definitely a fix as far
as an end user is concerned (google was the user in this case).


Re: OOM fixes 2/5

2005-01-20 Thread Andrew Morton
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
>  This is the forward port to 2.6 of the lowmem_reserved algorithm I
>  invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
>  like google (especially without swap) on x86 with >1G of ram, but it's
>  needed in all sorts of workloads with lots of ram on x86, it's also
>  needed on x86-64 for dma allocations. This brings 2.6 in sync with
>  latest 2.4.2x.

But this patch doesn't change anything at all in the page allocation path
apart from renaming lots of things, does it?

AFAICT all it does is to change the default values in the protection map. 
It does it via a simplification, which is nice, but I can't see how it
fixes anything.

Confused.


OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: keep balance between different classzones

This is the forward port to 2.6 of the lowmem_reserved algorithm I
invented in 2.4.1*, merged in 2.4.2x already and needed to fix workloads
like google (especially without swap) on x86 with >1G of ram, but it's
needed in all sorts of workloads with lots of ram on x86, it's also
needed on x86-64 for dma allocations. This brings 2.6 in sync with
latest 2.4.2x.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- mainline-2/include/linux/mmzone.h.orig  2005-01-15 20:45:00.0 +0100
+++ mainline-2/include/linux/mmzone.h   2005-01-21 05:55:28.644869648 +0100
@@ -112,18 +112,14 @@ struct zone {
unsigned long   free_pages;
unsigned long   pages_min, pages_low, pages_high;
/*
-* protection[] is a pre-calculated number of extra pages that must be
-* available in a zone in order for __alloc_pages() to allocate memory
-* from the zone. i.e., for a GFP_KERNEL alloc of "order" there must
-* be "(1<<order) + protection[ZONE_NORMAL]" free pages in the zone
-* for us to choose to allocate the page from that zone.
-*
-* It uses both min_free_kbytes and sysctl_lower_zone_protection.
-* The protection values are recalculated if either of these values
-* change.  The array elements are in zonelist order:
-*  [0] == GFP_DMA, [1] == GFP_KERNEL, [2] == GFP_HIGHMEM.
+* We don't know if the memory that we're going to allocate will be freeable
+* or/and it will be released eventually, so to avoid totally wasting several
+* GB of ram we must reserve some of the lower zone memory (otherwise we risk
+* to run OOM on the lower zones despite there's tons of freeable ram
+* on the higher zones). This array is recalculated at runtime if the
+* sysctl_lowmem_reserve_ratio sysctl changes.
 */
-   unsigned long   protection[MAX_NR_ZONES];
+   unsigned long   lowmem_reserve[MAX_NR_ZONES];

 struct per_cpu_pageset  pageset[NR_CPUS];

@@ -368,7 +364,8 @@ struct ctl_table;
 struct file;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int, struct file *,
 void __user *, size_t *, loff_t *);
-int lower_zone_protection_sysctl_handler(struct ctl_table *, int, struct file *,
+extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
+int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, struct file *,
 void __user *, size_t *, loff_t *);

 #include <linux/topology.h>
--- mainline-2/include/linux/sysctl.h.orig  2005-01-15 20:45:00.0 +0100
+++ mainline-2/include/linux/sysctl.h   2005-01-21 05:55:28.646869344 +0100
@@ -160,7 +160,7 @@ enum
VM_PAGEBUF=17,  /* struct: Control pagebuf parameters */
VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */
VM_SWAPPINESS=19,   /* Tendency to steal mapped memory */
-   VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
+   VM_LOWMEM_RESERVE_RATIO=20,/* reservation ratio for lower memory zones */
VM_MIN_FREE_KBYTES=21,  /* Minimum free kilobytes to maintain */
VM_MAX_MAP_COUNT=22,/* int: Maximum number of mmaps/address-space */
VM_LAPTOP_MODE=23,  /* vm laptop mode */
--- mainline-2/kernel/sysctl.c.orig 2005-01-15 20:45:00.0 +0100
+++ mainline-2/kernel/sysctl.c  2005-01-21 05:55:28.648869040 +0100
@@ -61,7 +61,6 @@ extern int core_uses_pid;
 extern char core_pattern[];
 extern int cad_pid;
 extern int pid_max;
-extern int sysctl_lower_zone_protection;
 extern int min_free_kbytes;
 extern int printk_ratelimit_jiffies;
 extern int printk_ratelimit_burst;
@@ -745,14 +744,13 @@ static ctl_table vm_table[] = {
 },
 #endif
{
-   .ctl_name   = VM_LOWER_ZONE_PROTECTION,
-   .procname   = "lower_zone_protection",
-   .data   = &sysctl_lower_zone_protection,
-   .maxlen = sizeof(sysctl_lower_zone_protection),
+   .ctl_name   = VM_LOWMEM_RESERVE_RATIO,
+   .procname   = "lowmem_reserve_ratio",
+   .data   = &sysctl_lowmem_reserve_ratio,
+   .maxlen = sizeof(sysctl_lowmem_reserve_ratio),
 .mode   = 0644,
-   .proc_handler   = &lower_zone_protection_sysctl_handler,
+   .proc_handler   = &lowmem_reserve_ratio_sysctl_handler,
 .strategy   = &sysctl_intvec,
-   .extra1 = &zero,
},
{
.ctl_name   = VM_MIN_FREE_KBYTES,
--- mainline-2/mm/page_alloc.c.orig 2005-01-15 20:45:00.0 +0100
+++ mainline-2/mm/page_alloc.c  2005-01-21 05:58:53.338751448 +0100
@@ -44,7 +44,15 @@ struct pglist_data *pgdat_list;
 unsigned long totalram_pages;
 unsigned long totalhigh_pages;
 long nr_swap_pages;
-int sysctl_lower_zone_protection = 0;
+/*
+ * results with 256, 32 in the lowmem_reserve sysctl:
+ * 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
+ * 1G machine -> (16M dma, 784M normal, 224M high)
+ * NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
+ * HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
+ * HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
+ */
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };
 
 EXPORT_SYMBOL(totalram_pages);
 EXPORT_SYMBOL(nr_swap_pages);
@@ -654,7 +662,7 @@ buffered_rmqueue(struct zone *zone, int 
  * of the allocation.
  */
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-   int alloc_type, int can_try_harder, int gfp_high)
+ int classzone_idx, int can_try_harder, int gfp_high)
 {
/* free_pages my go negative - that's OK */
long min = mark, free_pages = z->free_pages - (1 << order) + 1;
@@ -665,7 +673,7 @@ int zone_watermark_ok(struct zone *z, in
if (can_try_harder)
min -= min / 4;
 
-   if (free_pages <= min + z->protection[alloc_type])
+   if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
@@ -682,19 +690,6 @@ int zone_watermark_ok(struct zone *z, in
 
 /*
  * This is the 'heart' of the zoned buddy allocator.
- *
- * Herein lies the mysterious 
