Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-23 Thread Benjamin Herrenschmidt

> Thanks for the warning, Ben, but I don't see a problem there: that's
> in your separate ioremap_mm, which is rather like init_mm, and won't
> ever go through exit_mmap, and doesn't need its page tables freed -
> isn't that right?

Right.

> But it was worth auditing the different architectures for this: there
> seems to be just one problem, where the x86_64 32-bit vsyscall page
> is mapped into userspace by __map_syscall32 without linking a real
> vma into mm.  Which Andi has already marked with his "RED-PEN".

The ppc64 vDSO is mapped with a real VMA bot not mmap call (the vDMA is
built from scratch from binfmt_elf, or rather from an arch callback
issued by binfmt_elf, like the stack VMA). Though should be fine too
though but you may want to double check.

> That's not something I can fix up in a hurry.  Yes, as the comment
> suggests we should probably allocate a real vma for it, but that might
> have a few consequences (if only /proc//maps showing two vdsos?).
> I'll have to let someone else deal with that.

Why 2 ? we map the 32 bits one for 32 bits processes and the 64 bits one
for 64 bits processes on ppc64 without problem ... In fact, Andi could
even re-use our hook I suppose. The way I do it allows also for free
copy-on-write semantics (with mprotect though, I don't set it writeable
by default) so that gdb can put breakpoints in it, etc...

> For the moment, I think the behaviour of x86_64 32bit-support with
> the freepgt patches will depend on personality - ADDR_LIMIT_32BIT
> should usually work fine (unless the app moves its stack elsewhere
> and munmaps the old one: that might well unmap its vdso too); but
> ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or
> its /proc//maps has been examined).  I don't know how common
> ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for
> including the patches in a release.
> 
> Hugh
-- 
Benjamin Herrenschmidt <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-23 Thread Hugh Dickins
On Wed, 23 Mar 2005, Benjamin Herrenschmidt wrote:
> On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote:
> > 
> > I cannot see those arches doing pte_allocs outside their vmas,
> > that of course could cause it.  And nr_ptes is initialized to 0
> > once by memset and again by assignment, so it should be starting
> > out even zeroer than most fields.
> 
> We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we
> don't use VMAs but our own mecanism (yah, ugly, but that's some legacy
> we have from the original port, though I do intend to change that at one
> point).

Thanks for the warning, Ben, but I don't see a problem there: that's
in your separate ioremap_mm, which is rather like init_mm, and won't
ever go through exit_mmap, and doesn't need its page tables freed -
isn't that right?

But it was worth auditing the different architectures for this: there
seems to be just one problem, where the x86_64 32-bit vsyscall page
is mapped into userspace by __map_syscall32 without linking a real
vma into mm.  Which Andi has already marked with his "RED-PEN".

That's not something I can fix up in a hurry.  Yes, as the comment
suggests we should probably allocate a real vma for it, but that might
have a few consequences (if only /proc//maps showing two vdsos?).
I'll have to let someone else deal with that.

For the moment, I think the behaviour of x86_64 32bit-support with
the freepgt patches will depend on personality - ADDR_LIMIT_32BIT
should usually work fine (unless the app moves its stack elsewhere
and munmaps the old one: that might well unmap its vdso too); but
ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or
its /proc//maps has been examined).  I don't know how common
ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for
including the patches in a release.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-23 Thread Hugh Dickins
On Wed, 23 Mar 2005, Benjamin Herrenschmidt wrote:
 On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote:
  
  I cannot see those arches doing pte_allocs outside their vmas,
  that of course could cause it.  And nr_ptes is initialized to 0
  once by memset and again by assignment, so it should be starting
  out even zeroer than most fields.
 
 We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we
 don't use VMAs but our own mecanism (yah, ugly, but that's some legacy
 we have from the original port, though I do intend to change that at one
 point).

Thanks for the warning, Ben, but I don't see a problem there: that's
in your separate ioremap_mm, which is rather like init_mm, and won't
ever go through exit_mmap, and doesn't need its page tables freed -
isn't that right?

But it was worth auditing the different architectures for this: there
seems to be just one problem, where the x86_64 32-bit vsyscall page
is mapped into userspace by __map_syscall32 without linking a real
vma into mm.  Which Andi has already marked with his RED-PEN.

That's not something I can fix up in a hurry.  Yes, as the comment
suggests we should probably allocate a real vma for it, but that might
have a few consequences (if only /proc/pid/maps showing two vdsos?).
I'll have to let someone else deal with that.

For the moment, I think the behaviour of x86_64 32bit-support with
the freepgt patches will depend on personality - ADDR_LIMIT_32BIT
should usually work fine (unless the app moves its stack elsewhere
and munmaps the old one: that might well unmap its vdso too); but
ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or
its /proc/pid/maps has been examined).  I don't know how common
ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for
including the patches in a release.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-23 Thread Benjamin Herrenschmidt

 Thanks for the warning, Ben, but I don't see a problem there: that's
 in your separate ioremap_mm, which is rather like init_mm, and won't
 ever go through exit_mmap, and doesn't need its page tables freed -
 isn't that right?

Right.

 But it was worth auditing the different architectures for this: there
 seems to be just one problem, where the x86_64 32-bit vsyscall page
 is mapped into userspace by __map_syscall32 without linking a real
 vma into mm.  Which Andi has already marked with his RED-PEN.

The ppc64 vDSO is mapped with a real VMA bot not mmap call (the vDMA is
built from scratch from binfmt_elf, or rather from an arch callback
issued by binfmt_elf, like the stack VMA). Though should be fine too
though but you may want to double check.

 That's not something I can fix up in a hurry.  Yes, as the comment
 suggests we should probably allocate a real vma for it, but that might
 have a few consequences (if only /proc/pid/maps showing two vdsos?).
 I'll have to let someone else deal with that.

Why 2 ? we map the 32 bits one for 32 bits processes and the 64 bits one
for 64 bits processes on ppc64 without problem ... In fact, Andi could
even re-use our hook I suppose. The way I do it allows also for free
copy-on-write semantics (with mprotect though, I don't set it writeable
by default) so that gdb can put breakpoints in it, etc...

 For the moment, I think the behaviour of x86_64 32bit-support with
 the freepgt patches will depend on personality - ADDR_LIMIT_32BIT
 should usually work fine (unless the app moves its stack elsewhere
 and munmaps the old one: that might well unmap its vdso too); but
 ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or
 its /proc/pid/maps has been examined).  I don't know how common
 ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for
 including the patches in a release.
 
 Hugh
-- 
Benjamin Herrenschmidt [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 13:10:42 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> The ugly thing you get with an inclusive ceiling is that your masking
> becomes more difficult I think.

Good point.

> I might try to attack this from another angle and see if I can come up
> with something.

Great, let me know if you want something tested out on sparc64.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
David S. Miller wrote:
On Tue, 22 Mar 2005 17:10:13 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

Hugh Dickins <[EMAIL PROTECTED]> wrote:
On Tue, 22 Mar 2005, Luck, Tony wrote:
> 
> But I'm still confused by all the math on addr/end at each
> level.

You think the rest of us are not ;-?
umm, given the difficulty which you guys are having with this, I get a bit
worried about clarity, simplicity and maintainability of the end result.
We're working on it, trust me :-)
I have a simplification in mind that should take care of the issue
that led us to these problems.  We should simply pass in "ceiling"
as "-1" instead of "0".  Every single test against ceiling is
really done against "ceiling - 1".
Therefore, passing ceiling in as "top - 1" and then adjusting the
tests will clean this up substantially and make is much simpler.

The ugly thing you get with an inclusive ceiling is that your masking
becomes more difficult I think.
I might try to attack this from another angle and see if I can come up
with something.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 00:51:02 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> This actual example helped to focus my mind a lot, thank you.

No problem, I needed to work through specific examples
to see things clearly too.

> > and things seem to behave.  I'll try to analyze things
> > further and test this out on a real kernel, but all of
> > these adjustments at the top of free_pgd_range() really
> > start to look like pure spaghetti. :-)
> 
> Well, it's trying to decide in reasonably few steps that it's not
> worth wasting time going down to the deeper levels.  Lots of
> "return"s as it eliminates cases, yes.

Yes, I understand.

But let's recognize (as I mention in another email) that all of
the tests against ceiling are against "ceiling - 1".  If we pass
-1 instead of 0 (and "foo - 1" instead of "foo") as the ceiling
arg, then adjust the tests to be against plain "ceiling", so much
of the special casing disappears.

There are probably other simplifications.

This is kind of what I was hinting at when I said it looks like
spaghetti.  :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 17:10:13 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> Hugh Dickins <[EMAIL PROTECTED]> wrote:
> >
> > On Tue, 22 Mar 2005, Luck, Tony wrote:
> >  > 
> >  > But I'm still confused by all the math on addr/end at each
> >  > level.
> > 
> >  You think the rest of us are not ;-?
> 
> umm, given the difficulty which you guys are having with this, I get a bit
> worried about clarity, simplicity and maintainability of the end result.

We're working on it, trust me :-)

I have a simplification in mind that should take care of the issue
that led us to these problems.  We should simply pass in "ceiling"
as "-1" instead of "0".  Every single test against ceiling is
really done against "ceiling - 1".

Therefore, passing ceiling in as "top - 1" and then adjusting the
tests will clean this up substantially and make is much simpler.

I'm even sure that other similar refactoring is possible.

Hugh took the quantum leap we needed by implementing this at all,
some spaghetti code tests do not detract from his work conceptually.
That kind of stuff can be cleaned up.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Andrew Morton
Hugh Dickins <[EMAIL PROTECTED]> wrote:
>
> On Tue, 22 Mar 2005, Luck, Tony wrote:
>  > 
>  > But I'm still confused by all the math on addr/end at each
>  > level.
> 
>  You think the rest of us are not ;-?

umm, given the difficulty which you guys are having with this, I get a bit
worried about clarity, simplicity and maintainability of the end result.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:
> 
> But I'm still confused by all the math on addr/end at each
> level.

You think the rest of us are not ;-?

> Rounding up/down at each level should presumably be
> based on the size of objects at the next level.  So the pgd
> code should round using PUD_MASK, pud should use PMD_MASK etc.
> Perhaps I missed some updates, but the version of the patch
> that I have (and the simulator) is using PMD_MASK in the
> pgd_free_range() function ... which is surely wrong.

It's confusing but not wrong (in principle).  It's trying to decide
immediately on entry whether it will be worth going down to the lower
levels: if even the lowest level will have no work to do, no point in
proceeding.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
> On Tue, 22 Mar 2005 21:51:39 + (GMT)
> Hugh Dickins <[EMAIL PROTECTED]> wrote:
> 
> > I still can't see what's wrong with the code that's already
> > there.  My brain is seizing up, I'm taking a break.
> 
> Ok, meanwhile I'll do a brain dump of what I think this
> code should be doing.
> 
> Let's take an example free_pgd_range() call.  Say the
> address parameters are:
> 
> addr  0x1
> end   0xa4000
> floor 0x0
> ceiling   0xb2000

This actual example helped to focus my mind a lot, thank you.

> (This example comes from my exit_mmap() VMA dump earlier
>  in this thread.  If you disable the VMA skipping optimization
>  the first call to free_pgd_range() has these parameters.)
> 
> What ought this free_pgd_range() call do?  This range of
> addresses, from floor to ceiling, is smaller than a PMD_SIZE
> (which on sparc64 is 1 << 23).  Therefore it should clear
> no PGD or PUD entries.

Yup, it ought to decide at the beginning of free_pgd_range
that it simply has no work to do.

> Yet, it does clear them, specifically:
> 
> free_pgd_range():
>   1) mask addr (0x1) to PMD_MASK, addr is now 0
>   2) addr < floor (0x0) test does not pass
>   3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too
>   4) end - 1 > ceiling - 1 test does not pass
>   5) addr > end - 1 test does not pass either

And now we've gone wrong, yes.

> The source of the problems seems to be how ceiling began
> at the top of the call chain as 0xb2000, but when we
> masked it with PMD_MASK that set it to zero, which means
> "top of address space" in these functions.  That's not
> what we want.
> 
> I added a quick hack to the simulator I posted, where
> we mask ceiling in free_pgd_range(), I do it like this:
> 
>   if (ceiling) {
>   ceiling &= PMD_MASK;
>   if (!ceiling)
>   return;
>   }

At first that just looked like a hack to me.  But on reflection,
no, you're doing exactly what I had to do with addr above: in the
case where we arrive at 0 from non-0 value, have to get out quick
to avoid confusion with the "other" 0.  These wrap issues are hard.

And in other mail I see you found more such checks were needed.
I believe you've got it, thank you so much!

Though frankly, by now, I'm sure of nothing:
will review in the morning.

> and things seem to behave.  I'll try to analyze things
> further and test this out on a real kernel, but all of
> these adjustments at the top of free_pgd_range() really
> start to look like pure spaghetti. :-)

Well, it's trying to decide in reasonably few steps that it's not
worth wasting time going down to the deeper levels.  Lots of
"return"s as it eliminates cases, yes.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:
> 
> Alternatively you could modify the use of floor/ceiling as they
> are passed down from the top level to indicate the progressively
> greater address ranges that have been dealt with ... but I'm not
> completely convinced that gives you enough information.  You would
> need to do careful extension of the ceiling at each level to make
> sure that you reach out to the end of of each table at each level
> to account for gaps between vmas (which would result in no future
> calldown hitting this part of the table).

That pretty much describes how it does work (when it works!)

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 11:19:38 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> > dramatically, shell performance is way down on sparc64.
> > I'll post before and after numbers in a bit.  Note, this is
> > just with Hugh's base patch plus bug fixes.
> > 
> 
> That's interesting. The only "extra" stuff Hugh's should be
> doing is the vma traversal.

Like I said in another posting, ignore this anomaly.
It's some difference in the way the shell runs
when done from "init=/bin/sh" single user vs. full
multi-user.  Full sparc64 before/after in that posting,
go check it out it's nice :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
David S. Miller wrote:
On Wed, 23 Mar 2005 10:32:10 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

I think David's on the right track - I think there's something a
bit wrong at the top. In my reply to Andrew in this thread I
posted a patch which may at least get things working...

We have to do the "if (ceiling)" check in every spot where
we mask it in some way, and if it falls to zero from non-zero
due to the masking, we skip.
That gives me a mostly working kernel.
I see. Tricky.
Bad news is that while lat_proc's fork and exec tests improve
dramatically, shell performance is way down on sparc64.
I'll post before and after numbers in a bit.  Note, this is
just with Hugh's base patch plus bug fixes.
That's interesting. The only "extra" stuff Hugh's should be
doing is the vma traversal.
The single walk patch seems to fit in quite well with Hugh's
updated patchset. I haven't quite worked out how to best do
hugepages, but it is easily possible. But I won't dust that
off again until Hugh's is nicely tested and working.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, this patch, on top of Hugh's original freepgt patch, gets
me a working system.  It includes Hugh's bug fix, plus the
ceiling masking roll-over fix of mine.

It should get ppc working too, I bet.

--- mm/memory.c.hugh2005-03-22 16:01:07.0 -0800
+++ mm/memory.c 2005-03-22 16:00:08.0 -0800
@@ -127,11 +127,6 @@
unsigned long next;
unsigned long start;
 
-   if (end - 1 > ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr > end - 1)
-   return;
-
start = addr;
pmd = pmd_offset(pud, addr);
do {
@@ -144,7 +139,11 @@
start &= PUD_MASK;
if (start < floor)
return;
-   ceiling &= PUD_MASK;
+   if (ceiling) {
+   ceiling &= PUD_MASK;
+   if (!ceiling)
+   return;
+   }
if (end - 1 > ceiling - 1)
return;
 
@@ -173,7 +172,11 @@
start &= PGDIR_MASK;
if (start < floor)
return;
-   ceiling &= PGDIR_MASK;
+   if (ceiling) {
+   ceiling &= PGDIR_MASK;
+   if (!ceiling)
+   return;
+   }
if (end - 1 > ceiling - 1)
return;
 
@@ -201,8 +204,14 @@
if (!addr)
return;
}
-   ceiling &= PMD_MASK;
-   if (addr > ceiling - 1)
+   if (ceiling) {
+   ceiling &= PMD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   end -= PMD_SIZE;
+   if (addr > end - 1)
return;
 
start = addr;
@@ -214,7 +223,7 @@
free_pud_range(tlb, pgd, addr, next, floor, ceiling);
} while (pgd++, addr = next, addr != end);
 
-   if (!tlb_is_full_mm(tlb) && start < end)
+   if (!tlb_is_full_mm(tlb))
flush_tlb_pgtables(tlb->mm, start, end);
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, here are (finally, I've been debugging this so much purely
to see these things) some lmbench numbers with Hugh's base
patch on sparc64.

Ignore my previous comments about shell performance getting
worse, it's some difference that makes things run more slowly
in single user mode compared to a fully brought up system.

First, for 32-bit tasks.

BEFORE
Process fork+exit: 171.4483 microseconds
Process fork+exit: 171.9688 microseconds
Process fork+exit: 169.2727 microseconds
Process fork+exit: 169.0333 microseconds
Process fork+exit: 165.8065 microseconds
Process fork+execve: 555.7000 microseconds
Process fork+execve: 556.6000 microseconds
Process fork+execve: 552.6000 microseconds
Process fork+execve: 557.1000 microseconds
Process fork+execve: 552. microseconds
Process fork+/bin/sh -c: 2207. microseconds
Process fork+/bin/sh -c: 2183. microseconds
Process fork+/bin/sh -c: 2179.6667 microseconds
Process fork+/bin/sh -c: 2190. microseconds
Process fork+/bin/sh -c: 2197.6667 microseconds

AFTER
Process fork+exit: 142.9487 microseconds
Process fork+exit: 147.8649 microseconds
Process fork+exit: 139.0250 microseconds
Process fork+exit: 138.9250 microseconds
Process fork+exit: 136.9268 microseconds
Process fork+execve: 478. microseconds
Process fork+execve: 479.1667 microseconds
Process fork+execve: 479.9091 microseconds
Process fork+execve: 480.1667 microseconds
Process fork+execve: 479.9091 microseconds
Process fork+/bin/sh -c: 2026. microseconds
Process fork+/bin/sh -c: 2029.6667 microseconds
Process fork+/bin/sh -c: 2044.6667 microseconds
Process fork+/bin/sh -c: 2037.6667 microseconds
Process fork+/bin/sh -c: 2028.6667 microseconds

Pretty good, now for 64-bit processes.

BEFORE
Process fork+exit: 226.5200 microseconds
Process fork+exit: 230.0417 microseconds
Process fork+exit: 223.8800 microseconds
Process fork+exit: 226.4091 microseconds
Process fork+exit: 219.3043 microseconds
Process fork+execve: 799.8571 microseconds
Process fork+execve: 806.1429 microseconds
Process fork+execve: 799.5714 microseconds
Process fork+execve: 800.8571 microseconds
Process fork+execve: 788.7143 microseconds
Process fork+/bin/sh -c: 2655. microseconds
Process fork+/bin/sh -c: 2668.5000 microseconds
Process fork+/bin/sh -c: 2649. microseconds
Process fork+/bin/sh -c: 2662.5000 microseconds
Process fork+/bin/sh -c: 2642. microseconds

AFTER
Process fork+exit: 165.1212 microseconds
Process fork+exit: 159.4571 microseconds
Process fork+exit: 160.3714 microseconds
Process fork+exit: 158.9091 microseconds
Process fork+exit: 157.2188 microseconds
Process fork+execve: 536.4545 microseconds
Process fork+execve: 542.0909 microseconds
Process fork+execve: 536.3000 microseconds
Process fork+execve: 540.6364 microseconds
Process fork+execve: 537.1818 microseconds
Process fork+/bin/sh -c: 2275. microseconds
Process fork+/bin/sh -c: 2272. microseconds
Process fork+/bin/sh -c: 2275.6667 microseconds
Process fork+/bin/sh -c: 2270. microseconds
Process fork+/bin/sh -c: 2284. microseconds

Quite nice.  It makes the 64-bit numbers on par with
the 32-bit numbers.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 15:53:08 -0800
"Luck, Tony" <[EMAIL PROTECTED]> wrote:

> But I'm still confused by all the math on addr/end at each
> level.  Rounding up/down at each level should presumably be
> based on the size of objects at the next level.  So the pgd
> code should round using PUD_MASK, pud should use PMD_MASK etc.
> Perhaps I missed some updates, but the version of the patch
> that I have (and the simulator) is using PMD_MASK in the
> pgd_free_range() function ... which is surely wrong.

PMD_MASK decides the smallest page table chunk, so we mask
it at the top level.

Look at the next level down in the call chain, the masking
maskes more sense there.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
>How it works is that it knows the extent in each direction
>where mappings do not exist.
>
>Once we know we have a clear span up to the next PMD_SIZE
>modulo (and PUD_SIZE and so on and so forth) we know we
>can liberate the page table chunks covered by such ranges.

Ok ... I see that now (I was mostly looking at free_pgtables()
and missed the conditional in the arglist that passes the
start of the next vma into free_pgd_range() as the ceiling
until we run out of vmas and pass in "ceiling".

But I'm still confused by all the math on addr/end at each
level.  Rounding up/down at each level should presumably be
based on the size of objects at the next level.  So the pgd
code should round using PUD_MASK, pud should use PMD_MASK etc.
Perhaps I missed some updates, but the version of the patch
that I have (and the simulator) is using PMD_MASK in the
pgd_free_range() function ... which is surely wrong.

-Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 10:32:10 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> I think David's on the right track - I think there's something a
> bit wrong at the top. In my reply to Andrew in this thread I
> posted a patch which may at least get things working...

We have to do the "if (ceiling)" check in every spot where
we mask it in some way, and if it falls to zero from non-zero
due to the masking, we skip.

That gives me a mostly working kernel.

Bad news is that while lat_proc's fork and exec tests improve
dramatically, shell performance is way down on sparc64.
I'll post before and after numbers in a bit.  Note, this is
just with Hugh's base patch plus bug fixes.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Benjamin Herrenschmidt
On Tue, 2005-03-22 at 12:21 -0800, David S. Miller wrote:
> On Tue, 22 Mar 2005 19:36:46 + (GMT)
> Hugh Dickins <[EMAIL PROTECTED]> wrote:
> 
> > I notice that although both i386 and sparc64 use pgtable-nopud.h, the
> > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
> 
> Aha!  And ppc does as well via asm-generic/4level-fixup.h which is
> probably why I did it this way as well when I converted sparc64
> over to asm-generic/pgtable-nopud.h
> 
> Hmmm, let me play around with this.

I converted ppc64 to no-pud.h but it seems to introduce some breakage...
X kills the box when launched on the G5, haven't had time to investigate
yet, it might be some bug of mine. Here is my current patch:


Index: linux-work/include/asm-ppc64/pgalloc.h
===
--- linux-work.orig/include/asm-ppc64/pgalloc.h 2005-03-07 14:00:11.0 
+1100
+++ linux-work/include/asm-ppc64/pgalloc.h  2005-03-17 17:00:50.0 
+1100
@@ -30,7 +30,7 @@
kmem_cache_free(zero_cache, pgd);
 }
 
-#define pgd_populate(MM, PGD, PMD) pgd_set(PGD, PMD)
+#define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
 
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
Index: linux-work/include/asm-ppc64/pgtable.h
===
--- linux-work.orig/include/asm-ppc64/pgtable.h 2005-03-07 14:00:11.0 
+1100
+++ linux-work/include/asm-ppc64/pgtable.h  2005-03-17 17:00:50.0 
+1100
@@ -1,8 +1,6 @@
 #ifndef _PPC64_PGTABLE_H
 #define _PPC64_PGTABLE_H
 
-#include 
-
 /*
  * This file contains the functions and defines necessary to modify and use
  * the ppc64 hashed page table.
@@ -17,6 +15,8 @@
 #include 
 #endif /* __ASSEMBLY__ */
 
+#include 
+
 /* PMD_SHIFT determines what a second-level page table entry can map */
 #define PMD_SHIFT  (PAGE_SHIFT + PAGE_SHIFT - 3)
 #define PMD_SIZE   (1UL << PMD_SHIFT)
@@ -216,12 +216,12 @@
 #define pmd_page_kernel(pmd)   \
(__bpn_to_ba(pmd_val(pmd) >> PMD_TO_PTEPAGE_SHIFT))
 #define pmd_page(pmd)  virt_to_page(pmd_page_kernel(pmd))
-#define pgd_set(pgdp, pmdp)(pgd_val(*(pgdp)) = (__ba_to_bpn(pmdp)))
-#define pgd_none(pgd)  (!pgd_val(pgd))
-#define pgd_bad(pgd)   ((pgd_val(pgd)) == 0)
-#define pgd_present(pgd)   (pgd_val(pgd) != 0UL)
-#define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL)
-#define pgd_page(pgd)  (__bpn_to_ba(pgd_val(pgd))) 
+#define pud_set(pgdp, pmdp)(pud_val(*(pgdp)) = (__ba_to_bpn(pmdp)))
+#define pud_none(pgd)  (!pud_val(pgd))
+#define pud_bad(pgd)   ((pud_val(pgd)) == 0)
+#define pud_present(pgd)   (pud_val(pgd) != 0UL)
+#define pud_clear(pgdp)(pud_val(*(pgdp)) = 0UL)
+#define pud_page(pgd)  (__bpn_to_ba(pud_val(pgd))) 
 
 /* 
  * Find an entry in a page-table-directory.  We combine the address region 
@@ -233,8 +233,8 @@
 #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
 
 /* Find an entry in the second-level page table.. */
-#define pmd_offset(dir,addr) \
-  ((pmd_t *) pgd_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
+#define pmd_offset(pudp,addr) \
+  ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)))
 
 /* Find an entry in the third-level page table.. */
 #define pte_offset_kernel(dir,addr) \
@@ -552,20 +552,23 @@
 static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
 {
pgd_t *pg;
+   pud_t *pu;
pmd_t *pm;
pte_t *pt = NULL;
pte_t pte;
 
pg = pgdir + pgd_index(ea);
if (!pgd_none(*pg)) {
-
-   pm = pmd_offset(pg, ea);
-   if (pmd_present(*pm)) { 
-   pt = pte_offset_kernel(pm, ea);
-   pte = *pt;
-   if (!pte_present(pte))
-   pt = NULL;
-   }
+   pu = pud_offset(pg, ea);
+   if (!pud_none(*pu)) {   
+   pm = pmd_offset(pu, ea);
+   if (pmd_present(*pm)) { 
+   pt = pte_offset_kernel(pm, ea);
+   pte = *pt;
+   if (!pte_present(pte))
+   pt = NULL;
+   }
+   }   
}
 
return pt;
Index: linux-work/arch/ppc64/mm/init.c
===
--- linux-work.orig/arch/ppc64/mm/init.c2005-03-15 11:57:29.0 
+1100
+++ linux-work/arch/ppc64/mm/init.c 2005-03-17 17:20:14.0 +1100
@@ -136,14 +136,78 @@
 
 #else
 
+static void unmap_im_area_pte(pmd_t *pmd, unsigned long addr,
+ unsigned long end)
+{
+   pte_t *pte;
+
+   pte = pte_offset_kernel(pmd, addr);
+   do {
+  

Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
Hugh Dickins wrote:
On Tue, 22 Mar 2005, David S. Miller wrote:
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

I notice that although both i386 and sparc64 use pgtable-nopud.h, the
i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
This was a dead end.  I386 doesn't do anything with pud_clear() in
order to work around a chip erratum.
IA64 does clear in pud_clear() just like sparc64.

My mind kept flipping back and forth on whether it was pud_clear().
I agree, I can't see that it's the issue now.
It shouldn't be.
In the case that pud is folded, free_pud_range will only call into
free_pmd_range once, and that function will loop over the required
range of the pud (ie. the pmd). If it then also falls through to
pud_clear in that function, it will also fall through to pgd_clear
in free_pud_range. So it doesn't _really_ matter which one does the
actual clearing in that case.
I think David's on the right track - I think there's something a
bit wrong at the top. In my reply to Andrew in this thread I
posted a patch which may at least get things working...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 14:40:55 -0800
"Luck, Tony" <[EMAIL PROTECTED]> wrote:

> Then I don't see how we decide when to clear a pointer at each
> level.  Are there counters of how many entries are active in each
> table at all levels (pgd/pud/pmd/pte)?

No, there are no counters.

How it works is that it knows the extent in each direction
where mappings do not exist.

Once we know we have a clear span up to the next PMD_SIZE
modulo (and PUD_SIZE and so on and so forth) we know we
can liberate the page table chunks covered by such ranges.

Say we are unmapping a page at some address.  The next VMA
in the address space says where the next potentially valid
mapping resides.  The previous VMA says similarly.  If this
is the first or last VMA, we use the beginning or end of
the virtual address space as our value.

Play around with my little simulator I posted, you'll see how
it works ;-)  Actually, this is the second such simulator you
have seen Tony :-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Benjamin Herrenschmidt
On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote:
> On Tue, 22 Mar 2005, Andrew Morton wrote:
> > 
> > With these six patches the ppc64 is hitting the BUG in exit_mmap():
> > 
> > BUG_ON(mm->nr_ptes);/* This is just debugging */
> > 
> > fairly early in boot.
> 
> So ppc64 is in the same boat as sparc64 (yet ia64 okay so far).
> 
> Sorry, I'm still clueless.
> 
> I cannot see those arches doing pte_allocs outside their vmas,
> that of course could cause it.  And nr_ptes is initialized to 0
> once by memset and again by assignment, so it should be starting
> out even zeroer than most fields.

We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we
don't use VMAs but our own mecanism (yah, ugly, but that's some legacy
we have from the original port, though I do intend to change that at one
point).

> I should probably be paying more attention to the repellent
> notion that my code is broken.
> 
> If you and David could try the lame patch below,
> it'll at least give us a slight clue of where to be looking -
> every mm exiting with nr_ptes 1 means something different from
> every mm exiting with nr_ptes -1 means something different from
> occasional mms exiting with nr_ptes something positive.
> 
> I'm not sure whether the patch would ever get to show a more
> interesting proc name than "?".
> 
> And does memory leak away into lost pagetables if you continue
> running, or does it actually carry on running fine, and the
> problem appear to be with the BUG_ON itself?
> 
> Thanks,
> Hugh
> 
> --- freepgt6/mm/mmap.c2005-03-22 04:28:40.0 +
> +++ testing/mm/mmap.c 2005-03-22 15:45:00.0 +
> @@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk);
>  /* Release all mmaps. */
>  void exit_mmap(struct mm_struct *mm)
>  {
> + static unsigned long good_mms, bad_mms;
>   struct mmu_gather *tlb;
>   struct vm_area_struct *vma = mm->mmap;
>   unsigned long nr_accounted = 0;
> @@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm)
>   vma = next;
>   }
>  
> - BUG_ON(mm->nr_ptes);/* This is just debugging */
> + if (mm->nr_ptes && bad_mms < 250) {
> + printk(KERN_ERR "exit_mmap: %s nr_ptes %ld good_mms %lu\n",
> + current->mm == mm? current->comm: "?",
> + (long)mm->nr_ptes, good_mms);
> + good_mms = 0;
> + bad_mms++;
> + } else
> + good_mms++;
>  }
>  
>  /* Insert vm structure into process list sorted by address
-- 
Benjamin Herrenschmidt <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 21:51:39 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> I still can't see what's wrong with the code that's already
> there.  My brain is seizing up, I'm taking a break.

Ok, meanwhile I'll do a brain dump of what I think this
code should be doing.

Let's take an example free_pgd_range() call.  Say the
address parameters are:

addr0x1
end 0xa4000
floor   0x0
ceiling 0xb2000

(This example comes from my exit_mmap() VMA dump earlier
 in this thread.  If you disable the VMA skipping optimization
 the first call to free_pgd_range() has these parameters.)

What ought this free_pgd_range() call do?  This range of
addresses, from floor to ceiling, is smaller than a PMD_SIZE
(which on sparc64 is 1 << 23).  Therefore it should clear
no PGD or PUD entries.

Yet, it does clear them, specifically:

free_pgd_range():
1) mask addr (0x1) to PMD_MASK, addr is now 0
2) addr < floor (0x0) test does not pass
3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too
4) end - 1 > ceiling - 1 test does not pass
5) addr > end - 1 test does not pass either
6) We now loop one PGDIR_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pud_range():
1) addr=0, end=0xa4000, floor=0, ceiling=0
2) We loop one PUD_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pmd_range():
1) addr=0, end=0xa4000, floor=0, ceiling=0
2) We loop one PMD_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pte_range():

And later when we finish the loops in free_pmd_range()
and free_pud_range() we do pud_clear() and pgd_clear()
respectively, both wrong.

The source of the problems seems to be how ceiling began
at the top of the call chain as 0xb2000, but when we
masked it with PMD_MASK that set it to zero, which means
"top of address space" in these functions.  That's not
what we want.

I added a quick hack to the simulator I posted, where
we mask ceiling in free_pgd_range(), I do it like this:

if (ceiling) {
ceiling &= PMD_MASK;
if (!ceiling)
return;
}

and things seem to behave.  I'll try to analyze things
further and test this out on a real kernel, but all of
these adjustments at the top of free_pgd_range() really
start to look like pure spaghetti. :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
>> be changed to use pgd_addr_end() to gather up all the vma that
>> are mapped by a single pgd instead of just spanning out the next
>> PMD_SIZE?
>
>Oh, I don't think so.  I suppose it could be done at this level,
>but then the lower levels would go back to searching through lots
>of unnecessary cachelines to find the significant entries, and
>we might as well throw out the whole set of patches (which will
>soon happen anyway if we can't find why they're not working!).

Then I don't see how we decide when to clear a pointer at each
level.  Are there counters of how many entries are active in each
table at all levels (pgd/pud/pmd/pte)?

Consider the case where two or more vmas are mapped at addresses that
share a pgd entry, but are far enough apart that you don't want
to coalesce them.  First call down through that pgd entry must
leave the pointer to the pud (or pmd/pte for systems with less
levels) so that the next call can still get to the pud/pmd/pte
to clear out entries for the other vma.  But without counters
(at all levels) you don't know when you are all done with a table,
or whether you need to leave it for the next call.

If you want to pursue this, then there are probably some fields
in the page_t for the page that is being used as a pgd/pud/pmd/pte
that are available (do all architectures allocate whole pages for
all levels of the page tree?)

Alternatively you could modify the use of floor/ceiling as they
are passed down from the top level to indicate the progressively
greater address ranges that have been dealt with ... but I'm not
completely convinced that gives you enough information.  You would
need to do careful extension of the ceiling at each level to make
sure that you reach out to the end of of each table at each level
to account for gaps between vmas (which would result in no future
calldown hitting this part of the table).

-Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
> On Tue, 22 Mar 2005 19:36:46 + (GMT)
> Hugh Dickins <[EMAIL PROTECTED]> wrote:
> 
> > I notice that although both i386 and sparc64 use pgtable-nopud.h, the
> > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
> 
> This was a dead end.  I386 doesn't do anything with pud_clear() in
> order to work around a chip erratum.
> 
> IA64 does clear in pud_clear() just like sparc64.

My mind kept flipping back and forth on whether it was pud_clear().
I agree, I can't see that it's the issue now.

> I think it's the floor/ceiling stuff.
> 
> At that pud_clear(), we do it if floor-->ceiling (after masking)
> covers the whole PUD.  Not whether start/end do, which is what
> the code sort of does right now.

Not an explanation I understood ;)

> "start" and "end" say which specific entries we might be purging.

Yes.

> "floor" and "ceiling" say that once that purging is done, the
> extent of the potential address space freed.

Well, they specify the limits beyond which we cannot free,
because there's other stuff still making use of addresses beyond.

> I cooked up a quick patch that changes the logic to:
> 
>   floor &= PUD_MASK;
>   ceiling &= PUD_MASK;
>   if (floor - 1 >= ceiling - 1)
>   return;

That can't be right.  In exit_mmap, for example, floor will be 0
throughout, and the condition will be true for all values of ceiling.

>   pmd = pmd_offset(pud, start);
>   pud_clear(pud);
>   pmd_free_tlb(tlb, pmd);
> 
> and things start to basically work.

Because none of your higher level tables are being freed at all:
eventually you'll run out of memory.

I still can't see what's wrong with the code that's already
there.  My brain is seizing up, I'm taking a break.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Hugh, I got tired of rebooting just to get address walking
traces :-)  So I wrote a little simulator.

Basically, it's free_pgtables() with the page table pointer
stuff ripped out.

You run it like this:

./simulator vma_file

Where vma_file is a text file composed of lines of the form:

START END

in hex.  The ordered list of VMA's is built from this text file.
I enclose a sample file I took from  that trace I built earlier
today.

If you want to get fancy, it's probably very easy to make it so
that parse_file() can read in live /proc/${pid}/maps files too :)

It is easy to stick in support for platforms other than SPARC64,
the machinery is all there, just pop the definitions out from
the platforms include/asm-*/*.h and relevant include/asm-generic/*.h
header files.

Enjoy. :-)
#include 
#include 
#include 
#include 
#include 
#include 

#define ARCH_SPARC64	1
#define ARCH_IA64	2
#define ARCH_PPC64	3

#define ARCH	ARCH_SPARC64

#if (ARCH == ARCH_SPARC64)
#define PAGE_SHIFT	13
#define PMD_SHIFT	(PAGE_SHIFT + (PAGE_SHIFT-3))
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE-1))
#define PMD_BITS	11
#define PGDIR_SHIFT	(PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS)
#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE-1))
#define PUD_SHIFT	PGDIR_SHIFT
#define PTRS_PER_PUD	1
#define PUD_SIZE  	(1UL << PUD_SHIFT)
#define PUD_MASK  	(~(PUD_SIZE-1))
#define pgd_addr_end(addr, end)		\
({	unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;	\
	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
})
#define pud_addr_end(addr, end)			(end)
#define pmd_addr_end(addr, end)		\
({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
})
#else
#error please add definitions for your architecture
#endif

/*
 * Note: this doesn't free the actual pages themselves. That
 * has been handled earlier when unmapping all the memory regions.
 */
static inline void free_pte_range(unsigned long addr)
{
}

static inline void free_pmd_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	start = addr;
	do {
		next = pmd_addr_end(addr, end);
#if 1
		printf("free_pgtables: free_pte_range() addr(%lx) next(%lx) end(%lx)\n",
		   addr, next, end);
#endif
		free_pte_range(addr);
	} while (addr = next, addr != end);

	start &= PUD_MASK;
	ceiling &= PUD_MASK;
#if 1
	printf("free_pmd_range: start(%lx) ceiling(%lx) ", start, ceiling);
#endif
	if ((start < floor) ||
	(end - 1 > ceiling - 1)) {
#if 1
		printf("SKIP pud_clear()\n");
#endif
		return;
	}
#if 1
	printf("DO pud_clear()\n");
#endif
}

static inline void free_pud_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	start = addr;
	do {
		next = pud_addr_end(addr, end);
#if 1
		printf("free_pud_range: free_pmd_range(%lx,%lx,%lx,%lx)\n",
		   addr, next, floor, ceiling);
#endif
		free_pmd_range(addr, next, floor, ceiling);
	} while (addr = next, addr != end);

	start &= PGDIR_MASK;
	ceiling &= PGDIR_MASK;
#if 1
	printf("free_pud_range: start(%lx) ceiling(%lx) ",
	   start, ceiling);
#endif
	if ((start < floor) ||
	(end - 1 > ceiling - 1)) {
#if 1
		printf("SKIP pgd_clear()\n");
#endif
		return;
	}
#if 1
	printf("DO pgd_clear()\n");
#endif
}

/*
 * This function frees user-level page tables of a process.
 * Unlike other pagetable walks, some memory layouts might give end 0.
 * Must be called with pagetable lock held.
 */
static inline void free_pgd_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	addr &= PMD_MASK;
	if (addr < floor) {
		addr += PMD_SIZE;
		if (!addr)
			return;
	}
	ceiling &= PMD_MASK;
	if (end - 1 > ceiling - 1)
		end -= PMD_SIZE;
	if (addr > end - 1)
		return;

	start = addr;
	do {
		next = pgd_addr_end(addr, end);
#if 1
		printf("free_pgd_range: free_pud_range(%lx,%lx,%lx,%lx)\n",
		   addr, next, floor, ceiling);
#endif
		free_pud_range(addr, next, floor, ceiling);
	} while (addr = next, addr != end);

	if (1 /*!tlb_is_full_mm(tlb)*/) {
#if 1
		printf("free_pgd_range: Doing flush_tlb_pgtables(%lx,%lx)\n",
		   start, end);
#endif
		/*flush_tlb_pgtables(tlb->mm, start, end);*/
	}
}

struct vm_area_struct {
	struct vm_area_struct *vm_next;
	unsigned long vm_start, vm_end;
};

void free_pgtables(struct vm_area_struct *vma,
		   unsigned long floor, unsigned long ceiling)
{
#if 1
	struct vm_area_struct *tmp = vma;

	printf("VMA DUMP: ");
	while (tmp) {
		printf("[%lx:%lx] ", tmp->vm_start, tmp->vm_end);
		tmp = tmp->vm_next;
	}
	printf("\n");
#endif
	while (vma) {
		struct vm_area_struct *next = vma->vm_next;
		unsigned long addr = vma->vm_start;

		/* Optimization: gather nearby vmas into a single call down */
		while (next && next->vm_start <= vma->vm_end + 2*PMD_SIZE) {
			

Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> I notice that although both i386 and sparc64 use pgtable-nopud.h, the
> i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

Aha!  And ppc does as well via asm-generic/4level-fixup.h which is
probably why I did it this way as well when I converted sparc64
over to asm-generic/pgtable-nopud.h

Hmmm, let me play around with this.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> I notice that although both i386 and sparc64 use pgtable-nopud.h, the
> i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

This was a dead end.  I386 doesn't do anything with pud_clear() in
order to work around a chip erratum.

IA64 does clear in pud_clear() just like sparc64.

I think it's the floor/ceiling stuff.

At that pud_clear(), we do it if floor-->ceiling (after masking)
covers the whole PUD.  Not whether start/end do, which is what
the code sort of does right now.

"start" and "end" say which specific entries we might be purging.
"floor" and "ceiling" say that once that purging is done, the
extent of the potential address space freed.

I cooked up a quick patch that changes the logic to:

floor &= PUD_MASK;
ceiling &= PUD_MASK;
if (floor - 1 >= ceiling - 1)
return;

pmd = pmd_offset(pud, start);
pud_clear(pud);
pmd_free_tlb(tlb, pmd);

and things start to basically work.

When X started up my machine rebooted, but this was with
all the tracing printk()'s enabled so I'm skeptical as to
the reason.  I'll back out the debugging and play with this
some more.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
> On Tue, 22 Mar 2005 11:21:25 -0800
> "David S. Miller" <[EMAIL PROTECTED]> wrote:
> 
> > I'm trying to analyze my traces some more.
> 
> I think I see what's going wrong.  On the first
> address space traversal (free_pgd_range()), we
> clear out the pgd, even though there are still
> more PMD's to process in that PGD.
> 
> So the future loops never do anything since the
> PGD is cleared out already.

Yes, that's the conclusion I was coming to from your excellent traces.

I notice that although both i386 and sparc64 use pgtable-nopud.h, the
i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

If that really is the problem, well, I get as easily mixed between
the levels as the next man, and haven't a clue which is the right
way to do it - beyond i386 not seeing these nr_pte bugs.

But if this is the issue, I don't see how it's new to my freepgt
patches - beyond the fact that they add this BUG_ON consistency check.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 11:21:25 -0800
"David S. Miller" <[EMAIL PROTECTED]> wrote:

> I'm trying to analyze my traces some more.

I think I see what's going wrong.  On the first
address space traversal (free_pgd_range()), we
clear out the pgd, even though there are still
more PMD's to process in that PGD.

So the future loops never do anything since the
PGD is cleared out already.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 11:01:44 -0800
"David S. Miller" <[EMAIL PROTECTED]> wrote:

> Hmmm...

Thinking some more, one thing that is unique in the PPC64
and SPARC64 cases is that we are executing primarily
32-bit tasks and in such cases one PGD maps the entire
address space.

I wonder if the free_pgtables() loops simply aren't
coping well with something like that.

I'm trying to analyze my traces some more.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, here is a full dump of a free_pgtables() run that
fails to clear out all the PMD's.

It gets called with this VMA list (each entry is a
vm_start/vm_end tuple)

[0x0001:0x000a4000]
[0x000b2000:0x000b8000]
[0x000b8000:0x000de000]
[0x7000:0x7001a000]
[0x70028000:0x7002a000]
[0x7002c000:0x7006a000]
[0x7006a000:0x7006c000]
[0x7006c000:0x70084000]
[0x70084000:0x70088000]
[0x70088000:0x70094000]
[0x70094000:0x70098000]
[0x70098000:0x701da000]
[0x701da000:0x701e8000]
[0x701e8000:0x701f2000]
[0x701f2000:0x701f4000]
[0x701f4000:0x701fc000]
[0x701fc000:0x70204000]
[0x70204000:0x7020c000]
[0x7020c000:0x7021e000]
[0x7021e000:0x7022c000]
[0x7022c000:0x7023]
[0x7023:0x70232000]
[0x70234000:0x7023e000]
[0x7023e000:0x70244000]
[0x70244000:0x7024e000]
[0x7025:0x7025a000]
[0x7025a000:0x7026]
[0x7026:0x7026c000]
[0xefbfe000:0xefc28000]

And then we start to iterate, here is the trace I got:

free_pgd_range(addr[0x1000],end[0xde000],floor[0x0],ceiling[0x7000])
free_pud_range(addr[0x0],end[0xde000],floor[0x0],ceiling[0x7000])
free_pmd_range(addr[0],end[0xde000],floor[0x0],ceiling[0x7000])
free_pte_range(addr[0x0],next[0xde000],end[0xde000])   /* nr_ptes-- */

free_pgd_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xefbfe000])
free_pud_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xef80])
/* does not do free_pmd_range() for some reason)

free_pgd_range(addr[0xefbfe000],end[0xefc28000],floor[0x0],ceiling[0x0])
free_pud_range(addr[0xef80],end[0xefc28000],floor[0x0],ceiling[0x0])
/* also do not do free_pmd_range() */

Whoa, how does that work?  We are calling free_pgtables() at
exit_mmap() time with a 0 floor _and_ ceiling?

Oh I see, the tests are against "ceiling - 1".
Hmmm...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:

> >> For example, you may have a single page (start,end) address range
> >> to free, but if this is enclosed by a large enough (floor,ceiling)
> >> then it may free an entire pgd entry.
> >> 
> >> I assume the intention of the API would be to provide the full
> >> pgd width in that case?
> >
> >Yes, that is what should happen if the full PGD entry is liberated.
> >
> >Any time page table chunks are liberated, they have to be included
> >in the range passed to the flush_tlb_pgtables() call.

This now makes it crystal clear to me that Nick's suspicion was right,
my start,end to flush_tlb_pgtables is too narrow.  I'll take another
look there and correct it.

> So should this part of Hugh's code:
> 
> /*
>  * Optimization: gather nearby vmas into one call down
>  */
> while (next && next->vm_start <= vma->vm_end + 
> PMD_SIZE
> && !is_hugepage_only_range(next->vm_start, 
> HPAGE_SIZE)){
> vma = next;
> next = vma->vm_next;
> }
> free_pgd_range(tlb, addr, vma->vm_end,
> floor, next? next->vm_start: ceiling);
>  
> be changed to use pgd_addr_end() to gather up all the vma that
> are mapped by a single pgd instead of just spanning out the next
> PMD_SIZE?

Oh, I don't think so.  I suppose it could be done at this level,
but then the lower levels would go back to searching through lots
of unnecessary cachelines to find the significant entries, and
we might as well throw out the whole set of patches (which will
soon happen anyway if we can't find why they're not working!).

No, we don't have to pass pgd granularity to flush_tlb_pgtables,
we just have to try a bit harder to supply the right span to it,
whatever that is.  It might be in pmd granularity, it might be in
pud granularity (if we freed some at that level), it might be in
pgd granularity (if we freed some at that level).

> On ia64 we can have a vma big enough to require more than one pgd, but

Yes, no problem.

> in the case that we span, we won't cross the problematic pgd boundaries
> where the holes in the address space are lurking.

Yes, it wouldn't be a hole if a vma was allowed into it.

I do assume that on all architectures, the peculiar regions (might be
holes, might be huge-only regions) are separated from the ordinary
ones by pgd boundaries.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 16:37:09 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> If you and David could try the lame patch below,
> it'll at least give us a slight clue of where to be looking -
> every mm exiting with nr_ptes 1 means something different from
> every mm exiting with nr_ptes -1 means something different from
> occasional mms exiting with nr_ptes something positive.

I didn't use your debugging patch, I used something slightly
more sophisticated to simply traces nr_ptes counter bumps
then halt the system when the BUG_ON(mm->nr_ptes) triggered
so I could analyze things.

What happens is (all virtual addresses are PMD_MASK'd):

1) We map 3 PTE tables.
   0x
   0x7000
   0xef80
2) We only unmap one PTE table.
   0x
3) Then we of course BUG() because nr_ptes is 2

I'm now adding more debugging to figure out why only the
first PTE table is being unmapped.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
>> For example, you may have a single page (start,end) address range
>> to free, but if this is enclosed by a large enough (floor,ceiling)
>> then it may free an entire pgd entry.
>> 
>> I assume the intention of the API would be to provide the full
>> pgd width in that case?
>
>Yes, that is what should happen if the full PGD entry is liberated.
>
>Any time page table chunks are liberated, they have to be included
>in the range passed to the flush_tlb_pgtables() call.

So should this part of Hugh's code:

/*
 * Optimization: gather nearby vmas into one call down
 */
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
&& !is_hugepage_only_range(next->vm_start, HPAGE_SIZE)){
vma = next;
next = vma->vm_next;
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
 
be changed to use pgd_addr_end() to gather up all the vma that
are mapped by a single pgd instead of just spanning out the next
PMD_SIZE?

On ia64 we can have a vma big enough to require more than one pgd, but
in the case that we span, we won't cross the problematic pgd boundaries
where the holes in the address space are lurking.

-Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 06:08:38 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> > It just wants the range of page tables liberated.  I guess
> > essentially PMD_SIZE is the granularity.
> 
> I _think_ that answer means that my current code is fine in this respect.
> But I'm not entirely convinced.  Since sparc64 is the only architecture
> which implements a flush_tlb_pgtables which actually uses start,end,
> we do need to suit your needs there - informed reassurance welcome!

Ok.  This interface is meant to deal with platforms that virtually
map their page tables, usually for faster TLB miss processing.

As stated, IA64 does this just as sparc64 does, however they flush
their linear page table virtual mappings in a different place.

This by definition means that the granularity is PMD_SIZE.  That
is the smallest chunk of page table, ie. what a pte_t chunk maps.

> > It's funny since this code aparently works fine on ia64 which
> > is fully 3-level too.  Hmm...
> 
> Yes, odd.  I'll have to have another think later on.

I'll play around with some of the patches you posted today and
get back to you.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 05:47:13 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> > 1) start --> end straddles sparc64 address space hole
> 
> That's an interesting remark.  I hadn't noticed the signed long type.
> I believe the vma gathering in free_pgtables will have no problem with
> that, but what about the old code?  What happens if an app does a huge
> munmap straddling the address space hole?  Or is all the user address
> space below the hole?

It will BUG(), crap. :(  Good catch, I'll fix this.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Andrew Morton wrote:
> 
> With these six patches the ppc64 is hitting the BUG in exit_mmap():
> 
> BUG_ON(mm->nr_ptes);/* This is just debugging */
> 
> fairly early in boot.

So ppc64 is in the same boat as sparc64 (yet ia64 okay so far).

Sorry, I'm still clueless.

I cannot see those arches doing pte_allocs outside their vmas,
that of course could cause it.  And nr_ptes is initialized to 0
once by memset and again by assignment, so it should be starting
out even zeroer than most fields.

I should probably be paying more attention to the repellent
notion that my code is broken.

If you and David could try the lame patch below,
it'll at least give us a slight clue of where to be looking -
every mm exiting with nr_ptes 1 means something different from
every mm exiting with nr_ptes -1 means something different from
occasional mms exiting with nr_ptes something positive.

I'm not sure whether the patch would ever get to show a more
interesting proc name than "?".

And does memory leak away into lost pagetables if you continue
running, or does it actually carry on running fine, and the
problem appear to be with the BUG_ON itself?

Thanks,
Hugh

--- freepgt6/mm/mmap.c  2005-03-22 04:28:40.0 +
+++ testing/mm/mmap.c   2005-03-22 15:45:00.0 +
@@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
+   static unsigned long good_mms, bad_mms;
struct mmu_gather *tlb;
struct vm_area_struct *vma = mm->mmap;
unsigned long nr_accounted = 0;
@@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm)
vma = next;
}
 
-   BUG_ON(mm->nr_ptes);/* This is just debugging */
+   if (mm->nr_ptes && bad_mms < 250) {
+   printk(KERN_ERR "exit_mmap: %s nr_ptes %ld good_mms %lu\n",
+   current->mm == mm? current->comm: "?",
+   (long)mm->nr_ptes, good_mms);
+   good_mms = 0;
+   bad_mms++;
+   } else
+   good_mms++;
 }
 
 /* Insert vm structure into process list sorted by address
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
Andrew Morton wrote:
With these six patches the ppc64 is hitting the BUG in exit_mmap():
BUG_ON(mm->nr_ptes);/* This is just debugging */
fairly early in boot.
No doubt Hugh will have this fixed before long... but if you
have time to spare, you may just try hitting it on the head
and making it a bit dumber. It might help show where the problem
is.
- don't span multiple vmas
- don't be so smart about avoiding unfreeable regions
I dunno. Maybe it won't help at all. Maybe it will make things
worse :P

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2005-03-22 23:04:34.0 +1100
+++ linux-2.6/mm/memory.c   2005-03-22 23:11:31.0 +1100
@@ -110,13 +110,18 @@ void pmd_clear_bad(pmd_t *pmd)
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd)
+static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling)
 {
-   struct page *page = pmd_page(*pmd);
-   pmd_clear(pmd);
-   pte_free_tlb(tlb, page);
-   dec_page_state(nr_page_table_pages);
-   tlb->mm->nr_ptes--;
+   if (((addr & PMD_MASK) >= floor)
+   && (end - 1 <= (ceiling & PMD_MASK) - 1)) {
+   struct page *page = pmd_page(*pmd);
+   pmd_clear(pmd);
+   pte_free_tlb(tlb, page);
+   dec_page_state(nr_page_table_pages);
+   tlb->mm->nr_ptes--;
+   }
 }
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -133,7 +138,7 @@ static inline void free_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
-   free_pte_range(tlb, pmd);
+   free_pte_range(tlb, pmd, addr, end, floor, ceiling);
} while (pmd++, addr = next, addr != end);
 
start &= PUD_MASK;
@@ -190,18 +195,6 @@ void free_pgd_range(struct mmu_gather **
unsigned long next;
unsigned long start;
 
-   addr &= PMD_MASK;
-   if (addr < floor) {
-   addr += PMD_SIZE;
-   if (!addr)
-   return;
-   }
-   ceiling &= PMD_MASK;
-   if (end - 1 > ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr > end - 1)
-   return;
-
start = addr;
pgd = pgd_offset((*tlb)->mm, addr);
do {
@@ -226,14 +219,6 @@ void free_pgtables(struct mmu_gather **t
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
-   /*
-* Optimization: gather nearby vmas into one call down
-*/
-   while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-   && !is_hugepage_only_range(next->vm_start, HPAGE_SIZE)){
-   vma = next;
-   next = vma->vm_next;
-   }
free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Andrew Morton

With these six patches the ppc64 is hitting the BUG in exit_mmap():

BUG_ON(mm->nr_ptes);/* This is just debugging */

fairly early in boot.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Andrew Morton

With these six patches the ppc64 is hitting the BUG in exit_mmap():

BUG_ON(mm-nr_ptes);/* This is just debugging */

fairly early in boot.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
Andrew Morton wrote:
With these six patches the ppc64 is hitting the BUG in exit_mmap():
BUG_ON(mm-nr_ptes);/* This is just debugging */
fairly early in boot.
No doubt Hugh will have this fixed before long... but if you
have time to spare, you may just try hitting it on the head
and making it a bit dumber. It might help show where the problem
is.
- don't span multiple vmas
- don't be so smart about avoiding unfreeable regions
I dunno. Maybe it won't help at all. Maybe it will make things
worse :P

Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c  2005-03-22 23:04:34.0 +1100
+++ linux-2.6/mm/memory.c   2005-03-22 23:11:31.0 +1100
@@ -110,13 +110,18 @@ void pmd_clear_bad(pmd_t *pmd)
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd)
+static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling)
 {
-   struct page *page = pmd_page(*pmd);
-   pmd_clear(pmd);
-   pte_free_tlb(tlb, page);
-   dec_page_state(nr_page_table_pages);
-   tlb-mm-nr_ptes--;
+   if (((addr  PMD_MASK) = floor)
+(end - 1 = (ceiling  PMD_MASK) - 1)) {
+   struct page *page = pmd_page(*pmd);
+   pmd_clear(pmd);
+   pte_free_tlb(tlb, page);
+   dec_page_state(nr_page_table_pages);
+   tlb-mm-nr_ptes--;
+   }
 }
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -133,7 +138,7 @@ static inline void free_pmd_range(struct
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
-   free_pte_range(tlb, pmd);
+   free_pte_range(tlb, pmd, addr, end, floor, ceiling);
} while (pmd++, addr = next, addr != end);
 
start = PUD_MASK;
@@ -190,18 +195,6 @@ void free_pgd_range(struct mmu_gather **
unsigned long next;
unsigned long start;
 
-   addr = PMD_MASK;
-   if (addr  floor) {
-   addr += PMD_SIZE;
-   if (!addr)
-   return;
-   }
-   ceiling = PMD_MASK;
-   if (end - 1  ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr  end - 1)
-   return;
-
start = addr;
pgd = pgd_offset((*tlb)-mm, addr);
do {
@@ -226,14 +219,6 @@ void free_pgtables(struct mmu_gather **t
hugetlb_free_pgd_range(tlb, addr, vma-vm_end,
floor, next? next-vm_start: ceiling);
} else {
-   /*
-* Optimization: gather nearby vmas into one call down
-*/
-   while (next  next-vm_start = vma-vm_end + PMD_SIZE
-!is_hugepage_only_range(next-vm_start, HPAGE_SIZE)){
-   vma = next;
-   next = vma-vm_next;
-   }
free_pgd_range(tlb, addr, vma-vm_end,
floor, next? next-vm_start: ceiling);
}


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Andrew Morton wrote:
 
 With these six patches the ppc64 is hitting the BUG in exit_mmap():
 
 BUG_ON(mm-nr_ptes);/* This is just debugging */
 
 fairly early in boot.

So ppc64 is in the same boat as sparc64 (yet ia64 okay so far).

Sorry, I'm still clueless.

I cannot see those arches doing pte_allocs outside their vmas,
that of course could cause it.  And nr_ptes is initialized to 0
once by memset and again by assignment, so it should be starting
out even zeroer than most fields.

I should probably be paying more attention to the repellent
notion that my code is broken.

If you and David could try the lame patch below,
it'll at least give us a slight clue of where to be looking -
every mm exiting with nr_ptes 1 means something different from
every mm exiting with nr_ptes -1 means something different from
occasional mms exiting with nr_ptes something positive.

I'm not sure whether the patch would ever get to show a more
interesting proc name than ?.

And does memory leak away into lost pagetables if you continue
running, or does it actually carry on running fine, and the
problem appear to be with the BUG_ON itself?

Thanks,
Hugh

--- freepgt6/mm/mmap.c  2005-03-22 04:28:40.0 +
+++ testing/mm/mmap.c   2005-03-22 15:45:00.0 +
@@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
 void exit_mmap(struct mm_struct *mm)
 {
+   static unsigned long good_mms, bad_mms;
struct mmu_gather *tlb;
struct vm_area_struct *vma = mm-mmap;
unsigned long nr_accounted = 0;
@@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm)
vma = next;
}
 
-   BUG_ON(mm-nr_ptes);/* This is just debugging */
+   if (mm-nr_ptes  bad_mms  250) {
+   printk(KERN_ERR exit_mmap: %s nr_ptes %ld good_mms %lu\n,
+   current-mm == mm? current-comm: ?,
+   (long)mm-nr_ptes, good_mms);
+   good_mms = 0;
+   bad_mms++;
+   } else
+   good_mms++;
 }
 
 /* Insert vm structure into process list sorted by address
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 05:47:13 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

  1) start -- end straddles sparc64 address space hole
 
 That's an interesting remark.  I hadn't noticed the signed long type.
 I believe the vma gathering in free_pgtables will have no problem with
 that, but what about the old code?  What happens if an app does a huge
 munmap straddling the address space hole?  Or is all the user address
 space below the hole?

It will BUG(), crap. :(  Good catch, I'll fix this.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 06:08:38 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

  It just wants the range of page tables liberated.  I guess
  essentially PMD_SIZE is the granularity.
 
 I _think_ that answer means that my current code is fine in this respect.
 But I'm not entirely convinced.  Since sparc64 is the only architecture
 which implements a flush_tlb_pgtables which actually uses start,end,
 we do need to suit your needs there - informed reassurance welcome!

Ok.  This interface is meant to deal with platforms that virtually
map their page tables, usually for faster TLB miss processing.

As stated, IA64 does this just as sparc64 does, however they flush
their linear page table virtual mappings in a different place.

This by definition means that the granularity is PMD_SIZE.  That
is the smallest chunk of page table, ie. what a pte_t chunk maps.

  It's funny since this code aparently works fine on ia64 which
  is fully 3-level too.  Hmm...
 
 Yes, odd.  I'll have to have another think later on.

I'll play around with some of the patches you posted today and
get back to you.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
 For example, you may have a single page (start,end) address range
 to free, but if this is enclosed by a large enough (floor,ceiling)
 then it may free an entire pgd entry.
 
 I assume the intention of the API would be to provide the full
 pgd width in that case?

Yes, that is what should happen if the full PGD entry is liberated.

Any time page table chunks are liberated, they have to be included
in the range passed to the flush_tlb_pgtables() call.

So should this part of Hugh's code:

/*
 * Optimization: gather nearby vmas into one call down
 */
while (next  next-vm_start = vma-vm_end + PMD_SIZE
 !is_hugepage_only_range(next-vm_start, HPAGE_SIZE)){
vma = next;
next = vma-vm_next;
}
free_pgd_range(tlb, addr, vma-vm_end,
floor, next? next-vm_start: ceiling);
 
be changed to use pgd_addr_end() to gather up all the vma that
are mapped by a single pgd instead of just spanning out the next
PMD_SIZE?

On ia64 we can have a vma big enough to require more than one pgd, but
in the case that we span, we won't cross the problematic pgd boundaries
where the holes in the address space are lurking.

-Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 16:37:09 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

 If you and David could try the lame patch below,
 it'll at least give us a slight clue of where to be looking -
 every mm exiting with nr_ptes 1 means something different from
 every mm exiting with nr_ptes -1 means something different from
 occasional mms exiting with nr_ptes something positive.

I didn't use your debugging patch, I used something slightly
more sophisticated to simply traces nr_ptes counter bumps
then halt the system when the BUG_ON(mm-nr_ptes) triggered
so I could analyze things.

What happens is (all virtual addresses are PMD_MASK'd):

1) We map 3 PTE tables.
   0x
   0x7000
   0xef80
2) We only unmap one PTE table.
   0x
3) Then we of course BUG() because nr_ptes is 2

I'm now adding more debugging to figure out why only the
first PTE table is being unmapped.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:

  For example, you may have a single page (start,end) address range
  to free, but if this is enclosed by a large enough (floor,ceiling)
  then it may free an entire pgd entry.
  
  I assume the intention of the API would be to provide the full
  pgd width in that case?
 
 Yes, that is what should happen if the full PGD entry is liberated.
 
 Any time page table chunks are liberated, they have to be included
 in the range passed to the flush_tlb_pgtables() call.

This now makes it crystal clear to me that Nick's suspicion was right,
my start,end to flush_tlb_pgtables is too narrow.  I'll take another
look there and correct it.

 So should this part of Hugh's code:
 
 /*
  * Optimization: gather nearby vmas into one call down
  */
 while (next  next-vm_start = vma-vm_end + 
 PMD_SIZE
  !is_hugepage_only_range(next-vm_start, 
 HPAGE_SIZE)){
 vma = next;
 next = vma-vm_next;
 }
 free_pgd_range(tlb, addr, vma-vm_end,
 floor, next? next-vm_start: ceiling);
  
 be changed to use pgd_addr_end() to gather up all the vma that
 are mapped by a single pgd instead of just spanning out the next
 PMD_SIZE?

Oh, I don't think so.  I suppose it could be done at this level,
but then the lower levels would go back to searching through lots
of unnecessary cachelines to find the significant entries, and
we might as well throw out the whole set of patches (which will
soon happen anyway if we can't find why they're not working!).

No, we don't have to pass pgd granularity to flush_tlb_pgtables,
we just have to try a bit harder to supply the right span to it,
whatever that is.  It might be in pmd granularity, it might be in
pud granularity (if we freed some at that level), it might be in
pgd granularity (if we freed some at that level).

 On ia64 we can have a vma big enough to require more than one pgd, but

Yes, no problem.

 in the case that we span, we won't cross the problematic pgd boundaries
 where the holes in the address space are lurking.

Yes, it wouldn't be a hole if a vma was allowed into it.

I do assume that on all architectures, the peculiar regions (might be
holes, might be huge-only regions) are separated from the ordinary
ones by pgd boundaries.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, here is a full dump of a free_pgtables() run that
fails to clear out all the PMD's.

It gets called with this VMA list (each entry is a
vm_start/vm_end tuple)

[0x0001:0x000a4000]
[0x000b2000:0x000b8000]
[0x000b8000:0x000de000]
[0x7000:0x7001a000]
[0x70028000:0x7002a000]
[0x7002c000:0x7006a000]
[0x7006a000:0x7006c000]
[0x7006c000:0x70084000]
[0x70084000:0x70088000]
[0x70088000:0x70094000]
[0x70094000:0x70098000]
[0x70098000:0x701da000]
[0x701da000:0x701e8000]
[0x701e8000:0x701f2000]
[0x701f2000:0x701f4000]
[0x701f4000:0x701fc000]
[0x701fc000:0x70204000]
[0x70204000:0x7020c000]
[0x7020c000:0x7021e000]
[0x7021e000:0x7022c000]
[0x7022c000:0x7023]
[0x7023:0x70232000]
[0x70234000:0x7023e000]
[0x7023e000:0x70244000]
[0x70244000:0x7024e000]
[0x7025:0x7025a000]
[0x7025a000:0x7026]
[0x7026:0x7026c000]
[0xefbfe000:0xefc28000]

And then we start to iterate, here is the trace I got:

free_pgd_range(addr[0x1000],end[0xde000],floor[0x0],ceiling[0x7000])
free_pud_range(addr[0x0],end[0xde000],floor[0x0],ceiling[0x7000])
free_pmd_range(addr[0],end[0xde000],floor[0x0],ceiling[0x7000])
free_pte_range(addr[0x0],next[0xde000],end[0xde000])   /* nr_ptes-- */

free_pgd_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xefbfe000])
free_pud_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xef80])
/* does not do free_pmd_range() for some reason)

free_pgd_range(addr[0xefbfe000],end[0xefc28000],floor[0x0],ceiling[0x0])
free_pud_range(addr[0xef80],end[0xefc28000],floor[0x0],ceiling[0x0])
/* also do not do free_pmd_range() */

Whoa, how does that work?  We are calling free_pgtables() at
exit_mmap() time with a 0 floor _and_ ceiling?

Oh I see, the tests are against ceiling - 1.
Hmmm...

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 11:01:44 -0800
David S. Miller [EMAIL PROTECTED] wrote:

 Hmmm...

Thinking some more, one thing that is unique in the PPC64
and SPARC64 cases is that we are executing primarily
32-bit tasks and in such cases one PGD maps the entire
address space.

I wonder if the free_pgtables() loops simply aren't
coping well with something like that.

I'm trying to analyze my traces some more.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 11:21:25 -0800
David S. Miller [EMAIL PROTECTED] wrote:

 I'm trying to analyze my traces some more.

I think I see what's going wrong.  On the first
address space traversal (free_pgd_range()), we
clear out the pgd, even though there are still
more PMD's to process in that PGD.

So the future loops never do anything since the
PGD is cleared out already.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
 On Tue, 22 Mar 2005 11:21:25 -0800
 David S. Miller [EMAIL PROTECTED] wrote:
 
  I'm trying to analyze my traces some more.
 
 I think I see what's going wrong.  On the first
 address space traversal (free_pgd_range()), we
 clear out the pgd, even though there are still
 more PMD's to process in that PGD.
 
 So the future loops never do anything since the
 PGD is cleared out already.

Yes, that's the conclusion I was coming to from your excellent traces.

I notice that although both i386 and sparc64 use pgtable-nopud.h, the
i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

If that really is the problem, well, I get as easily mixed between
the levels as the next man, and haven't a clue which is the right
way to do it - beyond i386 not seeing these nr_pte bugs.

But if this is the issue, I don't see how it's new to my freepgt
patches - beyond the fact that they add this BUG_ON consistency check.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

 I notice that although both i386 and sparc64 use pgtable-nopud.h, the
 i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

This was a dead end.  I386 doesn't do anything with pud_clear() in
order to work around a chip erratum.

IA64 does clear in pud_clear() just like sparc64.

I think it's the floor/ceiling stuff.

At that pud_clear(), we do it if floor--ceiling (after masking)
covers the whole PUD.  Not whether start/end do, which is what
the code sort of does right now.

start and end say which specific entries we might be purging.
floor and ceiling say that once that purging is done, the
extent of the potential address space freed.

I cooked up a quick patch that changes the logic to:

floor = PUD_MASK;
ceiling = PUD_MASK;
if (floor - 1 = ceiling - 1)
return;

pmd = pmd_offset(pud, start);
pud_clear(pud);
pmd_free_tlb(tlb, pmd);

and things start to basically work.

When X started up my machine rebooted, but this was with
all the tracing printk()'s enabled so I'm skeptical as to
the reason.  I'll back out the debugging and play with this
some more.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

 I notice that although both i386 and sparc64 use pgtable-nopud.h, the
 i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.

Aha!  And ppc does as well via asm-generic/4level-fixup.h which is
probably why I did it this way as well when I converted sparc64
over to asm-generic/pgtable-nopud.h

Hmmm, let me play around with this.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Hugh, I got tired of rebooting just to get address walking
traces :-)  So I wrote a little simulator.

Basically, it's free_pgtables() with the page table pointer
stuff ripped out.

You run it like this:

./simulator vma_file

Where vma_file is a text file composed of lines of the form:

START END

in hex.  The ordered list of VMA's is built from this text file.
I enclose a sample file I took from  that trace I built earlier
today.

If you want to get fancy, it's probably very easy to make it so
that parse_file() can read in live /proc/${pid}/maps files too :)

It is easy to stick in support for platforms other than SPARC64,
the machinery is all there, just pop the definitions out from
the platforms include/asm-*/*.h and relevant include/asm-generic/*.h
header files.

Enjoy. :-)
#include stddef.h
#include stdio.h
#include sys/types.h
#include sys/stat.h
#include fcntl.h
#include malloc.h

#define ARCH_SPARC64	1
#define ARCH_IA64	2
#define ARCH_PPC64	3

#define ARCH	ARCH_SPARC64

#if (ARCH == ARCH_SPARC64)
#define PAGE_SHIFT	13
#define PMD_SHIFT	(PAGE_SHIFT + (PAGE_SHIFT-3))
#define PMD_SIZE	(1UL  PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE-1))
#define PMD_BITS	11
#define PGDIR_SHIFT	(PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS)
#define PGDIR_SIZE	(1UL  PGDIR_SHIFT)
#define PGDIR_MASK	(~(PGDIR_SIZE-1))
#define PUD_SHIFT	PGDIR_SHIFT
#define PTRS_PER_PUD	1
#define PUD_SIZE  	(1UL  PUD_SHIFT)
#define PUD_MASK  	(~(PUD_SIZE-1))
#define pgd_addr_end(addr, end)		\
({	unsigned long __boundary = ((addr) + PGDIR_SIZE)  PGDIR_MASK;	\
	(__boundary - 1  (end) - 1)? __boundary: (end);		\
})
#define pud_addr_end(addr, end)			(end)
#define pmd_addr_end(addr, end)		\
({	unsigned long __boundary = ((addr) + PMD_SIZE)  PMD_MASK;	\
	(__boundary - 1  (end) - 1)? __boundary: (end);		\
})
#else
#error please add definitions for your architecture
#endif

/*
 * Note: this doesn't free the actual pages themselves. That
 * has been handled earlier when unmapping all the memory regions.
 */
static inline void free_pte_range(unsigned long addr)
{
}

static inline void free_pmd_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	start = addr;
	do {
		next = pmd_addr_end(addr, end);
#if 1
		printf(free_pgtables: free_pte_range() addr(%lx) next(%lx) end(%lx)\n,
		   addr, next, end);
#endif
		free_pte_range(addr);
	} while (addr = next, addr != end);

	start = PUD_MASK;
	ceiling = PUD_MASK;
#if 1
	printf(free_pmd_range: start(%lx) ceiling(%lx) , start, ceiling);
#endif
	if ((start  floor) ||
	(end - 1  ceiling - 1)) {
#if 1
		printf(SKIP pud_clear()\n);
#endif
		return;
	}
#if 1
	printf(DO pud_clear()\n);
#endif
}

static inline void free_pud_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	start = addr;
	do {
		next = pud_addr_end(addr, end);
#if 1
		printf(free_pud_range: free_pmd_range(%lx,%lx,%lx,%lx)\n,
		   addr, next, floor, ceiling);
#endif
		free_pmd_range(addr, next, floor, ceiling);
	} while (addr = next, addr != end);

	start = PGDIR_MASK;
	ceiling = PGDIR_MASK;
#if 1
	printf(free_pud_range: start(%lx) ceiling(%lx) ,
	   start, ceiling);
#endif
	if ((start  floor) ||
	(end - 1  ceiling - 1)) {
#if 1
		printf(SKIP pgd_clear()\n);
#endif
		return;
	}
#if 1
	printf(DO pgd_clear()\n);
#endif
}

/*
 * This function frees user-level page tables of a process.
 * Unlike other pagetable walks, some memory layouts might give end 0.
 * Must be called with pagetable lock held.
 */
static inline void free_pgd_range(unsigned long addr, unsigned long end,
  unsigned long floor, unsigned long ceiling)
{
	unsigned long next;
	unsigned long start;

	addr = PMD_MASK;
	if (addr  floor) {
		addr += PMD_SIZE;
		if (!addr)
			return;
	}
	ceiling = PMD_MASK;
	if (end - 1  ceiling - 1)
		end -= PMD_SIZE;
	if (addr  end - 1)
		return;

	start = addr;
	do {
		next = pgd_addr_end(addr, end);
#if 1
		printf(free_pgd_range: free_pud_range(%lx,%lx,%lx,%lx)\n,
		   addr, next, floor, ceiling);
#endif
		free_pud_range(addr, next, floor, ceiling);
	} while (addr = next, addr != end);

	if (1 /*!tlb_is_full_mm(tlb)*/) {
#if 1
		printf(free_pgd_range: Doing flush_tlb_pgtables(%lx,%lx)\n,
		   start, end);
#endif
		/*flush_tlb_pgtables(tlb-mm, start, end);*/
	}
}

struct vm_area_struct {
	struct vm_area_struct *vm_next;
	unsigned long vm_start, vm_end;
};

void free_pgtables(struct vm_area_struct *vma,
		   unsigned long floor, unsigned long ceiling)
{
#if 1
	struct vm_area_struct *tmp = vma;

	printf(VMA DUMP: );
	while (tmp) {
		printf([%lx:%lx] , tmp-vm_start, tmp-vm_end);
		tmp = tmp-vm_next;
	}
	printf(\n);
#endif
	while (vma) {
		struct vm_area_struct *next = vma-vm_next;
		unsigned long addr = vma-vm_start;

		/* Optimization: gather nearby vmas into a single call down */
		while (next  next-vm_start = vma-vm_end + 2*PMD_SIZE) {
			vma = next;

Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
 On Tue, 22 Mar 2005 19:36:46 + (GMT)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  I notice that although both i386 and sparc64 use pgtable-nopud.h, the
  i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
 
 This was a dead end.  I386 doesn't do anything with pud_clear() in
 order to work around a chip erratum.
 
 IA64 does clear in pud_clear() just like sparc64.

My mind kept flipping back and forth on whether it was pud_clear().
I agree, I can't see that it's the issue now.

 I think it's the floor/ceiling stuff.
 
 At that pud_clear(), we do it if floor--ceiling (after masking)
 covers the whole PUD.  Not whether start/end do, which is what
 the code sort of does right now.

Not an explanation I understood ;)

 start and end say which specific entries we might be purging.

Yes.

 floor and ceiling say that once that purging is done, the
 extent of the potential address space freed.

Well, they specify the limits beyond which we cannot free,
because there's other stuff still making use of addresses beyond.

 I cooked up a quick patch that changes the logic to:
 
   floor = PUD_MASK;
   ceiling = PUD_MASK;
   if (floor - 1 = ceiling - 1)
   return;

That can't be right.  In exit_mmap, for example, floor will be 0
throughout, and the condition will be true for all values of ceiling.

   pmd = pmd_offset(pud, start);
   pud_clear(pud);
   pmd_free_tlb(tlb, pmd);
 
 and things start to basically work.

Because none of your higher level tables are being freed at all:
eventually you'll run out of memory.

I still can't see what's wrong with the code that's already
there.  My brain is seizing up, I'm taking a break.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
 be changed to use pgd_addr_end() to gather up all the vma that
 are mapped by a single pgd instead of just spanning out the next
 PMD_SIZE?

Oh, I don't think so.  I suppose it could be done at this level,
but then the lower levels would go back to searching through lots
of unnecessary cachelines to find the significant entries, and
we might as well throw out the whole set of patches (which will
soon happen anyway if we can't find why they're not working!).

Then I don't see how we decide when to clear a pointer at each
level.  Are there counters of how many entries are active in each
table at all levels (pgd/pud/pmd/pte)?

Consider the case where two or more vmas are mapped at addresses that
share a pgd entry, but are far enough apart that you don't want
to coalesce them.  First call down through that pgd entry must
leave the pointer to the pud (or pmd/pte for systems with less
levels) so that the next call can still get to the pud/pmd/pte
to clear out entries for the other vma.  But without counters
(at all levels) you don't know when you are all done with a table,
or whether you need to leave it for the next call.

If you want to pursue this, then there are probably some fields
in the page_t for the page that is being used as a pgd/pud/pmd/pte
that are available (do all architectures allocate whole pages for
all levels of the page tree?)

Alternatively you could modify the use of floor/ceiling as they
are passed down from the top level to indicate the progressively
greater address ranges that have been dealt with ... but I'm not
completely convinced that gives you enough information.  You would
need to do careful extension of the ceiling at each level to make
sure that you reach out to the end of of each table at each level
to account for gaps between vmas (which would result in no future
calldown hitting this part of the table).

-Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 21:51:39 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

 I still can't see what's wrong with the code that's already
 there.  My brain is seizing up, I'm taking a break.

Ok, meanwhile I'll do a brain dump of what I think this
code should be doing.

Let's take an example free_pgd_range() call.  Say the
address parameters are:

addr0x1
end 0xa4000
floor   0x0
ceiling 0xb2000

(This example comes from my exit_mmap() VMA dump earlier
 in this thread.  If you disable the VMA skipping optimization
 the first call to free_pgd_range() has these parameters.)

What ought this free_pgd_range() call do?  This range of
addresses, from floor to ceiling, is smaller than a PMD_SIZE
(which on sparc64 is 1  23).  Therefore it should clear
no PGD or PUD entries.

Yet, it does clear them, specifically:

free_pgd_range():
1) mask addr (0x1) to PMD_MASK, addr is now 0
2) addr  floor (0x0) test does not pass
3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too
4) end - 1  ceiling - 1 test does not pass
5) addr  end - 1 test does not pass either
6) We now loop one PGDIR_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pud_range():
1) addr=0, end=0xa4000, floor=0, ceiling=0
2) We loop one PUD_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pmd_range():
1) addr=0, end=0xa4000, floor=0, ceiling=0
2) We loop one PMD_SIZE at a time from
   addr (0x0) to end (0xa4000), calling
   down into...
free_pte_range():

And later when we finish the loops in free_pmd_range()
and free_pud_range() we do pud_clear() and pgd_clear()
respectively, both wrong.

The source of the problems seems to be how ceiling began
at the top of the call chain as 0xb2000, but when we
masked it with PMD_MASK that set it to zero, which means
top of address space in these functions.  That's not
what we want.

I added a quick hack to the simulator I posted, where
we mask ceiling in free_pgd_range(), I do it like this:

if (ceiling) {
ceiling = PMD_MASK;
if (!ceiling)
return;
}

and things seem to behave.  I'll try to analyze things
further and test this out on a real kernel, but all of
these adjustments at the top of free_pgd_range() really
start to look like pure spaghetti. :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Benjamin Herrenschmidt
On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote:
 On Tue, 22 Mar 2005, Andrew Morton wrote:
  
  With these six patches the ppc64 is hitting the BUG in exit_mmap():
  
  BUG_ON(mm-nr_ptes);/* This is just debugging */
  
  fairly early in boot.
 
 So ppc64 is in the same boat as sparc64 (yet ia64 okay so far).
 
 Sorry, I'm still clueless.
 
 I cannot see those arches doing pte_allocs outside their vmas,
 that of course could cause it.  And nr_ptes is initialized to 0
 once by memset and again by assignment, so it should be starting
 out even zeroer than most fields.

We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we
don't use VMAs but our own mecanism (yah, ugly, but that's some legacy
we have from the original port, though I do intend to change that at one
point).

 I should probably be paying more attention to the repellent
 notion that my code is broken.
 
 If you and David could try the lame patch below,
 it'll at least give us a slight clue of where to be looking -
 every mm exiting with nr_ptes 1 means something different from
 every mm exiting with nr_ptes -1 means something different from
 occasional mms exiting with nr_ptes something positive.
 
 I'm not sure whether the patch would ever get to show a more
 interesting proc name than ?.
 
 And does memory leak away into lost pagetables if you continue
 running, or does it actually carry on running fine, and the
 problem appear to be with the BUG_ON itself?
 
 Thanks,
 Hugh
 
 --- freepgt6/mm/mmap.c2005-03-22 04:28:40.0 +
 +++ testing/mm/mmap.c 2005-03-22 15:45:00.0 +
 @@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk);
  /* Release all mmaps. */
  void exit_mmap(struct mm_struct *mm)
  {
 + static unsigned long good_mms, bad_mms;
   struct mmu_gather *tlb;
   struct vm_area_struct *vma = mm-mmap;
   unsigned long nr_accounted = 0;
 @@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm)
   vma = next;
   }
  
 - BUG_ON(mm-nr_ptes);/* This is just debugging */
 + if (mm-nr_ptes  bad_mms  250) {
 + printk(KERN_ERR exit_mmap: %s nr_ptes %ld good_mms %lu\n,
 + current-mm == mm? current-comm: ?,
 + (long)mm-nr_ptes, good_mms);
 + good_mms = 0;
 + bad_mms++;
 + } else
 + good_mms++;
  }
  
  /* Insert vm structure into process list sorted by address
-- 
Benjamin Herrenschmidt [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
Hugh Dickins wrote:
On Tue, 22 Mar 2005, David S. Miller wrote:
On Tue, 22 Mar 2005 19:36:46 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

I notice that although both i386 and sparc64 use pgtable-nopud.h, the
i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
This was a dead end.  I386 doesn't do anything with pud_clear() in
order to work around a chip erratum.
IA64 does clear in pud_clear() just like sparc64.

My mind kept flipping back and forth on whether it was pud_clear().
I agree, I can't see that it's the issue now.
It shouldn't be.
In the case that pud is folded, free_pud_range will only call into
free_pmd_range once, and that function will loop over the required
range of the pud (ie. the pmd). If it then also falls through to
pud_clear in that function, it will also fall through to pgd_clear
in free_pud_range. So it doesn't _really_ matter which one does the
actual clearing in that case.
I think David's on the right track - I think there's something a
bit wrong at the top. In my reply to Andrew in this thread I
posted a patch which may at least get things working...
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 14:40:55 -0800
Luck, Tony [EMAIL PROTECTED] wrote:

 Then I don't see how we decide when to clear a pointer at each
 level.  Are there counters of how many entries are active in each
 table at all levels (pgd/pud/pmd/pte)?

No, there are no counters.

How it works is that it knows the extent in each direction
where mappings do not exist.

Once we know we have a clear span up to the next PMD_SIZE
modulo (and PUD_SIZE and so on and so forth) we know we
can liberate the page table chunks covered by such ranges.

Say we are unmapping a page at some address.  The next VMA
in the address space says where the next potentially valid
mapping resides.  The previous VMA says similarly.  If this
is the first or last VMA, we use the beginning or end of
the virtual address space as our value.

Play around with my little simulator I posted, you'll see how
it works ;-)  Actually, this is the second such simulator you
have seen Tony :-)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Benjamin Herrenschmidt
On Tue, 2005-03-22 at 12:21 -0800, David S. Miller wrote:
 On Tue, 22 Mar 2005 19:36:46 + (GMT)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  I notice that although both i386 and sparc64 use pgtable-nopud.h, the
  i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0.
 
 Aha!  And ppc does as well via asm-generic/4level-fixup.h which is
 probably why I did it this way as well when I converted sparc64
 over to asm-generic/pgtable-nopud.h
 
 Hmmm, let me play around with this.

I converted ppc64 to no-pud.h but it seems to introduce some breakage...
X kills the box when launched on the G5, haven't had time to investigate
yet, it might be some bug of mine. Here is my current patch:


Index: linux-work/include/asm-ppc64/pgalloc.h
===
--- linux-work.orig/include/asm-ppc64/pgalloc.h 2005-03-07 14:00:11.0 
+1100
+++ linux-work/include/asm-ppc64/pgalloc.h  2005-03-17 17:00:50.0 
+1100
@@ -30,7 +30,7 @@
kmem_cache_free(zero_cache, pgd);
 }
 
-#define pgd_populate(MM, PGD, PMD) pgd_set(PGD, PMD)
+#define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
 
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
Index: linux-work/include/asm-ppc64/pgtable.h
===
--- linux-work.orig/include/asm-ppc64/pgtable.h 2005-03-07 14:00:11.0 
+1100
+++ linux-work/include/asm-ppc64/pgtable.h  2005-03-17 17:00:50.0 
+1100
@@ -1,8 +1,6 @@
 #ifndef _PPC64_PGTABLE_H
 #define _PPC64_PGTABLE_H
 
-#include asm-generic/4level-fixup.h
-
 /*
  * This file contains the functions and defines necessary to modify and use
  * the ppc64 hashed page table.
@@ -17,6 +15,8 @@
 #include asm/tlbflush.h
 #endif /* __ASSEMBLY__ */
 
+#include asm-generic/pgtable-nopud.h
+
 /* PMD_SHIFT determines what a second-level page table entry can map */
 #define PMD_SHIFT  (PAGE_SHIFT + PAGE_SHIFT - 3)
 #define PMD_SIZE   (1UL  PMD_SHIFT)
@@ -216,12 +216,12 @@
 #define pmd_page_kernel(pmd)   \
(__bpn_to_ba(pmd_val(pmd)  PMD_TO_PTEPAGE_SHIFT))
 #define pmd_page(pmd)  virt_to_page(pmd_page_kernel(pmd))
-#define pgd_set(pgdp, pmdp)(pgd_val(*(pgdp)) = (__ba_to_bpn(pmdp)))
-#define pgd_none(pgd)  (!pgd_val(pgd))
-#define pgd_bad(pgd)   ((pgd_val(pgd)) == 0)
-#define pgd_present(pgd)   (pgd_val(pgd) != 0UL)
-#define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL)
-#define pgd_page(pgd)  (__bpn_to_ba(pgd_val(pgd))) 
+#define pud_set(pgdp, pmdp)(pud_val(*(pgdp)) = (__ba_to_bpn(pmdp)))
+#define pud_none(pgd)  (!pud_val(pgd))
+#define pud_bad(pgd)   ((pud_val(pgd)) == 0)
+#define pud_present(pgd)   (pud_val(pgd) != 0UL)
+#define pud_clear(pgdp)(pud_val(*(pgdp)) = 0UL)
+#define pud_page(pgd)  (__bpn_to_ba(pud_val(pgd))) 
 
 /* 
  * Find an entry in a page-table-directory.  We combine the address region 
@@ -233,8 +233,8 @@
 #define pgd_offset(mm, address) ((mm)-pgd + pgd_index(address))
 
 /* Find an entry in the second-level page table.. */
-#define pmd_offset(dir,addr) \
-  ((pmd_t *) pgd_page(*(dir)) + (((addr)  PMD_SHIFT)  (PTRS_PER_PMD - 1)))
+#define pmd_offset(pudp,addr) \
+  ((pmd_t *) pud_page(*(pudp)) + (((addr)  PMD_SHIFT)  (PTRS_PER_PMD - 1)))
 
 /* Find an entry in the third-level page table.. */
 #define pte_offset_kernel(dir,addr) \
@@ -552,20 +552,23 @@
 static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
 {
pgd_t *pg;
+   pud_t *pu;
pmd_t *pm;
pte_t *pt = NULL;
pte_t pte;
 
pg = pgdir + pgd_index(ea);
if (!pgd_none(*pg)) {
-
-   pm = pmd_offset(pg, ea);
-   if (pmd_present(*pm)) { 
-   pt = pte_offset_kernel(pm, ea);
-   pte = *pt;
-   if (!pte_present(pte))
-   pt = NULL;
-   }
+   pu = pud_offset(pg, ea);
+   if (!pud_none(*pu)) {   
+   pm = pmd_offset(pu, ea);
+   if (pmd_present(*pm)) { 
+   pt = pte_offset_kernel(pm, ea);
+   pte = *pt;
+   if (!pte_present(pte))
+   pt = NULL;
+   }
+   }   
}
 
return pt;
Index: linux-work/arch/ppc64/mm/init.c
===
--- linux-work.orig/arch/ppc64/mm/init.c2005-03-15 11:57:29.0 
+1100
+++ linux-work/arch/ppc64/mm/init.c 2005-03-17 17:20:14.0 +1100
@@ -136,14 +136,78 @@
 
 #else
 
+static void unmap_im_area_pte(pmd_t *pmd, unsigned long addr,
+ unsigned long end)
+{
+   pte_t *pte;
+
+   pte = 

Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 10:32:10 +1100
Nick Piggin [EMAIL PROTECTED] wrote:

 I think David's on the right track - I think there's something a
 bit wrong at the top. In my reply to Andrew in this thread I
 posted a patch which may at least get things working...

We have to do the if (ceiling) check in every spot where
we mask it in some way, and if it falls to zero from non-zero
due to the masking, we skip.

That gives me a mostly working kernel.

Bad news is that while lat_proc's fork and exec tests improve
dramatically, shell performance is way down on sparc64.
I'll post before and after numbers in a bit.  Note, this is
just with Hugh's base patch plus bug fixes.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Luck, Tony
How it works is that it knows the extent in each direction
where mappings do not exist.

Once we know we have a clear span up to the next PMD_SIZE
modulo (and PUD_SIZE and so on and so forth) we know we
can liberate the page table chunks covered by such ranges.

Ok ... I see that now (I was mostly looking at free_pgtables()
and missed the conditional in the arglist that passes the
start of the next vma into free_pgd_range() as the ceiling
until we run out of vmas and pass in ceiling.

But I'm still confused by all the math on addr/end at each
level.  Rounding up/down at each level should presumably be
based on the size of objects at the next level.  So the pgd
code should round using PUD_MASK, pud should use PMD_MASK etc.
Perhaps I missed some updates, but the version of the patch
that I have (and the simulator) is using PMD_MASK in the
pgd_free_range() function ... which is surely wrong.

-Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 15:53:08 -0800
Luck, Tony [EMAIL PROTECTED] wrote:

 But I'm still confused by all the math on addr/end at each
 level.  Rounding up/down at each level should presumably be
 based on the size of objects at the next level.  So the pgd
 code should round using PUD_MASK, pud should use PMD_MASK etc.
 Perhaps I missed some updates, but the version of the patch
 that I have (and the simulator) is using PMD_MASK in the
 pgd_free_range() function ... which is surely wrong.

PMD_MASK decides the smallest page table chunk, so we mask
it at the top level.

Look at the next level down in the call chain, the masking
maskes more sense there.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, here are (finally, I've been debugging this so much purely
to see these things) some lmbench numbers with Hugh's base
patch on sparc64.

Ignore my previous comments about shell performance getting
worse, it's some difference that makes things run more slowly
in single user mode compared to a fully brought up system.

First, for 32-bit tasks.

BEFORE
Process fork+exit: 171.4483 microseconds
Process fork+exit: 171.9688 microseconds
Process fork+exit: 169.2727 microseconds
Process fork+exit: 169.0333 microseconds
Process fork+exit: 165.8065 microseconds
Process fork+execve: 555.7000 microseconds
Process fork+execve: 556.6000 microseconds
Process fork+execve: 552.6000 microseconds
Process fork+execve: 557.1000 microseconds
Process fork+execve: 552. microseconds
Process fork+/bin/sh -c: 2207. microseconds
Process fork+/bin/sh -c: 2183. microseconds
Process fork+/bin/sh -c: 2179.6667 microseconds
Process fork+/bin/sh -c: 2190. microseconds
Process fork+/bin/sh -c: 2197.6667 microseconds

AFTER
Process fork+exit: 142.9487 microseconds
Process fork+exit: 147.8649 microseconds
Process fork+exit: 139.0250 microseconds
Process fork+exit: 138.9250 microseconds
Process fork+exit: 136.9268 microseconds
Process fork+execve: 478. microseconds
Process fork+execve: 479.1667 microseconds
Process fork+execve: 479.9091 microseconds
Process fork+execve: 480.1667 microseconds
Process fork+execve: 479.9091 microseconds
Process fork+/bin/sh -c: 2026. microseconds
Process fork+/bin/sh -c: 2029.6667 microseconds
Process fork+/bin/sh -c: 2044.6667 microseconds
Process fork+/bin/sh -c: 2037.6667 microseconds
Process fork+/bin/sh -c: 2028.6667 microseconds

Pretty good, now for 64-bit processes.

BEFORE
Process fork+exit: 226.5200 microseconds
Process fork+exit: 230.0417 microseconds
Process fork+exit: 223.8800 microseconds
Process fork+exit: 226.4091 microseconds
Process fork+exit: 219.3043 microseconds
Process fork+execve: 799.8571 microseconds
Process fork+execve: 806.1429 microseconds
Process fork+execve: 799.5714 microseconds
Process fork+execve: 800.8571 microseconds
Process fork+execve: 788.7143 microseconds
Process fork+/bin/sh -c: 2655. microseconds
Process fork+/bin/sh -c: 2668.5000 microseconds
Process fork+/bin/sh -c: 2649. microseconds
Process fork+/bin/sh -c: 2662.5000 microseconds
Process fork+/bin/sh -c: 2642. microseconds

AFTER
Process fork+exit: 165.1212 microseconds
Process fork+exit: 159.4571 microseconds
Process fork+exit: 160.3714 microseconds
Process fork+exit: 158.9091 microseconds
Process fork+exit: 157.2188 microseconds
Process fork+execve: 536.4545 microseconds
Process fork+execve: 542.0909 microseconds
Process fork+execve: 536.3000 microseconds
Process fork+execve: 540.6364 microseconds
Process fork+execve: 537.1818 microseconds
Process fork+/bin/sh -c: 2275. microseconds
Process fork+/bin/sh -c: 2272. microseconds
Process fork+/bin/sh -c: 2275.6667 microseconds
Process fork+/bin/sh -c: 2270. microseconds
Process fork+/bin/sh -c: 2284. microseconds

Quite nice.  It makes the 64-bit numbers on par with
the 32-bit numbers.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller

Ok, this patch, on top of Hugh's original freepgt patch, gets
me a working system.  It includes Hugh's bug fix, plus the
ceiling masking roll-over fix of mine.

It should get ppc working too, I bet.

--- mm/memory.c.hugh2005-03-22 16:01:07.0 -0800
+++ mm/memory.c 2005-03-22 16:00:08.0 -0800
@@ -127,11 +127,6 @@
unsigned long next;
unsigned long start;
 
-   if (end - 1  ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr  end - 1)
-   return;
-
start = addr;
pmd = pmd_offset(pud, addr);
do {
@@ -144,7 +139,11 @@
start = PUD_MASK;
if (start  floor)
return;
-   ceiling = PUD_MASK;
+   if (ceiling) {
+   ceiling = PUD_MASK;
+   if (!ceiling)
+   return;
+   }
if (end - 1  ceiling - 1)
return;
 
@@ -173,7 +172,11 @@
start = PGDIR_MASK;
if (start  floor)
return;
-   ceiling = PGDIR_MASK;
+   if (ceiling) {
+   ceiling = PGDIR_MASK;
+   if (!ceiling)
+   return;
+   }
if (end - 1  ceiling - 1)
return;
 
@@ -201,8 +204,14 @@
if (!addr)
return;
}
-   ceiling = PMD_MASK;
-   if (addr  ceiling - 1)
+   if (ceiling) {
+   ceiling = PMD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1  ceiling - 1)
+   end -= PMD_SIZE;
+   if (addr  end - 1)
return;
 
start = addr;
@@ -214,7 +223,7 @@
free_pud_range(tlb, pgd, addr, next, floor, ceiling);
} while (pgd++, addr = next, addr != end);
 
-   if (!tlb_is_full_mm(tlb)  start  end)
+   if (!tlb_is_full_mm(tlb))
flush_tlb_pgtables(tlb-mm, start, end);
 }
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
David S. Miller wrote:
On Wed, 23 Mar 2005 10:32:10 +1100
Nick Piggin [EMAIL PROTECTED] wrote:

I think David's on the right track - I think there's something a
bit wrong at the top. In my reply to Andrew in this thread I
posted a patch which may at least get things working...

We have to do the if (ceiling) check in every spot where
we mask it in some way, and if it falls to zero from non-zero
due to the masking, we skip.
That gives me a mostly working kernel.
I see. Tricky.
Bad news is that while lat_proc's fork and exec tests improve
dramatically, shell performance is way down on sparc64.
I'll post before and after numbers in a bit.  Note, this is
just with Hugh's base patch plus bug fixes.
That's interesting. The only extra stuff Hugh's should be
doing is the vma traversal.
The single walk patch seems to fit in quite well with Hugh's
updated patchset. I haven't quite worked out how to best do
hugepages, but it is easily possible. But I won't dust that
off again until Hugh's is nicely tested and working.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 11:19:38 +1100
Nick Piggin [EMAIL PROTECTED] wrote:

  dramatically, shell performance is way down on sparc64.
  I'll post before and after numbers in a bit.  Note, this is
  just with Hugh's base patch plus bug fixes.
  
 
 That's interesting. The only extra stuff Hugh's should be
 doing is the vma traversal.

Like I said in another posting, ignore this anomaly.
It's some difference in the way the shell runs
when done from init=/bin/sh single user vs. full
multi-user.  Full sparc64 before/after in that posting,
go check it out it's nice :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:
 
 Alternatively you could modify the use of floor/ceiling as they
 are passed down from the top level to indicate the progressively
 greater address ranges that have been dealt with ... but I'm not
 completely convinced that gives you enough information.  You would
 need to do careful extension of the ceiling at each level to make
 sure that you reach out to the end of of each table at each level
 to account for gaps between vmas (which would result in no future
 calldown hitting this part of the table).

That pretty much describes how it does work (when it works!)

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, David S. Miller wrote:
 On Tue, 22 Mar 2005 21:51:39 + (GMT)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  I still can't see what's wrong with the code that's already
  there.  My brain is seizing up, I'm taking a break.
 
 Ok, meanwhile I'll do a brain dump of what I think this
 code should be doing.
 
 Let's take an example free_pgd_range() call.  Say the
 address parameters are:
 
 addr  0x1
 end   0xa4000
 floor 0x0
 ceiling   0xb2000

This actual example helped to focus my mind a lot, thank you.

 (This example comes from my exit_mmap() VMA dump earlier
  in this thread.  If you disable the VMA skipping optimization
  the first call to free_pgd_range() has these parameters.)
 
 What ought this free_pgd_range() call do?  This range of
 addresses, from floor to ceiling, is smaller than a PMD_SIZE
 (which on sparc64 is 1  23).  Therefore it should clear
 no PGD or PUD entries.

Yup, it ought to decide at the beginning of free_pgd_range
that it simply has no work to do.

 Yet, it does clear them, specifically:
 
 free_pgd_range():
   1) mask addr (0x1) to PMD_MASK, addr is now 0
   2) addr  floor (0x0) test does not pass
   3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too
   4) end - 1  ceiling - 1 test does not pass
   5) addr  end - 1 test does not pass either

And now we've gone wrong, yes.

 The source of the problems seems to be how ceiling began
 at the top of the call chain as 0xb2000, but when we
 masked it with PMD_MASK that set it to zero, which means
 top of address space in these functions.  That's not
 what we want.
 
 I added a quick hack to the simulator I posted, where
 we mask ceiling in free_pgd_range(), I do it like this:
 
   if (ceiling) {
   ceiling = PMD_MASK;
   if (!ceiling)
   return;
   }

At first that just looked like a hack to me.  But on reflection,
no, you're doing exactly what I had to do with addr above: in the
case where we arrive at 0 from non-0 value, have to get out quick
to avoid confusion with the other 0.  These wrap issues are hard.

And in other mail I see you found more such checks were needed.
I believe you've got it, thank you so much!

Though frankly, by now, I'm sure of nothing:
will review in the morning.

 and things seem to behave.  I'll try to analyze things
 further and test this out on a real kernel, but all of
 these adjustments at the top of free_pgd_range() really
 start to look like pure spaghetti. :-)

Well, it's trying to decide in reasonably few steps that it's not
worth wasting time going down to the deeper levels.  Lots of
returns as it eliminates cases, yes.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Hugh Dickins
On Tue, 22 Mar 2005, Luck, Tony wrote:
 
 But I'm still confused by all the math on addr/end at each
 level.

You think the rest of us are not ;-?

 Rounding up/down at each level should presumably be
 based on the size of objects at the next level.  So the pgd
 code should round using PUD_MASK, pud should use PMD_MASK etc.
 Perhaps I missed some updates, but the version of the patch
 that I have (and the simulator) is using PMD_MASK in the
 pgd_free_range() function ... which is surely wrong.

It's confusing but not wrong (in principle).  It's trying to decide
immediately on entry whether it will be worth going down to the lower
levels: if even the lowest level will have no work to do, no point in
proceeding.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Andrew Morton
Hugh Dickins [EMAIL PROTECTED] wrote:

 On Tue, 22 Mar 2005, Luck, Tony wrote:
   
   But I'm still confused by all the math on addr/end at each
   level.
 
  You think the rest of us are not ;-?

umm, given the difficulty which you guys are having with this, I get a bit
worried about clarity, simplicity and maintainability of the end result.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Tue, 22 Mar 2005 17:10:13 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  On Tue, 22 Mar 2005, Luck, Tony wrote:

But I'm still confused by all the math on addr/end at each
level.
  
   You think the rest of us are not ;-?
 
 umm, given the difficulty which you guys are having with this, I get a bit
 worried about clarity, simplicity and maintainability of the end result.

We're working on it, trust me :-)

I have a simplification in mind that should take care of the issue
that led us to these problems.  We should simply pass in ceiling
as -1 instead of 0.  Every single test against ceiling is
really done against ceiling - 1.

Therefore, passing ceiling in as top - 1 and then adjusting the
tests will clean this up substantially and make is much simpler.

I'm even sure that other similar refactoring is possible.

Hugh took the quantum leap we needed by implementing this at all,
some spaghetti code tests do not detract from his work conceptually.
That kind of stuff can be cleaned up.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 00:51:02 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

 This actual example helped to focus my mind a lot, thank you.

No problem, I needed to work through specific examples
to see things clearly too.

  and things seem to behave.  I'll try to analyze things
  further and test this out on a real kernel, but all of
  these adjustments at the top of free_pgd_range() really
  start to look like pure spaghetti. :-)
 
 Well, it's trying to decide in reasonably few steps that it's not
 worth wasting time going down to the deeper levels.  Lots of
 returns as it eliminates cases, yes.

Yes, I understand.

But let's recognize (as I mention in another email) that all of
the tests against ceiling are against ceiling - 1.  If we pass
-1 instead of 0 (and foo - 1 instead of foo) as the ceiling
arg, then adjust the tests to be against plain ceiling, so much
of the special casing disappears.

There are probably other simplifications.

This is kind of what I was hinting at when I said it looks like
spaghetti.  :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread Nick Piggin
David S. Miller wrote:
On Tue, 22 Mar 2005 17:10:13 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

Hugh Dickins [EMAIL PROTECTED] wrote:
On Tue, 22 Mar 2005, Luck, Tony wrote:
 
 But I'm still confused by all the math on addr/end at each
 level.

You think the rest of us are not ;-?
umm, given the difficulty which you guys are having with this, I get a bit
worried about clarity, simplicity and maintainability of the end result.
We're working on it, trust me :-)
I have a simplification in mind that should take care of the issue
that led us to these problems.  We should simply pass in ceiling
as -1 instead of 0.  Every single test against ceiling is
really done against ceiling - 1.
Therefore, passing ceiling in as top - 1 and then adjusting the
tests will clean this up substantially and make is much simpler.

The ugly thing you get with an inclusive ceiling is that your masking
becomes more difficult I think.
I might try to attack this from another angle and see if I can come up
with something.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-22 Thread David S. Miller
On Wed, 23 Mar 2005 13:10:42 +1100
Nick Piggin [EMAIL PROTECTED] wrote:

 The ugly thing you get with an inclusive ceiling is that your masking
 becomes more difficult I think.

Good point.

 I might try to attack this from another angle and see if I can come up
 with something.

Great, let me know if you want something tested out on sparc64.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Luck, Tony
Builds clean and boots on ia64.

I haven't tried any hugetlb operations on it though.

-Tony
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Nick Piggin
Hugh Dickins wrote:
On Mon, 21 Mar 2005, David S. Miller wrote:
On Tue, 22 Mar 2005 15:14:54 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

Question, Dave: flush_tlb_pgtables after Hugh's patch is also
possibly not being called with enough range to cover all page
tables that have been freed.

Good question from Nick.

For example, you may have a single page (start,end) address range
to free, but if this is enclosed by a large enough (floor,ceiling)
then it may free an entire pgd entry.
I assume the intention of the API would be to provide the full
pgd width in that case?
It just wants the range of page tables liberated.  I guess
essentially PMD_SIZE is the granularity.

I _think_ that answer means that my current code is fine in this respect.
But I'm not entirely convinced.  Since sparc64 is the only architecture
which implements a flush_tlb_pgtables which actually uses start,end,
we do need to suit your needs there - informed reassurance welcome!
But Hugh I think you may still be freeing pgd (PGDIR_SIZE)
regions that you don't quite cover with flush_tlb_pgtables?
I would have thought you'd need something like this:
if (!tlb_is_full_mm(tlb)) {
/* This is the full range of page tables we could possibly free 
*/
start = max(start & PGDIR_SIZE, (floor + PMD_SIZE - 1) & 
PMD_SIZE);
end = min((end + PGDIR_SIZE - 1) & PGDIR_SIZE, ceiling & 
PMD_SIZE);
flush_tlb_pgtables(tlb->mm, start, end);
}
But I'll have to go away and look at it more...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
On Mon, 21 Mar 2005, David S. Miller wrote:
> On Tue, 22 Mar 2005 15:14:54 +1100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Question, Dave: flush_tlb_pgtables after Hugh's patch is also
> > possibly not being called with enough range to cover all page
> > tables that have been freed.

Good question from Nick.

> > For example, you may have a single page (start,end) address range
> > to free, but if this is enclosed by a large enough (floor,ceiling)
> > then it may free an entire pgd entry.
> > 
> > I assume the intention of the API would be to provide the full
> > pgd width in that case?
> 
> It just wants the range of page tables liberated.  I guess
> essentially PMD_SIZE is the granularity.

I _think_ that answer means that my current code is fine in this respect.
But I'm not entirely convinced.  Since sparc64 is the only architecture
which implements a flush_tlb_pgtables which actually uses start,end,
we do need to suit your needs there - informed reassurance welcome!

> Anyways, for the record I made it only call flush_tlb_pgtables()
> when end > start,

Fine.  The patch I just sent, moving the tests, is how I'd like it to
be fixed finally, but what you've done in the interim should do fine.

> but instead of that BUG() I now get the BUG()
> on mm->nr_ptes being non-zero at the end of exit_mmap().

And no way would your change be causing this.  Oh dear.

> Something is up with the floor/ceiling stuff methinks.

Seems so.  I did have an off-by-one originally,
but that version never leaked out.

> It's funny since this code aparently works fine on ia64 which
> is fully 3-level too.  Hmm...

Yes, odd.  I'll have to have another think later on.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
On Mon, 21 Mar 2005, David S. Miller wrote:
> 
> flush_tlb_pgtables() on sparc64 has a BUG() check which
> is basically:
> 
>   BUG((long)start > (long)end);
> 
> This catches two cases of bogus arguments:
> 
> 1) start --> end straddles sparc64 address space hole

That's an interesting remark.  I hadn't noticed the signed long type.
I believe the vma gathering in free_pgtables will have no problem with
that, but what about the old code?  What happens if an app does a huge
munmap straddling the address space hole?  Or is all the user address
space below the hole?

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
Many thanks for the testing.

On Mon, 21 Mar 2005, David S. Miller wrote:
> 
> This adjustment of addr relative to floor is very
> strange, it can advance "addr" (and thus "start")
> past the end of the VMA we are unmapping.

Not strange, it's just trying to skip a pointless iteration.

> In fact, it is miraculious that this free_pud_range()
> calling loop terminates properly!  Actually, it is
> no mystery, since the next PGD address is the same
> for both the original and adjusted value of "addr".
> So the loop terminates after the first iteration.

Miraculous indeed, I hadn't realized those do {} while () loops
were as robust as that.  But it's certainly wrong to have been
calling them once in this case, even if they did recover.

> Anyways, there's the full analysis, what do you make
> of this Hugh? :-)

Much better than I deserve.  Silly me.  This patch should fix it.

[PATCH 6/5] freepgt: fix addr,end check

Fix placing of free_p?d_range's check for addr against end.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>

--- freepgt5/mm/memory.c2005-03-22 04:28:40.0 +
+++ freepgt6/mm/memory.c2005-03-22 05:11:05.0 +
@@ -127,11 +127,6 @@ static inline void free_pmd_range(struct
unsigned long next;
unsigned long start;
 
-   if (end - 1 > ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr > end - 1)
-   return;
-
start = addr;
pmd = pmd_offset(pud, addr);
do {
@@ -202,7 +197,9 @@ void free_pgd_range(struct mmu_gather **
return;
}
ceiling &= PMD_MASK;
-   if (addr > ceiling - 1)
+   if (end - 1 > ceiling - 1)
+   end -= PMD_SIZE;
+   if (addr > end - 1)
return;
 
start = addr;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread David S. Miller
On Tue, 22 Mar 2005 15:14:54 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Question, Dave: flush_tlb_pgtables after Hugh's patch is also
> possibly not being called with enough range to cover all page
> tables that have been freed.
> 
> For example, you may have a single page (start,end) address range
> to free, but if this is enclosed by a large enough (floor,ceiling)
> then it may free an entire pgd entry.
> 
> I assume the intention of the API would be to provide the full
> pgd width in that case?

It just wants the range of page tables liberated.  I guess
essentially PMD_SIZE is the granularity.

Anyways, for the record I made it only call flush_tlb_pgtables()
when end > start, but instead of that BUG() I now get the BUG()
on mm->nr_ptes being non-zero at the end of exit_mmap().

Something is up with the floor/ceiling stuff methinks.

It's funny since this code aparently works fine on ia64 which
is fully 3-level too.  Hmm...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Nick Piggin
On Mon, 2005-03-21 at 15:02 -0800, David S. Miller wrote:

> Anyways, there's the full analysis, what do you make
> of this Hugh? :-)

Impressive, and my name isn't even Hugh.

Question, Dave: flush_tlb_pgtables after Hugh's patch is also
possibly not being called with enough range to cover all page
tables that have been freed.

For example, you may have a single page (start,end) address range
to free, but if this is enclosed by a large enough (floor,ceiling)
then it may free an entire pgd entry.

I assume the intention of the API would be to provide the full
pgd width in that case?




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread David S. Miller
On Mon, 21 Mar 2005 14:31:36 -0800
"Luck, Tony" <[EMAIL PROTECTED]> wrote:

> Builds clean and boots on ia64.
> 
> I haven't tried any hugetlb operations on it though.

It works on ia64 because it doesn't actually do anything
in flush_tlb_pgtables(), I bet.

Hugh, I know the exact trigger case, it's unmapping a VMA
right before the stack segment.  So the free_pgtables() call
happens with this state:

prev->vm_end== 0x70186000
next->vm_start  == 0xefab8000
vma->vm_start   == 0x70186000
vma->vm_end == 0x70188000

(so we're doing munmap(0x70186000, PAGE_SIZE), the sparc64
 stack segment for 32-bit tasks grows down from 0xf000,
 the bottom of it is at 0xefab8000 at this point in time)

So the free_pgtables() call will be with:

floor   == 0x70186000
ceiling == 0xefab8000

This should be fairly simple, so let's analyze exactly what
happens:

1) vma == the munmap() call's VMA
   next == stack segment VMA, which sits right after "vma"
   addr == 0x70186000 (base of munmap() area)

2) VMA optimization loop runs:
  next->vm_start is 0xefab8000 
  vma->vm_end is 0x70188000
  on sparc64 PMD_SIZE is 1UL << 23 or 0x80
  therefore vma->vm_end + (2 * PMD_SIZE) is 0x71188000
  this is much less than 0xefab8000 so the loop terminates
  immediately

Therefore, next is unchanged.

3) free_pgd_range() is invoked with:
addr== 0x70186000
end == 0x70188000
floor   == 0x70186000
ceiling == 0xefab8000

4) We mask addr with PMD_MASK (which is 0xff80)
   This sets addr to 0x7000, which makes it less
   than floor, therefore addr has PMD_SIZE added to it.
   Now, addr is 0x7080, this is the source of the
   problems as this value determines the "start" argument
   passed to flush_tlb_pgtables().  Note how it is larger
   than "end".

5) We also mask ceiling with PMD_MASK.
   This sets ceiling to 0xef80.
   Now addr is less than or equal to ceiling - 1 so
   we continue.

6) start is set to addr, which as stated is 0x7080,
   the free_pud_range() loop is executed

7) start is 0x7080 and end is 0x70188000
   and here we have the problem that start > end,
   flush_tlb_pgtables() is called with the arguments
   like this and we trigger the aforementioned BUG().

This adjustment of addr relative to floor is very
strange, it can advance "addr" (and thus "start")
past the end of the VMA we are unmapping.

In fact, it is miraculious that this free_pud_range()
calling loop terminates properly!  Actually, it is
no mystery, since the next PGD address is the same
for both the original and adjusted value of "addr".
So the loop terminates after the first iteration.

Anyways, there's the full analysis, what do you make
of this Hugh? :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread David S. Miller
On Mon, 21 Mar 2005 20:52:44 + (GMT)
Hugh Dickins <[EMAIL PROTECTED]> wrote:

Hugh, I'm getting some problems on sparc64 here:

> +static inline void free_pgd_range(struct mmu_gather *tlb,
> + unsigned long addr, unsigned long end,
> + unsigned long floor, unsigned long ceiling)
>  {
>   pgd_t *pgd;
>   unsigned long next;
> + unsigned long start;
>  
> + addr &= PMD_MASK;
> + if (addr < floor) {
> + addr += PMD_SIZE;
> + if (!addr)
> + return;
> + }
> + ceiling &= PMD_MASK;
> + if (addr > ceiling - 1)
> + return;
> +
> + start = addr;
>   pgd = pgd_offset(tlb->mm, addr);
>   do {
>   next = pgd_addr_end(addr, end);
>   if (pgd_none_or_clear_bad(pgd))
>   continue;
> - clear_pud_range(tlb, pgd, addr, next);
> + free_pud_range(tlb, pgd, addr, next, floor, ceiling);
>   } while (pgd++, addr = next, addr != end);
> +
> + if (!tlb_is_full_mm(tlb))
> + flush_tlb_pgtables(tlb->mm, start, end);
> +}

flush_tlb_pgtables() on sparc64 has a BUG() check which
is basically:

BUG((long)start > (long)end);

This catches two cases of bogus arguments:

1) start --> end straddles sparc64 address space hole
2) start > end

With your changes, this triggers on even the first user
process execution.  Specifically I get the first
trap with start=0x7080 and end=0x70188000.  (these
addresses are in the region where generally 32-bit tasks
get their non-fixed mmap() requests satisfied, so these
are probably shared library chunks).

I think the VMA gathering optimization one level up in
free_pgtables() has some logic error in it.   Maybe...

I'll try to figure out what exactly is going on, but
perhaps you can spot it before me. :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread David S. Miller
On Mon, 21 Mar 2005 20:52:44 + (GMT)
Hugh Dickins [EMAIL PROTECTED] wrote:

Hugh, I'm getting some problems on sparc64 here:

 +static inline void free_pgd_range(struct mmu_gather *tlb,
 + unsigned long addr, unsigned long end,
 + unsigned long floor, unsigned long ceiling)
  {
   pgd_t *pgd;
   unsigned long next;
 + unsigned long start;
  
 + addr = PMD_MASK;
 + if (addr  floor) {
 + addr += PMD_SIZE;
 + if (!addr)
 + return;
 + }
 + ceiling = PMD_MASK;
 + if (addr  ceiling - 1)
 + return;
 +
 + start = addr;
   pgd = pgd_offset(tlb-mm, addr);
   do {
   next = pgd_addr_end(addr, end);
   if (pgd_none_or_clear_bad(pgd))
   continue;
 - clear_pud_range(tlb, pgd, addr, next);
 + free_pud_range(tlb, pgd, addr, next, floor, ceiling);
   } while (pgd++, addr = next, addr != end);
 +
 + if (!tlb_is_full_mm(tlb))
 + flush_tlb_pgtables(tlb-mm, start, end);
 +}

flush_tlb_pgtables() on sparc64 has a BUG() check which
is basically:

BUG((long)start  (long)end);

This catches two cases of bogus arguments:

1) start -- end straddles sparc64 address space hole
2) start  end

With your changes, this triggers on even the first user
process execution.  Specifically I get the first
trap with start=0x7080 and end=0x70188000.  (these
addresses are in the region where generally 32-bit tasks
get their non-fixed mmap() requests satisfied, so these
are probably shared library chunks).

I think the VMA gathering optimization one level up in
free_pgtables() has some logic error in it.   Maybe...

I'll try to figure out what exactly is going on, but
perhaps you can spot it before me. :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread David S. Miller
On Mon, 21 Mar 2005 14:31:36 -0800
Luck, Tony [EMAIL PROTECTED] wrote:

 Builds clean and boots on ia64.
 
 I haven't tried any hugetlb operations on it though.

It works on ia64 because it doesn't actually do anything
in flush_tlb_pgtables(), I bet.

Hugh, I know the exact trigger case, it's unmapping a VMA
right before the stack segment.  So the free_pgtables() call
happens with this state:

prev-vm_end== 0x70186000
next-vm_start  == 0xefab8000
vma-vm_start   == 0x70186000
vma-vm_end == 0x70188000

(so we're doing munmap(0x70186000, PAGE_SIZE), the sparc64
 stack segment for 32-bit tasks grows down from 0xf000,
 the bottom of it is at 0xefab8000 at this point in time)

So the free_pgtables() call will be with:

floor   == 0x70186000
ceiling == 0xefab8000

This should be fairly simple, so let's analyze exactly what
happens:

1) vma == the munmap() call's VMA
   next == stack segment VMA, which sits right after vma
   addr == 0x70186000 (base of munmap() area)

2) VMA optimization loop runs:
  next-vm_start is 0xefab8000 
  vma-vm_end is 0x70188000
  on sparc64 PMD_SIZE is 1UL  23 or 0x80
  therefore vma-vm_end + (2 * PMD_SIZE) is 0x71188000
  this is much less than 0xefab8000 so the loop terminates
  immediately

Therefore, next is unchanged.

3) free_pgd_range() is invoked with:
addr== 0x70186000
end == 0x70188000
floor   == 0x70186000
ceiling == 0xefab8000

4) We mask addr with PMD_MASK (which is 0xff80)
   This sets addr to 0x7000, which makes it less
   than floor, therefore addr has PMD_SIZE added to it.
   Now, addr is 0x7080, this is the source of the
   problems as this value determines the start argument
   passed to flush_tlb_pgtables().  Note how it is larger
   than end.

5) We also mask ceiling with PMD_MASK.
   This sets ceiling to 0xef80.
   Now addr is less than or equal to ceiling - 1 so
   we continue.

6) start is set to addr, which as stated is 0x7080,
   the free_pud_range() loop is executed

7) start is 0x7080 and end is 0x70188000
   and here we have the problem that start  end,
   flush_tlb_pgtables() is called with the arguments
   like this and we trigger the aforementioned BUG().

This adjustment of addr relative to floor is very
strange, it can advance addr (and thus start)
past the end of the VMA we are unmapping.

In fact, it is miraculious that this free_pud_range()
calling loop terminates properly!  Actually, it is
no mystery, since the next PGD address is the same
for both the original and adjusted value of addr.
So the loop terminates after the first iteration.

Anyways, there's the full analysis, what do you make
of this Hugh? :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Nick Piggin
On Mon, 2005-03-21 at 15:02 -0800, David S. Miller wrote:

 Anyways, there's the full analysis, what do you make
 of this Hugh? :-)

Impressive, and my name isn't even Hugh.

Question, Dave: flush_tlb_pgtables after Hugh's patch is also
possibly not being called with enough range to cover all page
tables that have been freed.

For example, you may have a single page (start,end) address range
to free, but if this is enclosed by a large enough (floor,ceiling)
then it may free an entire pgd entry.

I assume the intention of the API would be to provide the full
pgd width in that case?




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
Many thanks for the testing.

On Mon, 21 Mar 2005, David S. Miller wrote:
 
 This adjustment of addr relative to floor is very
 strange, it can advance addr (and thus start)
 past the end of the VMA we are unmapping.

Not strange, it's just trying to skip a pointless iteration.

 In fact, it is miraculious that this free_pud_range()
 calling loop terminates properly!  Actually, it is
 no mystery, since the next PGD address is the same
 for both the original and adjusted value of addr.
 So the loop terminates after the first iteration.

Miraculous indeed, I hadn't realized those do {} while () loops
were as robust as that.  But it's certainly wrong to have been
calling them once in this case, even if they did recover.

 Anyways, there's the full analysis, what do you make
 of this Hugh? :-)

Much better than I deserve.  Silly me.  This patch should fix it.

[PATCH 6/5] freepgt: fix addr,end check

Fix placing of free_p?d_range's check for addr against end.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]

--- freepgt5/mm/memory.c2005-03-22 04:28:40.0 +
+++ freepgt6/mm/memory.c2005-03-22 05:11:05.0 +
@@ -127,11 +127,6 @@ static inline void free_pmd_range(struct
unsigned long next;
unsigned long start;
 
-   if (end - 1  ceiling - 1)
-   end -= PMD_SIZE;
-   if (addr  end - 1)
-   return;
-
start = addr;
pmd = pmd_offset(pud, addr);
do {
@@ -202,7 +197,9 @@ void free_pgd_range(struct mmu_gather **
return;
}
ceiling = PMD_MASK;
-   if (addr  ceiling - 1)
+   if (end - 1  ceiling - 1)
+   end -= PMD_SIZE;
+   if (addr  end - 1)
return;
 
start = addr;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
On Mon, 21 Mar 2005, David S. Miller wrote:
 
 flush_tlb_pgtables() on sparc64 has a BUG() check which
 is basically:
 
   BUG((long)start  (long)end);
 
 This catches two cases of bogus arguments:
 
 1) start -- end straddles sparc64 address space hole

That's an interesting remark.  I hadn't noticed the signed long type.
I believe the vma gathering in free_pgtables will have no problem with
that, but what about the old code?  What happens if an app does a huge
munmap straddling the address space hole?  Or is all the user address
space below the hole?

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Hugh Dickins
On Mon, 21 Mar 2005, David S. Miller wrote:
 On Tue, 22 Mar 2005 15:14:54 +1100
 Nick Piggin [EMAIL PROTECTED] wrote:
 
  Question, Dave: flush_tlb_pgtables after Hugh's patch is also
  possibly not being called with enough range to cover all page
  tables that have been freed.

Good question from Nick.

  For example, you may have a single page (start,end) address range
  to free, but if this is enclosed by a large enough (floor,ceiling)
  then it may free an entire pgd entry.
  
  I assume the intention of the API would be to provide the full
  pgd width in that case?
 
 It just wants the range of page tables liberated.  I guess
 essentially PMD_SIZE is the granularity.

I _think_ that answer means that my current code is fine in this respect.
But I'm not entirely convinced.  Since sparc64 is the only architecture
which implements a flush_tlb_pgtables which actually uses start,end,
we do need to suit your needs there - informed reassurance welcome!

 Anyways, for the record I made it only call flush_tlb_pgtables()
 when end  start,

Fine.  The patch I just sent, moving the tests, is how I'd like it to
be fixed finally, but what you've done in the interim should do fine.

 but instead of that BUG() I now get the BUG()
 on mm-nr_ptes being non-zero at the end of exit_mmap().

And no way would your change be causing this.  Oh dear.

 Something is up with the floor/ceiling stuff methinks.

Seems so.  I did have an off-by-one originally,
but that version never leaked out.

 It's funny since this code aparently works fine on ia64 which
 is fully 3-level too.  Hmm...

Yes, odd.  I'll have to have another think later on.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Nick Piggin
Hugh Dickins wrote:
On Mon, 21 Mar 2005, David S. Miller wrote:
On Tue, 22 Mar 2005 15:14:54 +1100
Nick Piggin [EMAIL PROTECTED] wrote:

Question, Dave: flush_tlb_pgtables after Hugh's patch is also
possibly not being called with enough range to cover all page
tables that have been freed.

Good question from Nick.

For example, you may have a single page (start,end) address range
to free, but if this is enclosed by a large enough (floor,ceiling)
then it may free an entire pgd entry.
I assume the intention of the API would be to provide the full
pgd width in that case?
It just wants the range of page tables liberated.  I guess
essentially PMD_SIZE is the granularity.

I _think_ that answer means that my current code is fine in this respect.
But I'm not entirely convinced.  Since sparc64 is the only architecture
which implements a flush_tlb_pgtables which actually uses start,end,
we do need to suit your needs there - informed reassurance welcome!
But Hugh I think you may still be freeing pgd (PGDIR_SIZE)
regions that you don't quite cover with flush_tlb_pgtables?
I would have thought you'd need something like this:
if (!tlb_is_full_mm(tlb)) {
/* This is the full range of page tables we could possibly free 
*/
start = max(start  PGDIR_SIZE, (floor + PMD_SIZE - 1)  
PMD_SIZE);
end = min((end + PGDIR_SIZE - 1)  PGDIR_SIZE, ceiling  
PMD_SIZE);
flush_tlb_pgtables(tlb-mm, start, end);
}
But I'll have to go away and look at it more...
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 1/5] freepgt: free_pgtables use vma list

2005-03-21 Thread Luck, Tony
Builds clean and boots on ia64.

I haven't tried any hugetlb operations on it though.

-Tony
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/