Re: [PATCH 1/5] freepgt: free_pgtables use vma list
> Thanks for the warning, Ben, but I don't see a problem there: that's > in your separate ioremap_mm, which is rather like init_mm, and won't > ever go through exit_mmap, and doesn't need its page tables freed - > isn't that right? Right. > But it was worth auditing the different architectures for this: there > seems to be just one problem, where the x86_64 32-bit vsyscall page > is mapped into userspace by __map_syscall32 without linking a real > vma into mm. Which Andi has already marked with his "RED-PEN". The ppc64 vDSO is mapped with a real VMA bot not mmap call (the vDMA is built from scratch from binfmt_elf, or rather from an arch callback issued by binfmt_elf, like the stack VMA). Though should be fine too though but you may want to double check. > That's not something I can fix up in a hurry. Yes, as the comment > suggests we should probably allocate a real vma for it, but that might > have a few consequences (if only /proc//maps showing two vdsos?). > I'll have to let someone else deal with that. Why 2 ? we map the 32 bits one for 32 bits processes and the 64 bits one for 64 bits processes on ppc64 without problem ... In fact, Andi could even re-use our hook I suppose. The way I do it allows also for free copy-on-write semantics (with mprotect though, I don't set it writeable by default) so that gdb can put breakpoints in it, etc... > For the moment, I think the behaviour of x86_64 32bit-support with > the freepgt patches will depend on personality - ADDR_LIMIT_32BIT > should usually work fine (unless the app moves its stack elsewhere > and munmaps the old one: that might well unmap its vdso too); but > ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or > its /proc//maps has been examined). I don't know how common > ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for > including the patches in a release. > > Hugh -- Benjamin Herrenschmidt <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Wed, 23 Mar 2005, Benjamin Herrenschmidt wrote: > On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote: > > > > I cannot see those arches doing pte_allocs outside their vmas, > > that of course could cause it. And nr_ptes is initialized to 0 > > once by memset and again by assignment, so it should be starting > > out even zeroer than most fields. > > We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we > don't use VMAs but our own mecanism (yah, ugly, but that's some legacy > we have from the original port, though I do intend to change that at one > point). Thanks for the warning, Ben, but I don't see a problem there: that's in your separate ioremap_mm, which is rather like init_mm, and won't ever go through exit_mmap, and doesn't need its page tables freed - isn't that right? But it was worth auditing the different architectures for this: there seems to be just one problem, where the x86_64 32-bit vsyscall page is mapped into userspace by __map_syscall32 without linking a real vma into mm. Which Andi has already marked with his "RED-PEN". That's not something I can fix up in a hurry. Yes, as the comment suggests we should probably allocate a real vma for it, but that might have a few consequences (if only /proc//maps showing two vdsos?). I'll have to let someone else deal with that. For the moment, I think the behaviour of x86_64 32bit-support with the freepgt patches will depend on personality - ADDR_LIMIT_32BIT should usually work fine (unless the app moves its stack elsewhere and munmaps the old one: that might well unmap its vdso too); but ADDR_LIMIT_3GB will be liable to leak tables (if get_user_pages or its /proc//maps has been examined). I don't know how common ADDR_LIMIT_3GB use is, but whatever: okay for testing, but not for including the patches in a release. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Wed, 23 Mar 2005 13:10:42 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: > The ugly thing you get with an inclusive ceiling is that your masking > becomes more difficult I think. Good point. > I might try to attack this from another angle and see if I can come up > with something. Great, let me know if you want something tested out on sparc64. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
David S. Miller wrote: On Tue, 22 Mar 2005 17:10:13 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: Hugh Dickins <[EMAIL PROTECTED]> wrote: On Tue, 22 Mar 2005, Luck, Tony wrote: > > But I'm still confused by all the math on addr/end at each > level. You think the rest of us are not ;-? umm, given the difficulty which you guys are having with this, I get a bit worried about clarity, simplicity and maintainability of the end result. We're working on it, trust me :-) I have a simplification in mind that should take care of the issue that led us to these problems. We should simply pass in "ceiling" as "-1" instead of "0". Every single test against ceiling is really done against "ceiling - 1". Therefore, passing ceiling in as "top - 1" and then adjusting the tests will clean this up substantially and make is much simpler. The ugly thing you get with an inclusive ceiling is that your masking becomes more difficult I think. I might try to attack this from another angle and see if I can come up with something. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Wed, 23 Mar 2005 00:51:02 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > This actual example helped to focus my mind a lot, thank you. No problem, I needed to work through specific examples to see things clearly too. > > and things seem to behave. I'll try to analyze things > > further and test this out on a real kernel, but all of > > these adjustments at the top of free_pgd_range() really > > start to look like pure spaghetti. :-) > > Well, it's trying to decide in reasonably few steps that it's not > worth wasting time going down to the deeper levels. Lots of > "return"s as it eliminates cases, yes. Yes, I understand. But let's recognize (as I mention in another email) that all of the tests against ceiling are against "ceiling - 1". If we pass -1 instead of 0 (and "foo - 1" instead of "foo") as the ceiling arg, then adjust the tests to be against plain "ceiling", so much of the special casing disappears. There are probably other simplifications. This is kind of what I was hinting at when I said it looks like spaghetti. :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 17:10:13 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > > On Tue, 22 Mar 2005, Luck, Tony wrote: > > > > > > But I'm still confused by all the math on addr/end at each > > > level. > > > > You think the rest of us are not ;-? > > umm, given the difficulty which you guys are having with this, I get a bit > worried about clarity, simplicity and maintainability of the end result. We're working on it, trust me :-) I have a simplification in mind that should take care of the issue that led us to these problems. We should simply pass in "ceiling" as "-1" instead of "0". Every single test against ceiling is really done against "ceiling - 1". Therefore, passing ceiling in as "top - 1" and then adjusting the tests will clean this up substantially and make is much simpler. I'm even sure that other similar refactoring is possible. Hugh took the quantum leap we needed by implementing this at all, some spaghetti code tests do not detract from his work conceptually. That kind of stuff can be cleaned up. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Hugh Dickins <[EMAIL PROTECTED]> wrote: > > On Tue, 22 Mar 2005, Luck, Tony wrote: > > > > But I'm still confused by all the math on addr/end at each > > level. > > You think the rest of us are not ;-? umm, given the difficulty which you guys are having with this, I get a bit worried about clarity, simplicity and maintainability of the end result. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, Luck, Tony wrote: > > But I'm still confused by all the math on addr/end at each > level. You think the rest of us are not ;-? > Rounding up/down at each level should presumably be > based on the size of objects at the next level. So the pgd > code should round using PUD_MASK, pud should use PMD_MASK etc. > Perhaps I missed some updates, but the version of the patch > that I have (and the simulator) is using PMD_MASK in the > pgd_free_range() function ... which is surely wrong. It's confusing but not wrong (in principle). It's trying to decide immediately on entry whether it will be worth going down to the lower levels: if even the lowest level will have no work to do, no point in proceeding. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, David S. Miller wrote: > On Tue, 22 Mar 2005 21:51:39 + (GMT) > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > I still can't see what's wrong with the code that's already > > there. My brain is seizing up, I'm taking a break. > > Ok, meanwhile I'll do a brain dump of what I think this > code should be doing. > > Let's take an example free_pgd_range() call. Say the > address parameters are: > > addr 0x1 > end 0xa4000 > floor 0x0 > ceiling 0xb2000 This actual example helped to focus my mind a lot, thank you. > (This example comes from my exit_mmap() VMA dump earlier > in this thread. If you disable the VMA skipping optimization > the first call to free_pgd_range() has these parameters.) > > What ought this free_pgd_range() call do? This range of > addresses, from floor to ceiling, is smaller than a PMD_SIZE > (which on sparc64 is 1 << 23). Therefore it should clear > no PGD or PUD entries. Yup, it ought to decide at the beginning of free_pgd_range that it simply has no work to do. > Yet, it does clear them, specifically: > > free_pgd_range(): > 1) mask addr (0x1) to PMD_MASK, addr is now 0 > 2) addr < floor (0x0) test does not pass > 3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too > 4) end - 1 > ceiling - 1 test does not pass > 5) addr > end - 1 test does not pass either And now we've gone wrong, yes. > The source of the problems seems to be how ceiling began > at the top of the call chain as 0xb2000, but when we > masked it with PMD_MASK that set it to zero, which means > "top of address space" in these functions. That's not > what we want. > > I added a quick hack to the simulator I posted, where > we mask ceiling in free_pgd_range(), I do it like this: > > if (ceiling) { > ceiling &= PMD_MASK; > if (!ceiling) > return; > } At first that just looked like a hack to me. But on reflection, no, you're doing exactly what I had to do with addr above: in the case where we arrive at 0 from non-0 value, have to get out quick to avoid confusion with the "other" 0. These wrap issues are hard. And in other mail I see you found more such checks were needed. I believe you've got it, thank you so much! Though frankly, by now, I'm sure of nothing: will review in the morning. > and things seem to behave. I'll try to analyze things > further and test this out on a real kernel, but all of > these adjustments at the top of free_pgd_range() really > start to look like pure spaghetti. :-) Well, it's trying to decide in reasonably few steps that it's not worth wasting time going down to the deeper levels. Lots of "return"s as it eliminates cases, yes. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, Luck, Tony wrote: > > Alternatively you could modify the use of floor/ceiling as they > are passed down from the top level to indicate the progressively > greater address ranges that have been dealt with ... but I'm not > completely convinced that gives you enough information. You would > need to do careful extension of the ceiling at each level to make > sure that you reach out to the end of of each table at each level > to account for gaps between vmas (which would result in no future > calldown hitting this part of the table). That pretty much describes how it does work (when it works!) Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Wed, 23 Mar 2005 11:19:38 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: > > dramatically, shell performance is way down on sparc64. > > I'll post before and after numbers in a bit. Note, this is > > just with Hugh's base patch plus bug fixes. > > > > That's interesting. The only "extra" stuff Hugh's should be > doing is the vma traversal. Like I said in another posting, ignore this anomaly. It's some difference in the way the shell runs when done from "init=/bin/sh" single user vs. full multi-user. Full sparc64 before/after in that posting, go check it out it's nice :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
David S. Miller wrote: On Wed, 23 Mar 2005 10:32:10 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: I think David's on the right track - I think there's something a bit wrong at the top. In my reply to Andrew in this thread I posted a patch which may at least get things working... We have to do the "if (ceiling)" check in every spot where we mask it in some way, and if it falls to zero from non-zero due to the masking, we skip. That gives me a mostly working kernel. I see. Tricky. Bad news is that while lat_proc's fork and exec tests improve dramatically, shell performance is way down on sparc64. I'll post before and after numbers in a bit. Note, this is just with Hugh's base patch plus bug fixes. That's interesting. The only "extra" stuff Hugh's should be doing is the vma traversal. The single walk patch seems to fit in quite well with Hugh's updated patchset. I haven't quite worked out how to best do hugepages, but it is easily possible. But I won't dust that off again until Hugh's is nicely tested and working. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Ok, this patch, on top of Hugh's original freepgt patch, gets me a working system. It includes Hugh's bug fix, plus the ceiling masking roll-over fix of mine. It should get ppc working too, I bet. --- mm/memory.c.hugh2005-03-22 16:01:07.0 -0800 +++ mm/memory.c 2005-03-22 16:00:08.0 -0800 @@ -127,11 +127,6 @@ unsigned long next; unsigned long start; - if (end - 1 > ceiling - 1) - end -= PMD_SIZE; - if (addr > end - 1) - return; - start = addr; pmd = pmd_offset(pud, addr); do { @@ -144,7 +139,11 @@ start &= PUD_MASK; if (start < floor) return; - ceiling &= PUD_MASK; + if (ceiling) { + ceiling &= PUD_MASK; + if (!ceiling) + return; + } if (end - 1 > ceiling - 1) return; @@ -173,7 +172,11 @@ start &= PGDIR_MASK; if (start < floor) return; - ceiling &= PGDIR_MASK; + if (ceiling) { + ceiling &= PGDIR_MASK; + if (!ceiling) + return; + } if (end - 1 > ceiling - 1) return; @@ -201,8 +204,14 @@ if (!addr) return; } - ceiling &= PMD_MASK; - if (addr > ceiling - 1) + if (ceiling) { + ceiling &= PMD_MASK; + if (!ceiling) + return; + } + if (end - 1 > ceiling - 1) + end -= PMD_SIZE; + if (addr > end - 1) return; start = addr; @@ -214,7 +223,7 @@ free_pud_range(tlb, pgd, addr, next, floor, ceiling); } while (pgd++, addr = next, addr != end); - if (!tlb_is_full_mm(tlb) && start < end) + if (!tlb_is_full_mm(tlb)) flush_tlb_pgtables(tlb->mm, start, end); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Ok, here are (finally, I've been debugging this so much purely to see these things) some lmbench numbers with Hugh's base patch on sparc64. Ignore my previous comments about shell performance getting worse, it's some difference that makes things run more slowly in single user mode compared to a fully brought up system. First, for 32-bit tasks. BEFORE Process fork+exit: 171.4483 microseconds Process fork+exit: 171.9688 microseconds Process fork+exit: 169.2727 microseconds Process fork+exit: 169.0333 microseconds Process fork+exit: 165.8065 microseconds Process fork+execve: 555.7000 microseconds Process fork+execve: 556.6000 microseconds Process fork+execve: 552.6000 microseconds Process fork+execve: 557.1000 microseconds Process fork+execve: 552. microseconds Process fork+/bin/sh -c: 2207. microseconds Process fork+/bin/sh -c: 2183. microseconds Process fork+/bin/sh -c: 2179.6667 microseconds Process fork+/bin/sh -c: 2190. microseconds Process fork+/bin/sh -c: 2197.6667 microseconds AFTER Process fork+exit: 142.9487 microseconds Process fork+exit: 147.8649 microseconds Process fork+exit: 139.0250 microseconds Process fork+exit: 138.9250 microseconds Process fork+exit: 136.9268 microseconds Process fork+execve: 478. microseconds Process fork+execve: 479.1667 microseconds Process fork+execve: 479.9091 microseconds Process fork+execve: 480.1667 microseconds Process fork+execve: 479.9091 microseconds Process fork+/bin/sh -c: 2026. microseconds Process fork+/bin/sh -c: 2029.6667 microseconds Process fork+/bin/sh -c: 2044.6667 microseconds Process fork+/bin/sh -c: 2037.6667 microseconds Process fork+/bin/sh -c: 2028.6667 microseconds Pretty good, now for 64-bit processes. BEFORE Process fork+exit: 226.5200 microseconds Process fork+exit: 230.0417 microseconds Process fork+exit: 223.8800 microseconds Process fork+exit: 226.4091 microseconds Process fork+exit: 219.3043 microseconds Process fork+execve: 799.8571 microseconds Process fork+execve: 806.1429 microseconds Process fork+execve: 799.5714 microseconds Process fork+execve: 800.8571 microseconds Process fork+execve: 788.7143 microseconds Process fork+/bin/sh -c: 2655. microseconds Process fork+/bin/sh -c: 2668.5000 microseconds Process fork+/bin/sh -c: 2649. microseconds Process fork+/bin/sh -c: 2662.5000 microseconds Process fork+/bin/sh -c: 2642. microseconds AFTER Process fork+exit: 165.1212 microseconds Process fork+exit: 159.4571 microseconds Process fork+exit: 160.3714 microseconds Process fork+exit: 158.9091 microseconds Process fork+exit: 157.2188 microseconds Process fork+execve: 536.4545 microseconds Process fork+execve: 542.0909 microseconds Process fork+execve: 536.3000 microseconds Process fork+execve: 540.6364 microseconds Process fork+execve: 537.1818 microseconds Process fork+/bin/sh -c: 2275. microseconds Process fork+/bin/sh -c: 2272. microseconds Process fork+/bin/sh -c: 2275.6667 microseconds Process fork+/bin/sh -c: 2270. microseconds Process fork+/bin/sh -c: 2284. microseconds Quite nice. It makes the 64-bit numbers on par with the 32-bit numbers. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 15:53:08 -0800 "Luck, Tony" <[EMAIL PROTECTED]> wrote: > But I'm still confused by all the math on addr/end at each > level. Rounding up/down at each level should presumably be > based on the size of objects at the next level. So the pgd > code should round using PUD_MASK, pud should use PMD_MASK etc. > Perhaps I missed some updates, but the version of the patch > that I have (and the simulator) is using PMD_MASK in the > pgd_free_range() function ... which is surely wrong. PMD_MASK decides the smallest page table chunk, so we mask it at the top level. Look at the next level down in the call chain, the masking maskes more sense there. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
>How it works is that it knows the extent in each direction >where mappings do not exist. > >Once we know we have a clear span up to the next PMD_SIZE >modulo (and PUD_SIZE and so on and so forth) we know we >can liberate the page table chunks covered by such ranges. Ok ... I see that now (I was mostly looking at free_pgtables() and missed the conditional in the arglist that passes the start of the next vma into free_pgd_range() as the ceiling until we run out of vmas and pass in "ceiling". But I'm still confused by all the math on addr/end at each level. Rounding up/down at each level should presumably be based on the size of objects at the next level. So the pgd code should round using PUD_MASK, pud should use PMD_MASK etc. Perhaps I missed some updates, but the version of the patch that I have (and the simulator) is using PMD_MASK in the pgd_free_range() function ... which is surely wrong. -Tony - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Wed, 23 Mar 2005 10:32:10 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: > I think David's on the right track - I think there's something a > bit wrong at the top. In my reply to Andrew in this thread I > posted a patch which may at least get things working... We have to do the "if (ceiling)" check in every spot where we mask it in some way, and if it falls to zero from non-zero due to the masking, we skip. That gives me a mostly working kernel. Bad news is that while lat_proc's fork and exec tests improve dramatically, shell performance is way down on sparc64. I'll post before and after numbers in a bit. Note, this is just with Hugh's base patch plus bug fixes. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 2005-03-22 at 12:21 -0800, David S. Miller wrote: > On Tue, 22 Mar 2005 19:36:46 + (GMT) > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > I notice that although both i386 and sparc64 use pgtable-nopud.h, the > > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. > > Aha! And ppc does as well via asm-generic/4level-fixup.h which is > probably why I did it this way as well when I converted sparc64 > over to asm-generic/pgtable-nopud.h > > Hmmm, let me play around with this. I converted ppc64 to no-pud.h but it seems to introduce some breakage... X kills the box when launched on the G5, haven't had time to investigate yet, it might be some bug of mine. Here is my current patch: Index: linux-work/include/asm-ppc64/pgalloc.h === --- linux-work.orig/include/asm-ppc64/pgalloc.h 2005-03-07 14:00:11.0 +1100 +++ linux-work/include/asm-ppc64/pgalloc.h 2005-03-17 17:00:50.0 +1100 @@ -30,7 +30,7 @@ kmem_cache_free(zero_cache, pgd); } -#define pgd_populate(MM, PGD, PMD) pgd_set(PGD, PMD) +#define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD) static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long addr) Index: linux-work/include/asm-ppc64/pgtable.h === --- linux-work.orig/include/asm-ppc64/pgtable.h 2005-03-07 14:00:11.0 +1100 +++ linux-work/include/asm-ppc64/pgtable.h 2005-03-17 17:00:50.0 +1100 @@ -1,8 +1,6 @@ #ifndef _PPC64_PGTABLE_H #define _PPC64_PGTABLE_H -#include - /* * This file contains the functions and defines necessary to modify and use * the ppc64 hashed page table. @@ -17,6 +15,8 @@ #include #endif /* __ASSEMBLY__ */ +#include + /* PMD_SHIFT determines what a second-level page table entry can map */ #define PMD_SHIFT (PAGE_SHIFT + PAGE_SHIFT - 3) #define PMD_SIZE (1UL << PMD_SHIFT) @@ -216,12 +216,12 @@ #define pmd_page_kernel(pmd) \ (__bpn_to_ba(pmd_val(pmd) >> PMD_TO_PTEPAGE_SHIFT)) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) -#define pgd_set(pgdp, pmdp)(pgd_val(*(pgdp)) = (__ba_to_bpn(pmdp))) -#define pgd_none(pgd) (!pgd_val(pgd)) -#define pgd_bad(pgd) ((pgd_val(pgd)) == 0) -#define pgd_present(pgd) (pgd_val(pgd) != 0UL) -#define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL) -#define pgd_page(pgd) (__bpn_to_ba(pgd_val(pgd))) +#define pud_set(pgdp, pmdp)(pud_val(*(pgdp)) = (__ba_to_bpn(pmdp))) +#define pud_none(pgd) (!pud_val(pgd)) +#define pud_bad(pgd) ((pud_val(pgd)) == 0) +#define pud_present(pgd) (pud_val(pgd) != 0UL) +#define pud_clear(pgdp)(pud_val(*(pgdp)) = 0UL) +#define pud_page(pgd) (__bpn_to_ba(pud_val(pgd))) /* * Find an entry in a page-table-directory. We combine the address region @@ -233,8 +233,8 @@ #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) /* Find an entry in the second-level page table.. */ -#define pmd_offset(dir,addr) \ - ((pmd_t *) pgd_page(*(dir)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) +#define pmd_offset(pudp,addr) \ + ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) /* Find an entry in the third-level page table.. */ #define pte_offset_kernel(dir,addr) \ @@ -552,20 +552,23 @@ static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea) { pgd_t *pg; + pud_t *pu; pmd_t *pm; pte_t *pt = NULL; pte_t pte; pg = pgdir + pgd_index(ea); if (!pgd_none(*pg)) { - - pm = pmd_offset(pg, ea); - if (pmd_present(*pm)) { - pt = pte_offset_kernel(pm, ea); - pte = *pt; - if (!pte_present(pte)) - pt = NULL; - } + pu = pud_offset(pg, ea); + if (!pud_none(*pu)) { + pm = pmd_offset(pu, ea); + if (pmd_present(*pm)) { + pt = pte_offset_kernel(pm, ea); + pte = *pt; + if (!pte_present(pte)) + pt = NULL; + } + } } return pt; Index: linux-work/arch/ppc64/mm/init.c === --- linux-work.orig/arch/ppc64/mm/init.c2005-03-15 11:57:29.0 +1100 +++ linux-work/arch/ppc64/mm/init.c 2005-03-17 17:20:14.0 +1100 @@ -136,14 +136,78 @@ #else +static void unmap_im_area_pte(pmd_t *pmd, unsigned long addr, + unsigned long end) +{ + pte_t *pte; + + pte = pte_offset_kernel(pmd, addr); + do { +
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Hugh Dickins wrote: On Tue, 22 Mar 2005, David S. Miller wrote: On Tue, 22 Mar 2005 19:36:46 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: I notice that although both i386 and sparc64 use pgtable-nopud.h, the i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. This was a dead end. I386 doesn't do anything with pud_clear() in order to work around a chip erratum. IA64 does clear in pud_clear() just like sparc64. My mind kept flipping back and forth on whether it was pud_clear(). I agree, I can't see that it's the issue now. It shouldn't be. In the case that pud is folded, free_pud_range will only call into free_pmd_range once, and that function will loop over the required range of the pud (ie. the pmd). If it then also falls through to pud_clear in that function, it will also fall through to pgd_clear in free_pud_range. So it doesn't _really_ matter which one does the actual clearing in that case. I think David's on the right track - I think there's something a bit wrong at the top. In my reply to Andrew in this thread I posted a patch which may at least get things working... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 14:40:55 -0800 "Luck, Tony" <[EMAIL PROTECTED]> wrote: > Then I don't see how we decide when to clear a pointer at each > level. Are there counters of how many entries are active in each > table at all levels (pgd/pud/pmd/pte)? No, there are no counters. How it works is that it knows the extent in each direction where mappings do not exist. Once we know we have a clear span up to the next PMD_SIZE modulo (and PUD_SIZE and so on and so forth) we know we can liberate the page table chunks covered by such ranges. Say we are unmapping a page at some address. The next VMA in the address space says where the next potentially valid mapping resides. The previous VMA says similarly. If this is the first or last VMA, we use the beginning or end of the virtual address space as our value. Play around with my little simulator I posted, you'll see how it works ;-) Actually, this is the second such simulator you have seen Tony :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 2005-03-22 at 16:37 +, Hugh Dickins wrote: > On Tue, 22 Mar 2005, Andrew Morton wrote: > > > > With these six patches the ppc64 is hitting the BUG in exit_mmap(): > > > > BUG_ON(mm->nr_ptes);/* This is just debugging */ > > > > fairly early in boot. > > So ppc64 is in the same boat as sparc64 (yet ia64 okay so far). > > Sorry, I'm still clueless. > > I cannot see those arches doing pte_allocs outside their vmas, > that of course could cause it. And nr_ptes is initialized to 0 > once by memset and again by assignment, so it should be starting > out even zeroer than most fields. We do funny things in arch/ppc64/mm/init.c in the ioremap_mm, where we don't use VMAs but our own mecanism (yah, ugly, but that's some legacy we have from the original port, though I do intend to change that at one point). > I should probably be paying more attention to the repellent > notion that my code is broken. > > If you and David could try the lame patch below, > it'll at least give us a slight clue of where to be looking - > every mm exiting with nr_ptes 1 means something different from > every mm exiting with nr_ptes -1 means something different from > occasional mms exiting with nr_ptes something positive. > > I'm not sure whether the patch would ever get to show a more > interesting proc name than "?". > > And does memory leak away into lost pagetables if you continue > running, or does it actually carry on running fine, and the > problem appear to be with the BUG_ON itself? > > Thanks, > Hugh > > --- freepgt6/mm/mmap.c2005-03-22 04:28:40.0 + > +++ testing/mm/mmap.c 2005-03-22 15:45:00.0 + > @@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk); > /* Release all mmaps. */ > void exit_mmap(struct mm_struct *mm) > { > + static unsigned long good_mms, bad_mms; > struct mmu_gather *tlb; > struct vm_area_struct *vma = mm->mmap; > unsigned long nr_accounted = 0; > @@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm) > vma = next; > } > > - BUG_ON(mm->nr_ptes);/* This is just debugging */ > + if (mm->nr_ptes && bad_mms < 250) { > + printk(KERN_ERR "exit_mmap: %s nr_ptes %ld good_mms %lu\n", > + current->mm == mm? current->comm: "?", > + (long)mm->nr_ptes, good_mms); > + good_mms = 0; > + bad_mms++; > + } else > + good_mms++; > } > > /* Insert vm structure into process list sorted by address -- Benjamin Herrenschmidt <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 21:51:39 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > I still can't see what's wrong with the code that's already > there. My brain is seizing up, I'm taking a break. Ok, meanwhile I'll do a brain dump of what I think this code should be doing. Let's take an example free_pgd_range() call. Say the address parameters are: addr0x1 end 0xa4000 floor 0x0 ceiling 0xb2000 (This example comes from my exit_mmap() VMA dump earlier in this thread. If you disable the VMA skipping optimization the first call to free_pgd_range() has these parameters.) What ought this free_pgd_range() call do? This range of addresses, from floor to ceiling, is smaller than a PMD_SIZE (which on sparc64 is 1 << 23). Therefore it should clear no PGD or PUD entries. Yet, it does clear them, specifically: free_pgd_range(): 1) mask addr (0x1) to PMD_MASK, addr is now 0 2) addr < floor (0x0) test does not pass 3) mask ceiling (0xb2000) to PMD_MASK, ceiling is now 0 too 4) end - 1 > ceiling - 1 test does not pass 5) addr > end - 1 test does not pass either 6) We now loop one PGDIR_SIZE at a time from addr (0x0) to end (0xa4000), calling down into... free_pud_range(): 1) addr=0, end=0xa4000, floor=0, ceiling=0 2) We loop one PUD_SIZE at a time from addr (0x0) to end (0xa4000), calling down into... free_pmd_range(): 1) addr=0, end=0xa4000, floor=0, ceiling=0 2) We loop one PMD_SIZE at a time from addr (0x0) to end (0xa4000), calling down into... free_pte_range(): And later when we finish the loops in free_pmd_range() and free_pud_range() we do pud_clear() and pgd_clear() respectively, both wrong. The source of the problems seems to be how ceiling began at the top of the call chain as 0xb2000, but when we masked it with PMD_MASK that set it to zero, which means "top of address space" in these functions. That's not what we want. I added a quick hack to the simulator I posted, where we mask ceiling in free_pgd_range(), I do it like this: if (ceiling) { ceiling &= PMD_MASK; if (!ceiling) return; } and things seem to behave. I'll try to analyze things further and test this out on a real kernel, but all of these adjustments at the top of free_pgd_range() really start to look like pure spaghetti. :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
>> be changed to use pgd_addr_end() to gather up all the vma that >> are mapped by a single pgd instead of just spanning out the next >> PMD_SIZE? > >Oh, I don't think so. I suppose it could be done at this level, >but then the lower levels would go back to searching through lots >of unnecessary cachelines to find the significant entries, and >we might as well throw out the whole set of patches (which will >soon happen anyway if we can't find why they're not working!). Then I don't see how we decide when to clear a pointer at each level. Are there counters of how many entries are active in each table at all levels (pgd/pud/pmd/pte)? Consider the case where two or more vmas are mapped at addresses that share a pgd entry, but are far enough apart that you don't want to coalesce them. First call down through that pgd entry must leave the pointer to the pud (or pmd/pte for systems with less levels) so that the next call can still get to the pud/pmd/pte to clear out entries for the other vma. But without counters (at all levels) you don't know when you are all done with a table, or whether you need to leave it for the next call. If you want to pursue this, then there are probably some fields in the page_t for the page that is being used as a pgd/pud/pmd/pte that are available (do all architectures allocate whole pages for all levels of the page tree?) Alternatively you could modify the use of floor/ceiling as they are passed down from the top level to indicate the progressively greater address ranges that have been dealt with ... but I'm not completely convinced that gives you enough information. You would need to do careful extension of the ceiling at each level to make sure that you reach out to the end of of each table at each level to account for gaps between vmas (which would result in no future calldown hitting this part of the table). -Tony - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, David S. Miller wrote: > On Tue, 22 Mar 2005 19:36:46 + (GMT) > Hugh Dickins <[EMAIL PROTECTED]> wrote: > > > I notice that although both i386 and sparc64 use pgtable-nopud.h, the > > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. > > This was a dead end. I386 doesn't do anything with pud_clear() in > order to work around a chip erratum. > > IA64 does clear in pud_clear() just like sparc64. My mind kept flipping back and forth on whether it was pud_clear(). I agree, I can't see that it's the issue now. > I think it's the floor/ceiling stuff. > > At that pud_clear(), we do it if floor-->ceiling (after masking) > covers the whole PUD. Not whether start/end do, which is what > the code sort of does right now. Not an explanation I understood ;) > "start" and "end" say which specific entries we might be purging. Yes. > "floor" and "ceiling" say that once that purging is done, the > extent of the potential address space freed. Well, they specify the limits beyond which we cannot free, because there's other stuff still making use of addresses beyond. > I cooked up a quick patch that changes the logic to: > > floor &= PUD_MASK; > ceiling &= PUD_MASK; > if (floor - 1 >= ceiling - 1) > return; That can't be right. In exit_mmap, for example, floor will be 0 throughout, and the condition will be true for all values of ceiling. > pmd = pmd_offset(pud, start); > pud_clear(pud); > pmd_free_tlb(tlb, pmd); > > and things start to basically work. Because none of your higher level tables are being freed at all: eventually you'll run out of memory. I still can't see what's wrong with the code that's already there. My brain is seizing up, I'm taking a break. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Hugh, I got tired of rebooting just to get address walking traces :-) So I wrote a little simulator. Basically, it's free_pgtables() with the page table pointer stuff ripped out. You run it like this: ./simulator vma_file Where vma_file is a text file composed of lines of the form: START END in hex. The ordered list of VMA's is built from this text file. I enclose a sample file I took from that trace I built earlier today. If you want to get fancy, it's probably very easy to make it so that parse_file() can read in live /proc/${pid}/maps files too :) It is easy to stick in support for platforms other than SPARC64, the machinery is all there, just pop the definitions out from the platforms include/asm-*/*.h and relevant include/asm-generic/*.h header files. Enjoy. :-) #include #include #include #include #include #include #define ARCH_SPARC64 1 #define ARCH_IA64 2 #define ARCH_PPC64 3 #define ARCH ARCH_SPARC64 #if (ARCH == ARCH_SPARC64) #define PAGE_SHIFT 13 #define PMD_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3)) #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) #define PMD_BITS 11 #define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) #define PUD_SHIFT PGDIR_SHIFT #define PTRS_PER_PUD 1 #define PUD_SIZE (1UL << PUD_SHIFT) #define PUD_MASK (~(PUD_SIZE-1)) #define pgd_addr_end(addr, end) \ ({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \ (__boundary - 1 < (end) - 1)? __boundary: (end); \ }) #define pud_addr_end(addr, end) (end) #define pmd_addr_end(addr, end) \ ({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \ (__boundary - 1 < (end) - 1)? __boundary: (end); \ }) #else #error please add definitions for your architecture #endif /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. */ static inline void free_pte_range(unsigned long addr) { } static inline void free_pmd_range(unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { unsigned long next; unsigned long start; start = addr; do { next = pmd_addr_end(addr, end); #if 1 printf("free_pgtables: free_pte_range() addr(%lx) next(%lx) end(%lx)\n", addr, next, end); #endif free_pte_range(addr); } while (addr = next, addr != end); start &= PUD_MASK; ceiling &= PUD_MASK; #if 1 printf("free_pmd_range: start(%lx) ceiling(%lx) ", start, ceiling); #endif if ((start < floor) || (end - 1 > ceiling - 1)) { #if 1 printf("SKIP pud_clear()\n"); #endif return; } #if 1 printf("DO pud_clear()\n"); #endif } static inline void free_pud_range(unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { unsigned long next; unsigned long start; start = addr; do { next = pud_addr_end(addr, end); #if 1 printf("free_pud_range: free_pmd_range(%lx,%lx,%lx,%lx)\n", addr, next, floor, ceiling); #endif free_pmd_range(addr, next, floor, ceiling); } while (addr = next, addr != end); start &= PGDIR_MASK; ceiling &= PGDIR_MASK; #if 1 printf("free_pud_range: start(%lx) ceiling(%lx) ", start, ceiling); #endif if ((start < floor) || (end - 1 > ceiling - 1)) { #if 1 printf("SKIP pgd_clear()\n"); #endif return; } #if 1 printf("DO pgd_clear()\n"); #endif } /* * This function frees user-level page tables of a process. * Unlike other pagetable walks, some memory layouts might give end 0. * Must be called with pagetable lock held. */ static inline void free_pgd_range(unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) { unsigned long next; unsigned long start; addr &= PMD_MASK; if (addr < floor) { addr += PMD_SIZE; if (!addr) return; } ceiling &= PMD_MASK; if (end - 1 > ceiling - 1) end -= PMD_SIZE; if (addr > end - 1) return; start = addr; do { next = pgd_addr_end(addr, end); #if 1 printf("free_pgd_range: free_pud_range(%lx,%lx,%lx,%lx)\n", addr, next, floor, ceiling); #endif free_pud_range(addr, next, floor, ceiling); } while (addr = next, addr != end); if (1 /*!tlb_is_full_mm(tlb)*/) { #if 1 printf("free_pgd_range: Doing flush_tlb_pgtables(%lx,%lx)\n", start, end); #endif /*flush_tlb_pgtables(tlb->mm, start, end);*/ } } struct vm_area_struct { struct vm_area_struct *vm_next; unsigned long vm_start, vm_end; }; void free_pgtables(struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling) { #if 1 struct vm_area_struct *tmp = vma; printf("VMA DUMP: "); while (tmp) { printf("[%lx:%lx] ", tmp->vm_start, tmp->vm_end); tmp = tmp->vm_next; } printf("\n"); #endif while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; /* Optimization: gather nearby vmas into a single call down */ while (next && next->vm_start <= vma->vm_end + 2*PMD_SIZE) { vma
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 19:36:46 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > I notice that although both i386 and sparc64 use pgtable-nopud.h, the > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. Aha! And ppc does as well via asm-generic/4level-fixup.h which is probably why I did it this way as well when I converted sparc64 over to asm-generic/pgtable-nopud.h Hmmm, let me play around with this. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 19:36:46 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > I notice that although both i386 and sparc64 use pgtable-nopud.h, the > i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. This was a dead end. I386 doesn't do anything with pud_clear() in order to work around a chip erratum. IA64 does clear in pud_clear() just like sparc64. I think it's the floor/ceiling stuff. At that pud_clear(), we do it if floor-->ceiling (after masking) covers the whole PUD. Not whether start/end do, which is what the code sort of does right now. "start" and "end" say which specific entries we might be purging. "floor" and "ceiling" say that once that purging is done, the extent of the potential address space freed. I cooked up a quick patch that changes the logic to: floor &= PUD_MASK; ceiling &= PUD_MASK; if (floor - 1 >= ceiling - 1) return; pmd = pmd_offset(pud, start); pud_clear(pud); pmd_free_tlb(tlb, pmd); and things start to basically work. When X started up my machine rebooted, but this was with all the tracing printk()'s enabled so I'm skeptical as to the reason. I'll back out the debugging and play with this some more. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, David S. Miller wrote: > On Tue, 22 Mar 2005 11:21:25 -0800 > "David S. Miller" <[EMAIL PROTECTED]> wrote: > > > I'm trying to analyze my traces some more. > > I think I see what's going wrong. On the first > address space traversal (free_pgd_range()), we > clear out the pgd, even though there are still > more PMD's to process in that PGD. > > So the future loops never do anything since the > PGD is cleared out already. Yes, that's the conclusion I was coming to from your excellent traces. I notice that although both i386 and sparc64 use pgtable-nopud.h, the i386 pud_clear does nothing at all and the sparc64 pud_clear resets to 0. If that really is the problem, well, I get as easily mixed between the levels as the next man, and haven't a clue which is the right way to do it - beyond i386 not seeing these nr_pte bugs. But if this is the issue, I don't see how it's new to my freepgt patches - beyond the fact that they add this BUG_ON consistency check. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 11:21:25 -0800 "David S. Miller" <[EMAIL PROTECTED]> wrote: > I'm trying to analyze my traces some more. I think I see what's going wrong. On the first address space traversal (free_pgd_range()), we clear out the pgd, even though there are still more PMD's to process in that PGD. So the future loops never do anything since the PGD is cleared out already. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 11:01:44 -0800 "David S. Miller" <[EMAIL PROTECTED]> wrote: > Hmmm... Thinking some more, one thing that is unique in the PPC64 and SPARC64 cases is that we are executing primarily 32-bit tasks and in such cases one PGD maps the entire address space. I wonder if the free_pgtables() loops simply aren't coping well with something like that. I'm trying to analyze my traces some more. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Ok, here is a full dump of a free_pgtables() run that fails to clear out all the PMD's. It gets called with this VMA list (each entry is a vm_start/vm_end tuple) [0x0001:0x000a4000] [0x000b2000:0x000b8000] [0x000b8000:0x000de000] [0x7000:0x7001a000] [0x70028000:0x7002a000] [0x7002c000:0x7006a000] [0x7006a000:0x7006c000] [0x7006c000:0x70084000] [0x70084000:0x70088000] [0x70088000:0x70094000] [0x70094000:0x70098000] [0x70098000:0x701da000] [0x701da000:0x701e8000] [0x701e8000:0x701f2000] [0x701f2000:0x701f4000] [0x701f4000:0x701fc000] [0x701fc000:0x70204000] [0x70204000:0x7020c000] [0x7020c000:0x7021e000] [0x7021e000:0x7022c000] [0x7022c000:0x7023] [0x7023:0x70232000] [0x70234000:0x7023e000] [0x7023e000:0x70244000] [0x70244000:0x7024e000] [0x7025:0x7025a000] [0x7025a000:0x7026] [0x7026:0x7026c000] [0xefbfe000:0xefc28000] And then we start to iterate, here is the trace I got: free_pgd_range(addr[0x1000],end[0xde000],floor[0x0],ceiling[0x7000]) free_pud_range(addr[0x0],end[0xde000],floor[0x0],ceiling[0x7000]) free_pmd_range(addr[0],end[0xde000],floor[0x0],ceiling[0x7000]) free_pte_range(addr[0x0],next[0xde000],end[0xde000]) /* nr_ptes-- */ free_pgd_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xefbfe000]) free_pud_range(addr[0x7000],end[0x7026c000],floor[0x0],ceiling[0xef80]) /* does not do free_pmd_range() for some reason) free_pgd_range(addr[0xefbfe000],end[0xefc28000],floor[0x0],ceiling[0x0]) free_pud_range(addr[0xef80],end[0xefc28000],floor[0x0],ceiling[0x0]) /* also do not do free_pmd_range() */ Whoa, how does that work? We are calling free_pgtables() at exit_mmap() time with a 0 floor _and_ ceiling? Oh I see, the tests are against "ceiling - 1". Hmmm... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, Luck, Tony wrote: > >> For example, you may have a single page (start,end) address range > >> to free, but if this is enclosed by a large enough (floor,ceiling) > >> then it may free an entire pgd entry. > >> > >> I assume the intention of the API would be to provide the full > >> pgd width in that case? > > > >Yes, that is what should happen if the full PGD entry is liberated. > > > >Any time page table chunks are liberated, they have to be included > >in the range passed to the flush_tlb_pgtables() call. This now makes it crystal clear to me that Nick's suspicion was right, my start,end to flush_tlb_pgtables is too narrow. I'll take another look there and correct it. > So should this part of Hugh's code: > > /* > * Optimization: gather nearby vmas into one call down > */ > while (next && next->vm_start <= vma->vm_end + > PMD_SIZE > && !is_hugepage_only_range(next->vm_start, > HPAGE_SIZE)){ > vma = next; > next = vma->vm_next; > } > free_pgd_range(tlb, addr, vma->vm_end, > floor, next? next->vm_start: ceiling); > > be changed to use pgd_addr_end() to gather up all the vma that > are mapped by a single pgd instead of just spanning out the next > PMD_SIZE? Oh, I don't think so. I suppose it could be done at this level, but then the lower levels would go back to searching through lots of unnecessary cachelines to find the significant entries, and we might as well throw out the whole set of patches (which will soon happen anyway if we can't find why they're not working!). No, we don't have to pass pgd granularity to flush_tlb_pgtables, we just have to try a bit harder to supply the right span to it, whatever that is. It might be in pmd granularity, it might be in pud granularity (if we freed some at that level), it might be in pgd granularity (if we freed some at that level). > On ia64 we can have a vma big enough to require more than one pgd, but Yes, no problem. > in the case that we span, we won't cross the problematic pgd boundaries > where the holes in the address space are lurking. Yes, it wouldn't be a hole if a vma was allowed into it. I do assume that on all architectures, the peculiar regions (might be holes, might be huge-only regions) are separated from the ordinary ones by pgd boundaries. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 16:37:09 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > If you and David could try the lame patch below, > it'll at least give us a slight clue of where to be looking - > every mm exiting with nr_ptes 1 means something different from > every mm exiting with nr_ptes -1 means something different from > occasional mms exiting with nr_ptes something positive. I didn't use your debugging patch, I used something slightly more sophisticated to simply traces nr_ptes counter bumps then halt the system when the BUG_ON(mm->nr_ptes) triggered so I could analyze things. What happens is (all virtual addresses are PMD_MASK'd): 1) We map 3 PTE tables. 0x 0x7000 0xef80 2) We only unmap one PTE table. 0x 3) Then we of course BUG() because nr_ptes is 2 I'm now adding more debugging to figure out why only the first PTE table is being unmapped. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 15:14:54 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: > For example, you may have a single page (start,end) address range > to free, but if this is enclosed by a large enough (floor,ceiling) > then it may free an entire pgd entry. > > I assume the intention of the API would be to provide the full > pgd width in that case? Yes, that is what should happen if the full PGD entry is liberated. Any time page table chunks are liberated, they have to be included in the range passed to the flush_tlb_pgtables() call. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
>> For example, you may have a single page (start,end) address range >> to free, but if this is enclosed by a large enough (floor,ceiling) >> then it may free an entire pgd entry. >> >> I assume the intention of the API would be to provide the full >> pgd width in that case? > >Yes, that is what should happen if the full PGD entry is liberated. > >Any time page table chunks are liberated, they have to be included >in the range passed to the flush_tlb_pgtables() call. So should this part of Hugh's code: /* * Optimization: gather nearby vmas into one call down */ while (next && next->vm_start <= vma->vm_end + PMD_SIZE && !is_hugepage_only_range(next->vm_start, HPAGE_SIZE)){ vma = next; next = vma->vm_next; } free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); be changed to use pgd_addr_end() to gather up all the vma that are mapped by a single pgd instead of just spanning out the next PMD_SIZE? On ia64 we can have a vma big enough to require more than one pgd, but in the case that we span, we won't cross the problematic pgd boundaries where the holes in the address space are lurking. -Tony - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 06:08:38 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > > It just wants the range of page tables liberated. I guess > > essentially PMD_SIZE is the granularity. > > I _think_ that answer means that my current code is fine in this respect. > But I'm not entirely convinced. Since sparc64 is the only architecture > which implements a flush_tlb_pgtables which actually uses start,end, > we do need to suit your needs there - informed reassurance welcome! Ok. This interface is meant to deal with platforms that virtually map their page tables, usually for faster TLB miss processing. As stated, IA64 does this just as sparc64 does, however they flush their linear page table virtual mappings in a different place. This by definition means that the granularity is PMD_SIZE. That is the smallest chunk of page table, ie. what a pte_t chunk maps. > > It's funny since this code aparently works fine on ia64 which > > is fully 3-level too. Hmm... > > Yes, odd. I'll have to have another think later on. I'll play around with some of the patches you posted today and get back to you. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 05:47:13 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: > > 1) start --> end straddles sparc64 address space hole > > That's an interesting remark. I hadn't noticed the signed long type. > I believe the vma gathering in free_pgtables will have no problem with > that, but what about the old code? What happens if an app does a huge > munmap straddling the address space hole? Or is all the user address > space below the hole? It will BUG(), crap. :( Good catch, I'll fix this. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005, Andrew Morton wrote: > > With these six patches the ppc64 is hitting the BUG in exit_mmap(): > > BUG_ON(mm->nr_ptes);/* This is just debugging */ > > fairly early in boot. So ppc64 is in the same boat as sparc64 (yet ia64 okay so far). Sorry, I'm still clueless. I cannot see those arches doing pte_allocs outside their vmas, that of course could cause it. And nr_ptes is initialized to 0 once by memset and again by assignment, so it should be starting out even zeroer than most fields. I should probably be paying more attention to the repellent notion that my code is broken. If you and David could try the lame patch below, it'll at least give us a slight clue of where to be looking - every mm exiting with nr_ptes 1 means something different from every mm exiting with nr_ptes -1 means something different from occasional mms exiting with nr_ptes something positive. I'm not sure whether the patch would ever get to show a more interesting proc name than "?". And does memory leak away into lost pagetables if you continue running, or does it actually carry on running fine, and the problem appear to be with the BUG_ON itself? Thanks, Hugh --- freepgt6/mm/mmap.c 2005-03-22 04:28:40.0 + +++ testing/mm/mmap.c 2005-03-22 15:45:00.0 + @@ -1896,6 +1896,7 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. */ void exit_mmap(struct mm_struct *mm) { + static unsigned long good_mms, bad_mms; struct mmu_gather *tlb; struct vm_area_struct *vma = mm->mmap; unsigned long nr_accounted = 0; @@ -1931,7 +1932,14 @@ void exit_mmap(struct mm_struct *mm) vma = next; } - BUG_ON(mm->nr_ptes);/* This is just debugging */ + if (mm->nr_ptes && bad_mms < 250) { + printk(KERN_ERR "exit_mmap: %s nr_ptes %ld good_mms %lu\n", + current->mm == mm? current->comm: "?", + (long)mm->nr_ptes, good_mms); + good_mms = 0; + bad_mms++; + } else + good_mms++; } /* Insert vm structure into process list sorted by address - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Andrew Morton wrote: With these six patches the ppc64 is hitting the BUG in exit_mmap(): BUG_ON(mm->nr_ptes);/* This is just debugging */ fairly early in boot. No doubt Hugh will have this fixed before long... but if you have time to spare, you may just try hitting it on the head and making it a bit dumber. It might help show where the problem is. - don't span multiple vmas - don't be so smart about avoiding unfreeable regions I dunno. Maybe it won't help at all. Maybe it will make things worse :P Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c 2005-03-22 23:04:34.0 +1100 +++ linux-2.6/mm/memory.c 2005-03-22 23:11:31.0 +1100 @@ -110,13 +110,18 @@ void pmd_clear_bad(pmd_t *pmd) * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. */ -static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd) +static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) { - struct page *page = pmd_page(*pmd); - pmd_clear(pmd); - pte_free_tlb(tlb, page); - dec_page_state(nr_page_table_pages); - tlb->mm->nr_ptes--; + if (((addr & PMD_MASK) >= floor) + && (end - 1 <= (ceiling & PMD_MASK) - 1)) { + struct page *page = pmd_page(*pmd); + pmd_clear(pmd); + pte_free_tlb(tlb, page); + dec_page_state(nr_page_table_pages); + tlb->mm->nr_ptes--; + } } static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, @@ -133,7 +138,7 @@ static inline void free_pmd_range(struct next = pmd_addr_end(addr, end); if (pmd_none_or_clear_bad(pmd)) continue; - free_pte_range(tlb, pmd); + free_pte_range(tlb, pmd, addr, end, floor, ceiling); } while (pmd++, addr = next, addr != end); start &= PUD_MASK; @@ -190,18 +195,6 @@ void free_pgd_range(struct mmu_gather ** unsigned long next; unsigned long start; - addr &= PMD_MASK; - if (addr < floor) { - addr += PMD_SIZE; - if (!addr) - return; - } - ceiling &= PMD_MASK; - if (end - 1 > ceiling - 1) - end -= PMD_SIZE; - if (addr > end - 1) - return; - start = addr; pgd = pgd_offset((*tlb)->mm, addr); do { @@ -226,14 +219,6 @@ void free_pgtables(struct mmu_gather **t hugetlb_free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { - /* -* Optimization: gather nearby vmas into one call down -*/ - while (next && next->vm_start <= vma->vm_end + PMD_SIZE - && !is_hugepage_only_range(next->vm_start, HPAGE_SIZE)){ - vma = next; - next = vma->vm_next; - } free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); }
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
With these six patches the ppc64 is hitting the BUG in exit_mmap(): BUG_ON(mm->nr_ptes);/* This is just debugging */ fairly early in boot. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 1/5] freepgt: free_pgtables use vma list
Builds clean and boots on ia64. I haven't tried any hugetlb operations on it though. -Tony - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Hugh Dickins wrote: On Mon, 21 Mar 2005, David S. Miller wrote: On Tue, 22 Mar 2005 15:14:54 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: Question, Dave: flush_tlb_pgtables after Hugh's patch is also possibly not being called with enough range to cover all page tables that have been freed. Good question from Nick. For example, you may have a single page (start,end) address range to free, but if this is enclosed by a large enough (floor,ceiling) then it may free an entire pgd entry. I assume the intention of the API would be to provide the full pgd width in that case? It just wants the range of page tables liberated. I guess essentially PMD_SIZE is the granularity. I _think_ that answer means that my current code is fine in this respect. But I'm not entirely convinced. Since sparc64 is the only architecture which implements a flush_tlb_pgtables which actually uses start,end, we do need to suit your needs there - informed reassurance welcome! But Hugh I think you may still be freeing pgd (PGDIR_SIZE) regions that you don't quite cover with flush_tlb_pgtables? I would have thought you'd need something like this: if (!tlb_is_full_mm(tlb)) { /* This is the full range of page tables we could possibly free */ start = max(start & PGDIR_SIZE, (floor + PMD_SIZE - 1) & PMD_SIZE); end = min((end + PGDIR_SIZE - 1) & PGDIR_SIZE, ceiling & PMD_SIZE); flush_tlb_pgtables(tlb->mm, start, end); } But I'll have to go away and look at it more... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Mon, 21 Mar 2005, David S. Miller wrote: > On Tue, 22 Mar 2005 15:14:54 +1100 > Nick Piggin <[EMAIL PROTECTED]> wrote: > > > Question, Dave: flush_tlb_pgtables after Hugh's patch is also > > possibly not being called with enough range to cover all page > > tables that have been freed. Good question from Nick. > > For example, you may have a single page (start,end) address range > > to free, but if this is enclosed by a large enough (floor,ceiling) > > then it may free an entire pgd entry. > > > > I assume the intention of the API would be to provide the full > > pgd width in that case? > > It just wants the range of page tables liberated. I guess > essentially PMD_SIZE is the granularity. I _think_ that answer means that my current code is fine in this respect. But I'm not entirely convinced. Since sparc64 is the only architecture which implements a flush_tlb_pgtables which actually uses start,end, we do need to suit your needs there - informed reassurance welcome! > Anyways, for the record I made it only call flush_tlb_pgtables() > when end > start, Fine. The patch I just sent, moving the tests, is how I'd like it to be fixed finally, but what you've done in the interim should do fine. > but instead of that BUG() I now get the BUG() > on mm->nr_ptes being non-zero at the end of exit_mmap(). And no way would your change be causing this. Oh dear. > Something is up with the floor/ceiling stuff methinks. Seems so. I did have an off-by-one originally, but that version never leaked out. > It's funny since this code aparently works fine on ia64 which > is fully 3-level too. Hmm... Yes, odd. I'll have to have another think later on. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Mon, 21 Mar 2005, David S. Miller wrote: > > flush_tlb_pgtables() on sparc64 has a BUG() check which > is basically: > > BUG((long)start > (long)end); > > This catches two cases of bogus arguments: > > 1) start --> end straddles sparc64 address space hole That's an interesting remark. I hadn't noticed the signed long type. I believe the vma gathering in free_pgtables will have no problem with that, but what about the old code? What happens if an app does a huge munmap straddling the address space hole? Or is all the user address space below the hole? Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
Many thanks for the testing. On Mon, 21 Mar 2005, David S. Miller wrote: > > This adjustment of addr relative to floor is very > strange, it can advance "addr" (and thus "start") > past the end of the VMA we are unmapping. Not strange, it's just trying to skip a pointless iteration. > In fact, it is miraculious that this free_pud_range() > calling loop terminates properly! Actually, it is > no mystery, since the next PGD address is the same > for both the original and adjusted value of "addr". > So the loop terminates after the first iteration. Miraculous indeed, I hadn't realized those do {} while () loops were as robust as that. But it's certainly wrong to have been calling them once in this case, even if they did recover. > Anyways, there's the full analysis, what do you make > of this Hugh? :-) Much better than I deserve. Silly me. This patch should fix it. [PATCH 6/5] freepgt: fix addr,end check Fix placing of free_p?d_range's check for addr against end. Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]> --- freepgt5/mm/memory.c2005-03-22 04:28:40.0 + +++ freepgt6/mm/memory.c2005-03-22 05:11:05.0 + @@ -127,11 +127,6 @@ static inline void free_pmd_range(struct unsigned long next; unsigned long start; - if (end - 1 > ceiling - 1) - end -= PMD_SIZE; - if (addr > end - 1) - return; - start = addr; pmd = pmd_offset(pud, addr); do { @@ -202,7 +197,9 @@ void free_pgd_range(struct mmu_gather ** return; } ceiling &= PMD_MASK; - if (addr > ceiling - 1) + if (end - 1 > ceiling - 1) + end -= PMD_SIZE; + if (addr > end - 1) return; start = addr; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Tue, 22 Mar 2005 15:14:54 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: > Question, Dave: flush_tlb_pgtables after Hugh's patch is also > possibly not being called with enough range to cover all page > tables that have been freed. > > For example, you may have a single page (start,end) address range > to free, but if this is enclosed by a large enough (floor,ceiling) > then it may free an entire pgd entry. > > I assume the intention of the API would be to provide the full > pgd width in that case? It just wants the range of page tables liberated. I guess essentially PMD_SIZE is the granularity. Anyways, for the record I made it only call flush_tlb_pgtables() when end > start, but instead of that BUG() I now get the BUG() on mm->nr_ptes being non-zero at the end of exit_mmap(). Something is up with the floor/ceiling stuff methinks. It's funny since this code aparently works fine on ia64 which is fully 3-level too. Hmm... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Mon, 2005-03-21 at 15:02 -0800, David S. Miller wrote: > Anyways, there's the full analysis, what do you make > of this Hugh? :-) Impressive, and my name isn't even Hugh. Question, Dave: flush_tlb_pgtables after Hugh's patch is also possibly not being called with enough range to cover all page tables that have been freed. For example, you may have a single page (start,end) address range to free, but if this is enclosed by a large enough (floor,ceiling) then it may free an entire pgd entry. I assume the intention of the API would be to provide the full pgd width in that case? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Mon, 21 Mar 2005 14:31:36 -0800 "Luck, Tony" <[EMAIL PROTECTED]> wrote: > Builds clean and boots on ia64. > > I haven't tried any hugetlb operations on it though. It works on ia64 because it doesn't actually do anything in flush_tlb_pgtables(), I bet. Hugh, I know the exact trigger case, it's unmapping a VMA right before the stack segment. So the free_pgtables() call happens with this state: prev->vm_end== 0x70186000 next->vm_start == 0xefab8000 vma->vm_start == 0x70186000 vma->vm_end == 0x70188000 (so we're doing munmap(0x70186000, PAGE_SIZE), the sparc64 stack segment for 32-bit tasks grows down from 0xf000, the bottom of it is at 0xefab8000 at this point in time) So the free_pgtables() call will be with: floor == 0x70186000 ceiling == 0xefab8000 This should be fairly simple, so let's analyze exactly what happens: 1) vma == the munmap() call's VMA next == stack segment VMA, which sits right after "vma" addr == 0x70186000 (base of munmap() area) 2) VMA optimization loop runs: next->vm_start is 0xefab8000 vma->vm_end is 0x70188000 on sparc64 PMD_SIZE is 1UL << 23 or 0x80 therefore vma->vm_end + (2 * PMD_SIZE) is 0x71188000 this is much less than 0xefab8000 so the loop terminates immediately Therefore, next is unchanged. 3) free_pgd_range() is invoked with: addr== 0x70186000 end == 0x70188000 floor == 0x70186000 ceiling == 0xefab8000 4) We mask addr with PMD_MASK (which is 0xff80) This sets addr to 0x7000, which makes it less than floor, therefore addr has PMD_SIZE added to it. Now, addr is 0x7080, this is the source of the problems as this value determines the "start" argument passed to flush_tlb_pgtables(). Note how it is larger than "end". 5) We also mask ceiling with PMD_MASK. This sets ceiling to 0xef80. Now addr is less than or equal to ceiling - 1 so we continue. 6) start is set to addr, which as stated is 0x7080, the free_pud_range() loop is executed 7) start is 0x7080 and end is 0x70188000 and here we have the problem that start > end, flush_tlb_pgtables() is called with the arguments like this and we trigger the aforementioned BUG(). This adjustment of addr relative to floor is very strange, it can advance "addr" (and thus "start") past the end of the VMA we are unmapping. In fact, it is miraculious that this free_pud_range() calling loop terminates properly! Actually, it is no mystery, since the next PGD address is the same for both the original and adjusted value of "addr". So the loop terminates after the first iteration. Anyways, there's the full analysis, what do you make of this Hugh? :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/5] freepgt: free_pgtables use vma list
On Mon, 21 Mar 2005 20:52:44 + (GMT) Hugh Dickins <[EMAIL PROTECTED]> wrote: Hugh, I'm getting some problems on sparc64 here: > +static inline void free_pgd_range(struct mmu_gather *tlb, > + unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > { > pgd_t *pgd; > unsigned long next; > + unsigned long start; > > + addr &= PMD_MASK; > + if (addr < floor) { > + addr += PMD_SIZE; > + if (!addr) > + return; > + } > + ceiling &= PMD_MASK; > + if (addr > ceiling - 1) > + return; > + > + start = addr; > pgd = pgd_offset(tlb->mm, addr); > do { > next = pgd_addr_end(addr, end); > if (pgd_none_or_clear_bad(pgd)) > continue; > - clear_pud_range(tlb, pgd, addr, next); > + free_pud_range(tlb, pgd, addr, next, floor, ceiling); > } while (pgd++, addr = next, addr != end); > + > + if (!tlb_is_full_mm(tlb)) > + flush_tlb_pgtables(tlb->mm, start, end); > +} flush_tlb_pgtables() on sparc64 has a BUG() check which is basically: BUG((long)start > (long)end); This catches two cases of bogus arguments: 1) start --> end straddles sparc64 address space hole 2) start > end With your changes, this triggers on even the first user process execution. Specifically I get the first trap with start=0x7080 and end=0x70188000. (these addresses are in the region where generally 32-bit tasks get their non-fixed mmap() requests satisfied, so these are probably shared library chunks). I think the VMA gathering optimization one level up in free_pgtables() has some logic error in it. Maybe... I'll try to figure out what exactly is going on, but perhaps you can spot it before me. :-) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/