Re: hugetlbfs for ppc440 - kernel BUG
On Tue, Oct 21, 2008 at 03:50:30PM -0700, Satya wrote:
> On Tue, Oct 21, 2008 at 3:46 PM, Satya [EMAIL PROTECTED] wrote:
> > Ben, look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/
> > thanks, ./satya
> > [... earlier thread trimmed ...]
>
> whoops, sorry for top-posting. Here is a patch that worked at that time: http://www-unix.mcs.anl.gov/zeptoos/hugepages/hugetlbpage_44x.patch I didn't follow up after this to get it merged upstream. Also, I don't know if the hugetlb core has changed to deal with PTEs in high memory.

Ok, had a look at this. It's had some tweaks since I last looked at the bluegene hugepage/440 patch. It still has the rather ugly approach of storing the hugepage PTEs always at the bottom level and duplicating them umpteen times (including pointing multiple PMDs at a single PTE page when the hugepage size exceeds the area mapped by a PMD). It also has the most serious bug I remember from the old version: the DIRTY and ACCESSED handling is completely bogus, because it doesn't keep the bits in the many copies of the PTEs in sync.
Between the TLB miss rewrite that's happened in the meantime and my patch to handle these from hugetlb_fault(), it's at least now easier to fix this bug. Also, the patch is arch/ppc based. I'll try to sort this out in the near future. I guess the only big question is whether it's important to support hugepage sizes < 2M. For hugepage sizes >= 2M (16M and 256M) we can just make PMD pointers into hugepage pointers, with the addition of a suitable size field, as we do for 40x. For page sizes < 2M things get more complicated, because we need some sort of second-level hugepage tables (which may or may not be distinct from the ordinary second-level tables).

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: hugetlbfs for ppc440 - kernel BUG
On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
> hello, I am trying to implement hugetlbfs on the IBM Bluegene/L IO node (ppc440) and I have a big problem as well as a few questions to ask the group. I patched a 2.6.21.6 linux kernel (manually) with Edi Shmueli's hugetlbfs implementation (found here: http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did have to make slight changes (described at the end) to make it work. My test program is a shortened version of a Sys V shared memory example described in Documentation/vm/hugetlbpage.txt

Hi !

The patchwork link unfortunately didn't survive the transition to patchwork 2. Do you know what's the status of hugetlb support for 44x ? Is there any plan to release that for upstream inclusion ?

Cheers,
Ben.
Re: hugetlbfs for ppc440 - kernel BUG
On Tue, 2008-10-21 at 15:46 -0700, Satya wrote:
> Ben, look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/

Thanks. What is the status ? Do they work fine ? Are they going to be re-submitted for inclusion ?

Cheers,
Ben.

> [... rest of the quoted thread trimmed ...]
Re: hugetlbfs for ppc440 - kernel BUG
Ben,

Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/

thanks,
./satya

On Tue, Oct 21, 2008 at 1:47 PM, Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
> The patchwork link unfortunately didn't survive the transition to patchwork 2. Do you know what's the status of hugetlb support for 44x ? Is there any plan to release that for upstream inclusion ?
> [... rest of the quoted thread trimmed ...]
Re: hugetlbfs for ppc440 - kernel BUG
On Tue, Oct 21, 2008 at 3:46 PM, Satya [EMAIL PROTECTED] wrote:
> Ben, look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/
> thanks, ./satya
> [... earlier thread trimmed ...]

whoops, sorry for top-posting. Here is a patch that worked at that time: http://www-unix.mcs.anl.gov/zeptoos/hugepages/hugetlbpage_44x.patch I didn't follow up after this to get it merged upstream. Also, I don't know if the hugetlb core has changed to deal with PTEs in high memory.

./satya
Re: hugetlbfs for ppc440 - kernel BUG
On Wed, Oct 22, 2008 at 09:53:13AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2008-10-21 at 15:46 -0700, Satya wrote:
> > Ben, look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/
>
> Thanks. What is the status ? Do they work fine ? Are they going to be re-submitted for inclusion ?

Hrm. Last I looked at the 440 hugepage patches they appeared to have several serious bugs (I was surprised they worked at all). I had meant to fix them up and push, but I never quite got around to it. I'll have at this link later today.

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: hugetlbfs for ppc440 - kernel BUG -- follow up
hello,

Upon investigating the issue below further, I found that pte_alloc_map() calls kmap_atomic(). The allocated PTE page must be unmapped before invoking any function that might sleep, but in this case clear_huge_page() is being called without invoking pte_unmap(). The 'normal' counterpart of hugetlb_no_page() (do_no_page() in mm/memory.c) does call pte_unmap() before calling alloc_page() (which might sleep). So I believe pte_unmap() must be invoked first in hugetlb_no_page(). The problem here, though, is that we do not have a reference to the PMD with which to map the PTE again (using pte_offset_map()). The do_no_page() function does have a pmd_t * parameter, so it can remap the PTE when required. For now, I resolved the problem by expanding the pte_alloc_map() macro by hand and replacing kmap_atomic() with kmap(), although I think that is not the right thing to do. Let me know if my analysis is helping you figure out the problem here. Thanks!

--satya.

On 7/10/07, Satya [EMAIL PROTECTED] wrote:
> hello, I am trying to implement hugetlbfs on the IBM Bluegene/L IO node (ppc440) and I have a big problem as well as a few questions to ask the group. I patched a 2.6.21.6 linux kernel (manually) with Edi Shmueli's hugetlbfs implementation (found here: http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did have to make slight changes (described at the end) to make it work.
> My test program is a shortened version of a Sys V shared memory example described in Documentation/vm/hugetlbpage.txt
> [... BUG trace, questions, patch modifications and test program quoted in full in the original; see the original post below ...]

--
...what's remarkable, is that atoms have assembled into entities which are somehow able to ponder their origins. -- http://cs.uic.edu/~spopuri
Re: hugetlbfs for ppc440 - kernel BUG -- follow up
On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
> Upon investigating the issue below further, I found that pte_alloc_map() calls kmap_atomic(). The allocated PTE page must be unmapped before invoking any function that might sleep, but in this case clear_huge_page() is being called without invoking pte_unmap(). [...]
> For now, I resolved the problem by expanding the pte_alloc_map() macro by hand and replacing kmap_atomic() with kmap(), although I think that is not the right thing to do.

Except that I don't see where pte_alloc_map() has been called beforehand... hugetlb_no_page() is called by hugetlb_fault(), which is called by __handle_mm_fault() with no lock held.

Ben.
Re: hugetlbfs for ppc440 - kernel BUG -- follow up
On 7/17/07, Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
> On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
> > [... analysis of pte_alloc_map()/kmap_atomic() quoted in full in the original; trimmed ...]
>
> Except that I don't see where pte_alloc_map() has been called beforehand... hugetlb_no_page() is called by hugetlb_fault(), which is called by __handle_mm_fault() with no lock held.

The calling sequence is:

  __handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map()

where '->' stands for 'calls'. hugetlb_fault() calls hugetlb_no_page() after returning from huge_pte_alloc(). [huge_pte_alloc() is an arch-specific callback implemented in the patch referred to in my earlier posts.]

Satya.

--
...what's remarkable, is that atoms have assembled into entities which are somehow able to ponder their origins. -- http://cs.uic.edu/~spopuri
Re: hugetlbfs for ppc440 - kernel BUG -- follow up
On Tue, 2007-07-17 at 21:18 -0500, Satya wrote:
> the calling sequence is: __handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map(), where '->' stands for 'calls'. hugetlb_fault() calls hugetlb_no_page() after returning from huge_pte_alloc(). [huge_pte_alloc() is an arch-specific callback implemented in the patch referred to in my earlier posts.]

Ok, so I think the problem might be there. If you look at other implementations of hugetlbfs, such as x86, there is no need to do any mapping in huge_pte_alloc(). Only the PTE pages need to be mapped/unmapped, and the huge pages are stored at the PMD level. You may want to do something similar, and if you need a PTE level for huge pages specifically, then you could do your own allocations there that don't require a mapping.

Ben.
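Ben's suggestion — returning the PMD slot itself from huge_pte_alloc() so nothing ever needs kmap — is roughly what i386 did in that era. The following is a non-compilable pseudocode sketch for orientation only (the function names are real kernel APIs of that period, but the body is a simplification, not the actual 44x fix):

```
/* Pseudocode, modeled on the 2.6.21-era i386 hugetlbpage.c: the PMD
 * slot itself serves as the huge PTE, so no second-level PTE page is
 * allocated and the fault path never enters kmap_atomic() context. */
pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd = pgd_offset(mm, addr);
        pud_t *pud = pud_alloc(mm, pgd, addr);

        if (!pud)
                return NULL;
        /* Cast the PMD slot to a pte_t *: hugetlb_fault() can then
         * sleep freely (e.g. in clear_huge_page()) because nothing
         * was atomically mapped on the way here. */
        return (pte_t *)pmd_alloc(mm, pud, addr);
}
```

With this shape there is no pte_unmap() to forget, which would remove the "scheduling while atomic" BUG reported at the start of the thread.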
hugetlbfs for ppc440 - kernel BUG
hello,

I am trying to implement hugetlbfs on the IBM Bluegene/L IO node (ppc440) and I have a big problem as well as a few questions to ask the group. I patched a 2.6.21.6 linux kernel (manually) with Edi Shmueli's hugetlbfs implementation (found here: http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did have to make slight changes (described at the end) to make it work. My test program is a shortened version of a Sys V shared memory example described in Documentation/vm/hugetlbpage.txt

I get the following kernel BUG when a page fault occurs on a huge page address:

BUG: scheduling while atomic: shmtest2/0x1001/1291
Call Trace:
[CFF0BCE0] [C00084F4] show_stack+0x4c/0x194 (unreliable)
[CFF0BD20] [C01A53C4] schedule+0x664/0x668
[CFF0BD60] [C00175F8] __cond_resched+0x24/0x50
[CFF0BD80] [C01A5A6C] cond_resched+0x50/0x58
[CFF0BD90] [C005A31C] clear_huge_page+0x28/0x174
[CFF0BDC0] [C005B360] hugetlb_no_page+0xb4/0x220
[CFF0BE00] [C005B5BC] hugetlb_fault+0xf0/0xf4
[CFF0BE30] [C0052AC0] __handle_mm_fault+0x3a8/0x3ac
[CFF0BE70] [C00094A0] do_page_fault+0x118/0x428
[CFF0BF40] [C0002360] handle_page_fault+0xc/0x80
BUG: scheduling while atomic: shmtest2/0x1001/1291

Now for my questions:

1. Can the kernel really reschedule in a page fault handler context ?

2. Just to test where this scheduling-while-atomic bug is arising, I put schedule() calls at various places in the path of the stack trace shown above. I found that a call to pte_alloc_map() puts the kernel in a context where it cannot reschedule without throwing up. Here is a trace of what's going on:

  __handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map()

Any call to schedule() before pte_alloc_map() does not throw this error. Well, this might be a flawed experiment, I am no expert kernel hacker. Does this throw any light on the problem?
Here are the modifications I made to Edi's patch (arch/ppc/mm/hugetlbpage.c):

struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
{
	pte_t *pte;
	struct page *page;
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	if (!vma || !is_vm_hugetlb_page(vma))
+		return ERR_PTR(-EINVAL);

	pte = huge_pte_offset(mm, address);
	page = pte_page(*pte);

	return page;
}

+int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+{
+	return 0;
+}

Here is my test program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (16UL*1024*1024)
#define dprintf(x) printf(x)
#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)

int main(void)
{
	int shmid;
	unsigned long i;
	char *shmaddr;

	if ((shmid = shmget(2, LENGTH, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
		perror("shmget");
		exit(1);
	}
	printf("shmid: 0x%x\n", shmid);

	shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
	if (shmaddr == (char *)-1) {
		perror("Shared memory attach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(2);
	}
	printf("shmaddr: %p\n", shmaddr);

	printf("touching a huge page..\n");
	shmaddr[0] = 'a';
	shmaddr[1] = 'b';

	if (shmdt((const void *)shmaddr) != 0) {
		perror("Detach failure");
		shmctl(shmid, IPC_RMID, NULL);
		exit(3);
	}

	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}

thanks!
Satya.
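For anyone reproducing this: Documentation/vm/hugetlbpage.txt describes how to reserve huge pages before running such a test. A typical root-shell setup looks like the following (the page count of 20 is an arbitrary example; the Sys V shmget(SHM_HUGETLB) path used by the test program does not require the hugetlbfs mount, which is only needed for mmap()-based use):

```shell
# Reserve huge pages in the kernel's pool (the count is an example).
echo 20 > /proc/sys/vm/nr_hugepages

# Check the pool; HugePages_Total/Free and Hugepagesize appear here.
grep Huge /proc/meminfo

# Optional: mount hugetlbfs for mmap()-based access to huge pages.
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
```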