Re: hugetlbfs for ppc440 - kernel BUG

2008-10-22 Thread David Gibson
On Tue, Oct 21, 2008 at 03:50:30PM -0700, Satya wrote:
 On Tue, Oct 21, 2008 at 3:46 PM, Satya [EMAIL PROTECTED] wrote:
 
  Ben,
  Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/
 
  thanks,
  ./satya
 
 
  On Tue, Oct 21, 2008 at 1:47 PM, Benjamin Herrenschmidt 
  [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
   hello,
   I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
   (ppc440) and I have a big problem as well as a few questions to ask
   the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
   Shmueli's hugetlbfs implementation (found here:
   http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
   have to make slight changes (described at the end) to make it work.
   My test program is a shortened version of a sys v shared memory
   example described in Documentation/vm/hugetlbpage.txt
 
  Hi !
 
  The patchwork link unfortunately didn't survive the transition to
  patchwork 2.
 
  Do you know what's the status of Hugetlb support for 44x ? Is there any
  plan to release that for upstream inclusion ?
 
  Cheers,
  Ben.
 
 
 
 
 whoops, sorry for top-posting. Here is a patch that worked at that time:
 http://www-unix.mcs.anl.gov/zeptoos/hugepages/hugetlbpage_44x.patch
 
 I didn't follow up after this to get it merged upstream. Also I don't know
 if hugetlb core has changed to deal with PTEs in high memory.

Ok, had a look at this.  It's had some tweaks since I last looked at
the bluegene hugepage/440 patch.  It still has the rather ugly
approach of storing the hugepage PTEs always at the bottom level, and
duplicating them umpteen times (including pointing multiple PMDs at a
single PTE page when the hugepage size exceeds the area mapped by a
PMD).  It also has the most serious bug I remember from the old
version - the DIRTY and ACCESSED handling is completely bogus, because
it doesn't keep the copies of the bits in the many copies of the PTEs
in sync.  Between the TLB miss rewrite that's happened in the meantime
and my patch to handle these from hugetlb_fault() it's at least now
easier to fix this bug.  Also the patch is arch/ppc based.

I'll try to sort this out in the near future.  I guess the only big
question is whether it's important to support hugepage sizes < 2M.  For
hugepage sizes >= 2M (16M and 256M) we can just make PMD pointers into
hugepage pointers with the addition of a suitable size field, as we do
for 40x.  For page sizes < 2M things get more complicated because we
need some sort of second level hugepage tables (which may or may not
be distinct from the ordinary second level tables).
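
To make the "PMD pointer as hugepage pointer" idea a little more
concrete, here is a rough sketch of the kind of encoding and lookup I
mean.  Every name and the bit layout below are invented purely for
illustration (nothing in-tree looks like this), so read it as a sketch
of the idea rather than a proposal:

/* Illustration only: a made-up encoding for a PMD slot that points
 * directly at a huge page instead of at a PTE page. */
#define HUGE_PMD_FLAG        0x1UL   /* hypothetical: marks a huge entry   */
#define HUGE_PMD_SHIFT_MASK  0x3eUL  /* hypothetical: bits 1-5 = log2(size)*/

static inline int pmd_val_is_huge(unsigned long pmdval)
{
    return pmdval & HUGE_PMD_FLAG;
}

static inline unsigned int pmd_val_huge_shift(unsigned long pmdval)
{
    return (pmdval & HUGE_PMD_SHIFT_MASK) >> 1;  /* e.g. 24 for 16M pages */
}

/* Resolve an address through such an entry: the base lives in the high
 * bits (the huge page is aligned, so the low bits are free for flags),
 * and the offset comes straight from the faulting address. */
static unsigned long huge_pmd_resolve(unsigned long pmdval, unsigned long addr)
{
    unsigned int shift = pmd_val_huge_shift(pmdval);
    unsigned long base = pmdval & ~((1UL << shift) - 1);

    return base | (addr & ((1UL << shift) - 1));
}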

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: hugetlbfs for ppc440 - kernel BUG

2008-10-21 Thread Benjamin Herrenschmidt
On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
 hello,
 I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
 (ppc440) and I have a big problem as well as a few questions to ask
 the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
 Shmueli's hugetlbfs implementation (found here:
 http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
 have to make slight changes (described at the end) to make it work.
 My test program is a shortened version of a sys v shared memory
 example described in Documentation/vm/hugetlbpage.txt

Hi !

The patchwork link unfortunately didn't survive the transition to
patchwork 2.

Do you know what's the status of Hugetlb support for 44x ? Is there any
plan to release that for upstream inclusion ?

Cheers,
Ben.




Re: hugetlbfs for ppc440 - kernel BUG

2008-10-21 Thread Benjamin Herrenschmidt
On Tue, 2008-10-21 at 15:46 -0700, Satya wrote:
 Ben,
 Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/

Thanks. What is the status ? Do they work fine ? Are they going to be
re-submitted for inclusion ?

Cheers,
Ben.

 thanks,
 ./satya
 
 On Tue, Oct 21, 2008 at 1:47 PM, Benjamin Herrenschmidt
 [EMAIL PROTECTED] wrote:
 On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
  hello,
  I am trying to implement hugetlbfs on the IBM Bluegene/L IO
 node
  (ppc440) and I have a big problem as well as a few questions
 to ask
  the group. I patched a 2.6.21.6 linux kernel (manually) with
 Edi
  Shmueli's hugetlbfs implementation (found here:
  http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for
 this. I did
  have to make slight changes (described at the end) to make
 it work.
  My test program is a shortened version of a sys v shared
 memory
  example described in Documentation/vm/hugetlbpage.txt
 
 
 Hi !
 
 The patchwork link unfortunately didn't survive the transition
 to
 patchwork 2.
 
 Do you know what's the status of Hugetlb support for 44x ? Is
 there any
 plan to release that for upstream inclusion ?
 
 Cheers,
 Ben.
 
 
 



Re: hugetlbfs for ppc440 - kernel BUG

2008-10-21 Thread Satya
Ben,
Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/

thanks,
./satya

On Tue, Oct 21, 2008 at 1:47 PM, Benjamin Herrenschmidt 
[EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
  hello,
  I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
  (ppc440) and I have a big problem as well as a few questions to ask
  the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
  Shmueli's hugetlbfs implementation (found here:
  http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
  have to make slight changes (described at the end) to make it work.
  My test program is a shortened version of a sys v shared memory
  example described in Documentation/vm/hugetlbpage.txt

 Hi !

 The patchwork link unfortunately didn't survive the transition to
 patchwork 2.

 Do you know what's the status of Hugetlb support for 44x ? Is there any
 plan to release that for upstream inclusion ?

 Cheers,
 Ben.




Re: hugetlbfs for ppc440 - kernel BUG

2008-10-21 Thread Satya
On Tue, Oct 21, 2008 at 3:46 PM, Satya [EMAIL PROTECTED] wrote:

 Ben,
 Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/

 thanks,
 ./satya


 On Tue, Oct 21, 2008 at 1:47 PM, Benjamin Herrenschmidt 
 [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 13:38 -0500, Satya wrote:
  hello,
  I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
  (ppc440) and I have a big problem as well as a few questions to ask
  the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
  Shmueli's hugetlbfs implementation (found here:
  http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
  have to make slight changes (described at the end) to make it work.
  My test program is a shortened version of a sys v shared memory
  example described in Documentation/vm/hugetlbpage.txt

 Hi !

 The patchwork link unfortunately didn't survive the transition to
 patchwork 2.

 Do you know what's the status of Hugetlb support for 44x ? Is there any
 plan to release that for upstream inclusion ?

 Cheers,
 Ben.




whoops, sorry for top-posting. Here is a patch that worked at that time:
http://www-unix.mcs.anl.gov/zeptoos/hugepages/hugetlbpage_44x.patch

I didn't follow up after this to get it merged upstream. Also I don't know
if hugetlb core has changed to deal with PTEs in high memory.

./satya

Re: hugetlbfs for ppc440 - kernel BUG

2008-10-21 Thread David Gibson
On Wed, Oct 22, 2008 at 09:53:13AM +1100, Benjamin Herrenschmidt wrote:
 On Tue, 2008-10-21 at 15:46 -0700, Satya wrote:
  Ben,
  Look here: http://www-unix.mcs.anl.gov/zeptoos/hugepages/
 
 Thanks. What is the status ? Do they work fine ? Are they going to be
 re-submitted for inclusion ?

Hrm.  Last I looked at the 440 hugepage patches they appeared to have
several serious bugs (I was surprised they worked at all).  I had
meant to fix them up and push, but I never quite got around to it.
I'll have at this link later today.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: hugetlbfs for ppc440 - kernel BUG -- follow up

2007-07-17 Thread Satya
hello,

Upon investigating the below issue further, I found that
pte_alloc_map() calls kmap_atomic. The allocated pte page must be
unmapped before invoking any function that might_sleep.

In this case clear_huge_page() is being called without invoking
pte_unmap(). The 'normal' counterpart of hugetlb_no_page (which is
do_no_page() in mm/memory.c) does call pte_unmap() before calling
alloc_page() (which might sleep).

So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
But the problem here is, we do not have a reference to the pmd to map
the pte again (using pte_offset_map()). The do_no_page() function does
have a pmd_t* parameter, so it can remap the pte when required.

For now, I resolved the problem by expanding the pte_alloc_map() macro
by hand and replacing kmap_atomic with kmap(), although I think it is
not the right thing to do.
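
For reference, the hand-expanded version looks roughly like the sketch
below; the helper names follow the 2.6.21 sources as far as I remember
them and the function name is made up, so treat the details as
assumptions rather than the exact change I made:

/* Sketch of the workaround (not the exact patch): do the same steps as
 * pte_alloc_map(), but map the PTE page with kmap() instead of
 * kmap_atomic(), so the mapping survives a sleep in clear_huge_page().
 * The caller must then kunmap(pmd_page(*pmd)) instead of pte_unmap().
 * Needs <linux/mm.h> and <linux/highmem.h>. */
static pte_t *huge_pte_alloc_sleepable(struct mm_struct *mm, pmd_t *pmd,
                                       unsigned long addr)
{
    if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
        return NULL;

    /* kmap() may sleep, but the mapping stays valid until kunmap(). */
    return (pte_t *)kmap(pmd_page(*pmd)) + pte_index(addr);
}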

Let me know if my analysis is helping you figure out the problem here. Thanks!

--satya.

On 7/10/07, Satya [EMAIL PROTECTED] wrote:
 hello,
 I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
 (ppc440) and I have a big problem as well as a few questions to ask
 the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
 Shmueli's hugetlbfs implementation (found here:
 http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
 have to make slight changes (described at the end) to make it work.
 My test program is a shortened version of a sys v shared memory
 example described in Documentation/vm/hugetlbpage.txt

 I get the following kernel BUG when a page fault occurs on a huge page 
 address:
 BUG: scheduling while atomic: shmtest2/0x1001/1291
 Call Trace:
 [CFF0BCE0] [C00084F4] show_stack+0x4c/0x194 (unreliable)
  [CFF0BD20] [C01A53C4] schedule+0x664/0x668
 [CFF0BD60] [C00175F8] __cond_resched+0x24/0x50
 [CFF0BD80] [C01A5A6C] cond_resched+0x50/0x58
 [CFF0BD90] [C005A31C] clear_huge_page+0x28/0x174
 [CFF0BDC0] [C005B360] hugetlb_no_page+0xb4/0x220
 [CFF0BE00] [C005B5BC] hugetlb_fault+0xf0/0xf4
 [CFF0BE30] [C0052AC0] __handle_mm_fault+0x3a8/0x3ac
 [CFF0BE70] [C00094A0] do_page_fault+0x118/0x428
 [CFF0BF40] [C0002360] handle_page_fault+0xc/0x80
 BUG: scheduling while atomic: shmtest2/0x1001/1291

 Now for my questions:

 1. Can the kernel really reschedule in a page fault handler context ?

 2. Just to test where this 'scheduling while atomic' bug is arising, I
 put schedule() calls at various places in the path of the stack trace
 shown above.
 I found that a call to pte_alloc_map() puts the kernel in a context
 where it cannot reschedule without throwing up. Here is a trace of
 what's going on:

 __handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map()

 Any call to schedule() before pte_alloc_map() does not throw this
 error. Well, this might be a flawed experiment, I am no expert kernel
 hacker. Does this throw any light on the problem?

 Here are the modifications I made to Edi's patch:

 arch/ppc/mm/hugetlbpage.c
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
   pte_t *pte;
   struct page *page;
 +  struct vm_area_struct *vma;
 +
 +  vma = find_vma(mm, address);
 +  if (!vma || !is_vm_hugetlb_page(vma))
 +    return ERR_PTR(-EINVAL);

   pte = huge_pte_offset(mm, address);
   page = pte_page(*pte);
   return page;
 }

 +int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 +{
 +  return 0;
 +}

 Here is my test program:

 #include <stdlib.h>
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/ipc.h>
 #include <sys/shm.h>
 #include <sys/mman.h>

 #ifndef SHM_HUGETLB
 #define SHM_HUGETLB 04000
 #endif

 #define LENGTH (16UL*1024*1024)

 #define dprintf(x)  printf(x)

 #define ADDR (void *)(0x0UL)
 #define SHMAT_FLAGS (0)

 int main(void)
 {
     int shmid;
     unsigned long i;    /* unused in this shortened example */
     char *shmaddr;

     if ((shmid = shmget(2, LENGTH,
                         SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
         perror("shmget");
         exit(1);
     }
     printf("shmid: 0x%x\n", shmid);

     shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
     if (shmaddr == (char *)-1) {
         perror("Shared memory attach failure");
         shmctl(shmid, IPC_RMID, NULL);
         exit(2);
     }
     printf("shmaddr: %p\n", shmaddr);
     printf("touching a huge page..\n");

     shmaddr[0] = 'a';
     shmaddr[1] = 'b';

     if (shmdt((const void *)shmaddr) != 0) {
         perror("Detach failure");
         shmctl(shmid, IPC_RMID, NULL);
         exit(3);
     }

     shmctl(shmid, IPC_RMID, NULL);

     return 0;
 }

 thanks!
 Satya.



-- 
...what's remarkable, is that atoms have assembled into entities which
are somehow able to ponder their origins.
--
http://cs.uic.edu/~spopuri


Re: hugetlbfs for ppc440 - kernel BUG -- follow up

2007-07-17 Thread Benjamin Herrenschmidt
On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
 hello,
 
 Upon investigating the below issue further, I found that
 pte_alloc_map() calls kmap_atomic. The allocated pte page must be
 unmapped before invoking any function that might_sleep.
 
 In this case clear_huge_page() is being called without invoking
 pte_unmap(). The 'normal' counterpart of hugetlb_no_page (which is
 do_no_page() in mm/memory.c) does call pte_unmap() before calling
 alloc_page() (which might sleep).
 
 So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
 But the problem here is, we do not have a reference to the pmd to map
 the pte again (using pte_offset_map()). The do_no_page() function does
 have a pmd_t* parameter, so it can remap the pte when required.
 
 For now, I resolved the problem by expanding the pte_alloc_map() macro
 by hand and replacing kmap_atomic with kmap(), although I think it is
 not the right thing to do.
 
 Let me know if my analysis is helping you figure out the problem here. Thanks!

Except that I don't see where pte_alloc_map() has been called
beforehand... hugetlb_no_page() is called by hugetlb_fault() which is called
by __handle_mm_fault(), with no lock held.

Ben.




Re: hugetlbfs for ppc440 - kernel BUG -- follow up

2007-07-17 Thread Satya
On 7/17/07, Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:
 On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
  hello,
 
  Upon investigating the below issue further, I found that
  pte_alloc_map() calls kmap_atomic. The allocated pte page must be
  unmapped before invoking any function that might_sleep.
 
  In this case clear_huge_page() is being called without invoking
  pte_unmap(). The 'normal' counterpart of hugetlb_no_page (which is
  do_no_page() in mm/memory.c) does call pte_unmap() before calling
  alloc_page() (which might sleep).
 
  So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
  But the problem here is, we do not have a reference to the pmd to map
  the pte again (using pte_offset_map()). The do_no_page() function does
  have a pmd_t* parameter, so it can remap the pte when required.
 
  For now, I resolved the problem by expanding the pte_alloc_map() macro
  by hand and replacing kmap_atomic with kmap(), although I think it is
  not the right thing to do.
 
  Let me know if my analysis is helping you figure out the problem here. 
  Thanks!

 Except that I don't see where pte_alloc_map() has been called before
 hand... hugetlb_no_page() is called by hugetlb_fault() which is called
 by __handle_mm_fault(), with no lock held.


the calling sequence is :

__handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map()

where -> stands for 'calls'.

hugetlb_fault() calls hugetlb_no_page() after returning from huge_pte_alloc().

[huge_pte_alloc() is an arch specific call back implemented in the
patch referred to in my earlier posts]

Satya.

 Ben.





-- 
...what's remarkable, is that atoms have assembled into entities which
are somehow able to ponder their origins.
--
http://cs.uic.edu/~spopuri


Re: hugetlbfs for ppc440 - kernel BUG -- follow up

2007-07-17 Thread Benjamin Herrenschmidt
On Tue, 2007-07-17 at 21:18 -0500, Satya wrote:
 the calling sequence is :
 
 __handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() ->
 pte_alloc_map()
 
 where -> stands for 'calls'.
 
 hugetlb_fault() calls hugetlb_no_page() after returning from
 huge_pte_alloc().
 
 [huge_pte_alloc() is an arch specific call back implemented in the
 patch referred to in my earlier posts]

Ok, so I think the problem might be there. If you look at other
implementations of hugetlbfs, such as x86, there is no need to do any
mapping in huge_pte_alloc(). Only the PTE pages can be mapped/unmapped
and the huge pages are stored at the PMD level. You may want to do
something similar, and if you need a PTE level for huge pages
specifically, then you could do your own allocations there that don't
require a mapping.
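
For comparison, the i386 huge_pte_alloc() of this era is roughly the
following (reproduced from memory, so take it as a sketch rather than a
verbatim copy):

/* PMD-level approach: the "huge PTE" is just the PMD entry itself, so
 * there is never a separate PTE page to kmap. */
pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);
    pud_t *pud = pud_alloc(mm, pgd, addr);

    if (!pud)
        return NULL;
    return (pte_t *)pmd_alloc(mm, pud, addr);
}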

Ben.





hugetlbfs for ppc440 - kernel BUG

2007-07-10 Thread Satya
hello,
I am trying to implement hugetlbfs on the IBM Bluegene/L IO node
(ppc440) and I have a big problem as well as a few questions to ask
the group. I patched a 2.6.21.6 linux kernel (manually) with Edi
Shmueli's hugetlbfs implementation (found here:
http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
have to make slight changes (described at the end) to make it work.
My test program is a shortened version of a sys v shared memory
example described in Documentation/vm/hugetlbpage.txt

I get the following kernel BUG when a page fault occurs on a huge page address:
BUG: scheduling while atomic: shmtest2/0x1001/1291
Call Trace:
[CFF0BCE0] [C00084F4] show_stack+0x4c/0x194 (unreliable)
 [CFF0BD20] [C01A53C4] schedule+0x664/0x668
[CFF0BD60] [C00175F8] __cond_resched+0x24/0x50
[CFF0BD80] [C01A5A6C] cond_resched+0x50/0x58
[CFF0BD90] [C005A31C] clear_huge_page+0x28/0x174
[CFF0BDC0] [C005B360] hugetlb_no_page+0xb4/0x220
[CFF0BE00] [C005B5BC] hugetlb_fault+0xf0/0xf4
[CFF0BE30] [C0052AC0] __handle_mm_fault+0x3a8/0x3ac
[CFF0BE70] [C00094A0] do_page_fault+0x118/0x428
[CFF0BF40] [C0002360] handle_page_fault+0xc/0x80
BUG: scheduling while atomic: shmtest2/0x1001/1291

Now for my questions:

1. Can the kernel really reschedule in a page fault handler context ?

2. Just to test where this 'scheduling while atomic' bug is arising, I
put schedule() calls at various places in the path of the stack trace
shown above.
I found that a call to pte_alloc_map() puts the kernel in a context
where it cannot reschedule without throwing up. Here is a trace of
what's going on:

__handle_mm_fault() -> hugetlb_fault() -> huge_pte_alloc() -> pte_alloc_map()

Any call to schedule() before pte_alloc_map() does not throw this
error. Well, this might be a flawed experiment, I am no expert kernel
hacker. Does this throw any light on the problem?
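
My current guess (please correct me if the highmem details are wrong)
is that the atomic context comes from the kmap_atomic() hidden inside
pte_alloc_map()/pte_offset_map().  A minimal sketch of the suspected
sequence, with a made-up function demo() just to show the ordering:

/* Fragment for illustration; would sit in kernel context.  Assumes the
 * CONFIG_HIGHPTE-style behaviour where kmap_atomic() bumps the preempt
 * count, so any cond_resched() before kunmap_atomic() trips the check. */
#include <linux/highmem.h>
#include <linux/sched.h>

static void demo(struct page *pte_page)
{
    void *pte = kmap_atomic(pte_page, KM_PTE0);  /* preempt_disable()  */

    cond_resched();  /* "scheduling while atomic" fires here when a
                      * reschedule is pending, as in clear_huge_page() */

    kunmap_atomic(pte, KM_PTE0);                 /* preempt_enable()   */
}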

Here are the modifications I made to Edi's patch:

arch/ppc/mm/hugetlbpage.c
struct page *
follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
{
  pte_t *pte;
  struct page *page;
+  struct vm_area_struct *vma;
+
+  vma = find_vma(mm, address);
+  if (!vma || !is_vm_hugetlb_page(vma))
+    return ERR_PTR(-EINVAL);

  pte = huge_pte_offset(mm, address);
  page = pte_page(*pte);
  return page;
}

+int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+{
+  return 0;
+}

Here is my test program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define LENGTH (16UL*1024*1024)

#define dprintf(x)  printf(x)

#define ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)

int main(void)
{
    int shmid;
    unsigned long i;    /* unused in this shortened example */
    char *shmaddr;

    if ((shmid = shmget(2, LENGTH,
                        SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
        perror("shmget");
        exit(1);
    }
    printf("shmid: 0x%x\n", shmid);

    shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
    if (shmaddr == (char *)-1) {
        perror("Shared memory attach failure");
        shmctl(shmid, IPC_RMID, NULL);
        exit(2);
    }
    printf("shmaddr: %p\n", shmaddr);
    printf("touching a huge page..\n");

    shmaddr[0] = 'a';
    shmaddr[1] = 'b';

    if (shmdt((const void *)shmaddr) != 0) {
        perror("Detach failure");
        shmctl(shmid, IPC_RMID, NULL);
        exit(3);
    }

    shmctl(shmid, IPC_RMID, NULL);

    return 0;
}

thanks!
Satya.