[RFC] [PATCH 0/5 V2] Huge page backed user-space stacks

2008-07-28 Thread Eric Munson
Certain workloads benefit if their data or text segments are backed by
huge pages. The stack is no exception to this rule, but there is currently no
mechanism that reliably allows a stack to be backed by huge pages.  Doing this
from userspace is excessively messy and has some awkward restrictions,
particularly on POWER, where 256MB of address space is wasted if the stack is
set up there.

This patch series introduces a personality flag that indicates the kernel
should set up the stack as a hugetlbfs-backed region. A userspace utility
may set this flag and then exec a process whose stack is to be backed by
hugetlb pages.
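
For illustration, such a launcher could be as simple as the sketch below.
This is not part of the series; the HUGETLB_STACK value is copied by hand
from patch 1/5 because no exported userspace header carries it.

/*
 * Hypothetical launcher, not part of this series.  HUGETLB_STACK is
 * assumed from patch 1/5; adjust the value if the final flag differs.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/personality.h>

#define HUGETLB_STACK 0x0020000	/* assumed value, see patch 1/5 */

int main(int argc, char *argv[])
{
	unsigned long persona;

	if (argc < 2) {
		fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
		return 1;
	}

	persona = personality(0xffffffff);	/* query current persona */
	if (personality(persona | HUGETLB_STACK) < 0) {
		perror("personality");
		return 1;
	}

	execvp(argv[1], &argv[1]);	/* the flag survives the exec */
	perror("execvp");
	return 1;
}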

Eric Munson (5):
  Align stack boundaries based on personality
  Add shared and reservation control to hugetlb_file_setup
  Split boundary checking from body of do_munmap
  Build hugetlb backed process stacks
  [PPC] Setup stack memory segment for hugetlb pages

 arch/powerpc/mm/hugetlbpage.c |6 +
 arch/powerpc/mm/slice.c   |   11 ++
 fs/exec.c |  209 ++---
 fs/hugetlbfs/inode.c  |   52 +++
 include/asm-powerpc/hugetlb.h |3 +
 include/linux/hugetlb.h   |   22 -
 include/linux/mm.h|1 +
 include/linux/personality.h   |3 +
 ipc/shm.c |2 +-
 mm/mmap.c |   11 ++-
 10 files changed, 284 insertions(+), 36 deletions(-)

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH 4/5 V2] Build hugetlb backed process stacks

2008-07-28 Thread Eric Munson
This patch allows a process's stack to be backed by huge pages on request.
The personality flag defined in a previous patch should be set before
exec is called for the target process to use a huge page backed stack.

When the hugetlb file is set up to back the stack, it is sized to fit the
stack-size ulimit, or 256 MB if the ulimit is unlimited.  The GROWSUP and
GROWSDOWN VM flags are turned off because a hugetlb-backed vma is not
resizable, so it is appropriately sized when created.  When a process
exceeds the stack size it receives a segfault, just as it would if it
exceeded the ulimit.
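
The sizing rule amounts to something like the following sketch (names taken
from this patch; the helper name and exact shape are illustrative, not the
literal patch code):

/*
 * Sketch of the sizing rule described above, using this patch's names
 * (HUGE_STACK_MAX, HPAGE_SIZE); not the literal kernel code.  A result
 * of 0 means the rlimit is smaller than one huge page, and the stack
 * falls back to normal pages.
 */
static unsigned int stack_hpages_from_rlimit(unsigned long rlim_cur)
{
	unsigned long size;

	if (rlim_cur == RLIM_INFINITY)
		size = HUGE_STACK_MAX;		/* 256 MB cap when unlimited */
	else
		size = rlim_cur;

	return size / HPAGE_SIZE;
}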

Certain architectures also require special setup for a memory region before
huge pages can be used in that region.  This patch defines a function with
__attribute__ ((weak)) set that can be defined by these architectures to
do any necessary setup.  If it exists, it will be called right before the
hugetlb file is mmapped.
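
As a rough illustration of how such a weak hook fits together (an assumption
about the default body, which is not shown in this excerpt):

/*
 * Illustration of the weak-hook arrangement only; the exact default
 * body is not visible in this excerpt.  A no-op definition marked
 * __attribute__((weak)) is overridden at link time by an architecture
 * that supplies a strong definition (as the powerpc patch 5/5 does).
 */
#include <linux/mm.h>

void __attribute__((weak)) hugetlb_mm_setup(struct mm_struct *mm,
					    unsigned long addr,
					    unsigned long len)
{
	/* Generic default: no architecture-specific preparation needed. */
}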

Signed-off-by: Eric Munson [EMAIL PROTECTED]

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add comment about not padding huge stacks
Break personality_page_align helper and personality flag into separate patch
Add move_to_huge_pages function that moves the stack onto huge pages
Add hugetlb_mm_setup weak function for archs that require special setup to
 use hugetlb pages
Rebase to 2.6.26-rc8-mm1

 fs/exec.c   |  194 ---
 include/linux/hugetlb.h |5 +
 2 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c99ba24..bf9ead2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -50,6 +50,7 @@
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
 #include <linux/hugetlb.h>
+#include <linux/mman.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -59,6 +60,8 @@
 #include <linux/kmod.h>
 #endif
 
+#define HUGE_STACK_MAX (256*1024*1024)
+
 #ifdef __alpha__
 /* for /sbin/loader handling in search_binary_handler() */
 #include <linux/a.out.h>
@@ -189,7 +192,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		return NULL;
 
 	if (write) {
-		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
+		/*
+		 * Args are always placed at the high end of the stack space
+		 * so this calculation will give the proper size and it is
+		 * compatible with huge page stacks.
+		 */
+		unsigned long size = bprm->vma->vm_end - pos;
struct rlimit *rlim;
 
/*
@@ -255,7 +263,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 * configured yet.
 */
 	vma->vm_end = STACK_TOP_MAX;
-	vma->vm_start = vma->vm_end - PAGE_SIZE;
+	if (current->personality & HUGETLB_STACK)
+		vma->vm_start = vma->vm_end - HPAGE_SIZE;
+	else
+		vma->vm_start = vma->vm_end - PAGE_SIZE;
 
 	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
@@ -574,6 +585,156 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
return 0;
 }
 
+static struct file *hugetlb_stack_file(int stack_hpages)
+{
+   struct file *hugefile = NULL;
+
+	if (!stack_hpages) {
+		set_personality(current->personality & (~HUGETLB_STACK));
+		printk(KERN_DEBUG
+			"Stack rlimit set too low for huge page backed stack.\n");
+		return NULL;
+	}
+
+	hugefile = hugetlb_file_setup(HUGETLB_STACK_FILE,
+			HPAGE_SIZE * stack_hpages,
+			HUGETLB_PRIVATE_INODE);
+	if (unlikely(IS_ERR(hugefile))) {
+		/*
+		 * If huge pages are not available for this stack, fall
+		 * back to normal pages for execution instead of
+		 * failing.
+		 */
+		printk(KERN_DEBUG
+			"Huge page backed stack unavailable for process %lu.\n",
+			(unsigned long)current->pid);
+		set_personality(current->personality & (~HUGETLB_STACK));
+		return NULL;
+	}
+	return hugefile;
+}
+
+static int move_to_huge_pages(struct linux_binprm *bprm,
+   struct vm_area_struct *vma, unsigned long shift)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *new_vma;
+	unsigned long old_end = vma->vm_end;
+	unsigned long old_start = vma->vm_start;
+	unsigned long new_end = old_end - shift;
+	unsigned long new_start, length;
+	unsigned long arg_size = new_end - bprm->p;
+	unsigned long flags = vma->vm_flags;
+   struct file *hugefile = NULL;
+   unsigned int stack_hpages = 0;
+   struct page **from_pages = NULL;
+   struct page **to_pages = NULL;
+   unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
+   int ret;
+   int i;
+
+#ifdef

[PATCH 1/5 V2] Align stack boundaries based on personality

2008-07-28 Thread Eric Munson
This patch adds a personality flag that requests hugetlb pages be used for
a process's stack.  It adds a helper function that chooses the proper ALIGN
macro based on the process personality, and calls this function from
setup_arg_pages when aligning the stack address.

Signed-off-by: Andy Whitcroft [EMAIL PROTECTED]
Signed-off-by: Eric Munson [EMAIL PROTECTED]

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Rebase to 2.6.26-rc8-mm1

 fs/exec.c   |   15 ++-
 include/linux/hugetlb.h |3 +++
 include/linux/personality.h |3 +++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index af9b29c..c99ba24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/hugetlb.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -155,6 +156,18 @@ exit:
goto out;
 }
 
+static unsigned long personality_page_align(unsigned long addr)
+{
+	if (current->personality & HUGETLB_STACK)
+#ifdef CONFIG_STACK_GROWSUP
+		return HPAGE_ALIGN(addr);
+#else
+		return addr & HPAGE_MASK;
+#endif
+
+   return PAGE_ALIGN(addr);
+}
+
 #ifdef CONFIG_MMU
 
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -596,7 +609,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	bprm->p = vma->vm_end - stack_shift;
 #else
stack_top = arch_align_stack(stack_top);
-   stack_top = PAGE_ALIGN(stack_top);
+   stack_top = personality_page_align(stack_top);
 	stack_shift = vma->vm_end - stack_top;
 
 	bprm->p -= stack_shift;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9a71d4c..eed37d7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,6 +95,9 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK PAGE_MASK   /* Keep the compiler happy */
 #define HPAGE_SIZE PAGE_SIZE
+
+/* to align the pointer to the (next) huge page boundary */
+#define HPAGE_ALIGN(addr)  ALIGN(addr, HPAGE_SIZE)
 #endif
 
 #endif /* !CONFIG_HUGETLB_PAGE */
diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..2bb0f95 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -22,6 +22,9 @@ extern int	__set_personality(unsigned long);
  * These occupy the top three bytes.
  */
 enum {
+	HUGETLB_STACK =		0x0020000,	/* Attempt to use hugetlb pages
+						 * for the process stack
+						 */
 	ADDR_NO_RANDOMIZE =	0x0040000,	/* disable randomization of VA space */
 	FDPIC_FUNCPTRS =	0x0080000,	/* userspace function ptrs point to descriptors
 						 * (signal handling)
-- 
1.5.6.1



[PATCH 3/5] Split boundary checking from body of do_munmap

2008-07-28 Thread Eric Munson
Currently do_munmap pre-checks the unmapped address range against the
valid address range for the process.  However, during initial setup the
stack may actually be outside this range; in particular, it may initially
be placed at the 64-bit stack address and later moved to the normal 32-bit
stack location.  In a later patch we will want to unmap the stack as part
of relocating it into huge pages.

This patch moves the bulk of do_munmap into __do_munmap, which is not
protected by the boundary checking.  When an area that would normally fail
these checks needs to be unmapped (e.g. unmapping a stack that was set up
at the 64-bit TASK_SIZE for a 32-bit process), __do_munmap should be called
directly.  do_munmap continues to do the boundary checking and calls
__do_munmap as appropriate.
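
For example, a caller relocating a stack that still sits above the 32-bit
TASK_SIZE would use the unchecked helper directly, along these lines (a
sketch, not code from this series):

	/*
	 * Sketch of the intended call pattern: unmap the old stack even
	 * though it lies above the 32-bit TASK_SIZE, which do_munmap()
	 * would reject.
	 */
	ret = __do_munmap(mm, old_start, old_end - old_start);
	if (ret)
		return ret;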

Signed-off-by: Eric Munson [EMAIL PROTECTED]

---
Based on 2.6.26-rc8-mm1

 include/linux/mm.h |1 +
 mm/mmap.c  |   11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4eeb3c..59c6f89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,7 @@ out:
return ret;
 }
 
+extern int __do_munmap(struct mm_struct *, unsigned long, size_t);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 extern unsigned long do_brk(unsigned long, unsigned long);
diff --git a/mm/mmap.c b/mm/mmap.c
index 5b62e5d..4e56369 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1881,17 +1881,24 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	if (start > TASK_SIZE || len > TASK_SIZE-start)
+   return -EINVAL;
+   return __do_munmap(mm, start, len);
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge [EMAIL PROTECTED]
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 {
unsigned long end;
struct vm_area_struct *vma, *prev, *last;
 
-	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+	if (start & ~PAGE_MASK)
return -EINVAL;
 
if ((len = PAGE_ALIGN(len)) == 0)
-- 
1.5.6.1



[PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup

2008-07-28 Thread Eric Munson
There are two kinds of shared hugetlbfs mappings:
   1. using the kernel-internal vfsmount, as ipc/shm.c and shmctl() do
   2. mmap() of /hugetlbfs/file with MAP_SHARED

There is one kind of private mapping: mmap() of /hugetlbfs/file with
MAP_PRIVATE

This patch adds a second class of private hugetlb-backed mapping.  But
we do it by sharing code with the ipc shm.  This is mostly because we
need to do our stack setup at execve() time and can't go opening files
from hugetlbfs.  The kernel-internal vfsmount for shm lets us get around
this.  We truly want anonymous memory, but MAP_PRIVATE is close enough
for now.

Currently, if the mapping on an internal mount is larger than a single
huge page, one page is allocated, one is reserved, and the rest are
faulted as needed.  For hugetlb backed stacks we do not want any
reserved pages.  This patch gives the caller of hugetlb_file_setup the
ability to control this behavior by specifying flags for private inodes
and page reservations.

Signed-off-by: Eric Munson [EMAIL PROTECTED]

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Add creat_flags to struct hugetlbfs_inode_info
Check if space should be reserved in hugetlbfs_file_mmap
Rebase to 2.6.26-rc8-mm1

 fs/hugetlbfs/inode.c|   52 ++
 include/linux/hugetlb.h |   18 ---
 ipc/shm.c   |2 +-
 3 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dbd01d2..2e960d6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -92,7 +92,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
+	vma->vm_flags |= VM_HUGETLB;
 	vma->vm_ops = &hugetlb_vm_ops;
 
 	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
@@ -106,10 +106,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	ret = -ENOMEM;
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	if (hugetlb_reserve_pages(inode,
+	if (HUGETLBFS_I(inode)->creat_flags & HUGETLB_RESERVE) {
+		vma->vm_flags |= VM_RESERVED;
+		if (hugetlb_reserve_pages(inode,
 			vma->vm_pgoff >> huge_page_order(h),
 			len >> huge_page_shift(h), vma))
-		goto out;
+			goto out;
+	}
 
 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
@@ -496,7 +499,8 @@ out:
 }
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid, 
-   gid_t gid, int mode, dev_t dev)
+   gid_t gid, int mode, dev_t dev,
+   unsigned long creat_flags)
 {
struct inode *inode;
 
@@ -512,7 +516,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	INIT_LIST_HEAD(&inode->i_mapping->private_list);
 	info = HUGETLBFS_I(inode);
-	mpol_shared_policy_init(&info->policy, NULL);
+	info->creat_flags = creat_flags;
+	if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+		mpol_shared_policy_init(&info->policy, NULL);
 	switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
@@ -553,7 +559,8 @@ static int hugetlbfs_mknod(struct inode *dir,
 	} else {
 		gid = current->fsgid;
 	}
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev);
+	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev,
+				HUGETLB_RESERVE);
 	if (inode) {
 		dir->i_ctime = dir->i_mtime = CURRENT_TIME;
d_instantiate(dentry, inode);
@@ -589,7 +596,8 @@ static int hugetlbfs_symlink(struct inode *dir,
 	gid = current->fsgid;
 
 	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid,
-			gid, S_IFLNK|S_IRWXUGO, 0);
+			gid, S_IFLNK|S_IRWXUGO, 0,
+			HUGETLB_RESERVE);
if (inode) {
int l = strlen(symname)+1;
error = page_symlink(inode, symname, l);
@@ -693,7 +701,8 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	if (!(HUGETLBFS_I(inode)->creat_flags & HUGETLB_PRIVATE_INODE))
+		mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));

[PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages

2008-07-28 Thread Eric Munson
Currently the memory slice that holds the process stack is always initialized
to hold small pages.  This patch provides the powerpc implementation of the
weak function declared in the previous patch, converting the stack slice to
hugetlb pages.

Signed-off-by: Eric Munson [EMAIL PROTECTED]

---
Based on 2.6.26-rc8-mm1

Changes from V1:
Instead of setting the mm-wide page size to huge pages, set only the relevant
 slice psize using an arch-defined weak function.

 arch/powerpc/mm/hugetlbpage.c |6 ++
 arch/powerpc/mm/slice.c   |   11 +++
 include/asm-powerpc/hugetlb.h |3 +++
 3 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb42c4d..bd7f777 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -152,6 +152,12 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 }
 #endif
 
+void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+   unsigned long len)
+{
+   slice_convert_address(mm, addr, len, shift_to_mmu_psize(HPAGE_SHIFT));
+}
+
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
  */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 583be67..d984733 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -30,6 +30,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/hugetlb.h>
 #include <asm/mman.h>
 #include <asm/mmu.h>
 #include <asm/spu.h>
@@ -397,6 +398,16 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len,
 #define MMU_PAGE_BASE  MMU_PAGE_4K
 #endif
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+   unsigned long len, unsigned int psize)
+{
+   struct slice_mask mask;
+
+   mask = slice_range_to_mask(addr, len);
+   slice_convert(mm, mask, psize);
+   slice_flush_segments(mm);
+}
+
 unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
  unsigned long flags, unsigned int psize,
  int topdown, int use_cache)
diff --git a/include/asm-powerpc/hugetlb.h b/include/asm-powerpc/hugetlb.h
index 26f0d0a..10ef089 100644
--- a/include/asm-powerpc/hugetlb.h
+++ b/include/asm-powerpc/hugetlb.h
@@ -17,6 +17,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep);
 
+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+   unsigned long len, unsigned int psize);
+
 /*
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
-- 
1.5.6.1



[PATCH V3] Keep 3 high personality bytes across exec

2008-06-30 Thread Eric Munson
Currently when a 32 bit process is exec'd on a powerpc 64 bit host the value
in the top three bytes of the personality is clobbered.  This patch adds a
check in the SET_PERSONALITY macro that will carry all the values in the top
three bytes across the exec.

These three bytes currently carry flags to disable address randomisation,
limit the address space, force zeroing of an mmapped page, etc.  Should an
application set any of these bits, they will be maintained and honoured in a
homogeneous environment but discarded and ignored in a heterogeneous one.
So if an application requires all mmapped pages to be initialised to zero
and a wrapper is used to set up the personality and exec the target, these
flags will remain set in an all-32-bit or all-64-bit environment, but they
will be lost in the exec on a mixed 32/64-bit environment.  Losing these bits
means that the same application would behave differently in different
environments.  Tested on a POWER5+ machine with a 64-bit kernel and a mixed
64/32-bit user space.

Signed-off-by: Eric B Munson [EMAIL PROTECTED]
---
V3
Based on 2.6.26-rc8

Changes from V2:
Use ~PER_MASK instead of PER_INHERIT
Remove PER_INHERIT
Rebase to 2.6.26-rc8

Changes from V1:
Updated changelog with a better description of why this change is useful

 include/asm-powerpc/elf.h |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/asm-powerpc/elf.h b/include/asm-powerpc/elf.h
index 9080d85..5eee73e 100644
--- a/include/asm-powerpc/elf.h
+++ b/include/asm-powerpc/elf.h
@@ -257,7 +257,8 @@ do {								\
 	else							\
 		clear_thread_flag(TIF_ABI_PENDING);		\
 	if (personality(current->personality) != PER_LINUX32)	\
-		set_personality(PER_LINUX);			\
+		set_personality(PER_LINUX |			\
+			(current->personality & (~PER_MASK)));	\
 } while (0)
 /*
  * An executable for which elf_read_implies_exec() returns TRUE will
-- 
1.5.6.1
