[RFC] [PATCH 0/5 V2] Huge page backed user-space stacks
Certain workloads benefit when their data or text segments are backed by huge pages. The stack is no exception, but there is currently no mechanism that reliably backs a stack with huge pages. Doing this from userspace is excessively messy and carries awkward restrictions, particularly on POWER, where 256MB of address space is wasted if the stack is set up there.

This patch series introduces a personality flag that tells the kernel to set up the stack as a hugetlbfs-backed region. A userspace utility may set this flag and then exec a process whose stack is to be backed by hugetlb pages.

Eric Munson (5):
  Align stack boundaries based on personality
  Add shared and reservation control to hugetlb_file_setup
  Split boundary checking from body of do_munmap
  Build hugetlb backed process stacks
  [PPC] Setup stack memory segment for hugetlb pages

 arch/powerpc/mm/hugetlbpage.c |    6 +
 arch/powerpc/mm/slice.c       |   11 ++
 fs/exec.c                     |  209 ++--
 fs/hugetlbfs/inode.c          |   52 +++
 include/asm-powerpc/hugetlb.h |    3 +
 include/linux/hugetlb.h       |   22 +-
 include/linux/mm.h            |    1 +
 include/linux/personality.h   |    3 +
 ipc/shm.c                     |    2 +-
 mm/mmap.c                     |   11 ++-
 10 files changed, 284 insertions(+), 36 deletions(-)

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
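The userspace utility mentioned above can be very small. The sketch below is an assumption, not part of the series: it sets the proposed HUGETLB_STACK personality bit (value taken from patch 1 of this series; on an unpatched kernel the bit is stored but has no effect) and then execs the target. The helper names are ours.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/personality.h>

#define HUGETLB_STACK 0x0020000	/* proposed flag value, from patch 1 of this series */

/* Set the proposed HUGETLB_STACK personality bit for this process.
 * personality(0xffffffff) queries the current persona without changing it.
 * Returns 0 on success, -1 on failure. */
static int set_hugetlb_stack(void)
{
	unsigned long persona = (unsigned long)personality(0xffffffff);

	if (personality(persona | HUGETLB_STACK) == -1)
		return -1;
	return 0;
}

/* Exec 'argv' with the flag set; on a kernel carrying this series the
 * child's stack would then be hugetlb-backed. */
static int run_with_huge_stack(char **argv)
{
	if (set_hugetlb_stack() < 0)
		return -1;
	execvp(argv[0], argv);
	perror("execvp");	/* only reached if exec fails */
	return -1;
}
```

Since the personality is preserved across execve(), setting the flag in the wrapper is enough; the exec'd binary needs no modification.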
[PATCH 4/5 V2] Build hugetlb backed process stacks
This patch allows a process's stack to be backed by huge pages on request. The personality flag defined in a previous patch should be set before exec is called for the target process to use a huge page backed stack.

When the hugetlb file backing the stack is set up, it is sized to fit the stack-size ulimit, or 256 MB if the ulimit is unlimited. The GROWSUP and GROWSDOWN VM flags are turned off because a hugetlb backed vma is not resizable, so it is sized appropriately when created. When a process exceeds the stack size it receives a segfault, just as it would if it exceeded the ulimit.

Also, certain architectures require special setup for a memory region before huge pages can be used in that region. This patch defines a function with __attribute__ ((weak)) set that can be defined by these architectures to do any necessary setup. If it exists, it will be called right before the hugetlb file is mmapped.

Signed-off-by: Eric Munson [EMAIL PROTECTED]
---
Based on 2.6.26-rc8-mm1

Changes from V1:
 Add comment about not padding huge stacks
 Break personality_page_align helper and personality flag into separate patch
 Add move_to_huge_pages function that moves the stack onto huge pages
 Add hugetlb_mm_setup weak function for archs that require special setup
  to use hugetlb pages
 Rebase to 2.6.26-rc8-mm1

 fs/exec.c               |  194 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/hugetlb.h |    5 +
 2 files changed, 187 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c99ba24..bf9ead2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -50,6 +50,7 @@
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
 #include <linux/hugetlb.h>
+#include <linux/mman.h>

 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -59,6 +60,8 @@
 #include <linux/kmod.h>
 #endif

+#define HUGE_STACK_MAX (256*1024*1024)
+
 #ifdef __alpha__
 /* for /sbin/loader handling in search_binary_handler() */
 #include <linux/a.out.h>
@@ -189,7 +192,12 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		return NULL;

 	if (write) {
-		unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
+		/*
+		 * Args are always placed at the high end of the stack space
+		 * so this calculation will give the proper size and it is
+		 * compatible with huge page stacks.
+		 */
+		unsigned long size = bprm->vma->vm_end - pos;
 		struct rlimit *rlim;

 		/*
@@ -255,7 +263,10 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	 * configured yet.
 	 */
 	vma->vm_end = STACK_TOP_MAX;
-	vma->vm_start = vma->vm_end - PAGE_SIZE;
+	if (current->personality & HUGETLB_STACK)
+		vma->vm_start = vma->vm_end - HPAGE_SIZE;
+	else
+		vma->vm_start = vma->vm_end - PAGE_SIZE;
 	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
@@ -574,6 +585,156 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	return 0;
 }

+static struct file *hugetlb_stack_file(int stack_hpages)
+{
+	struct file *hugefile = NULL;
+
+	if (!stack_hpages) {
+		set_personality(current->personality & (~HUGETLB_STACK));
+		printk(KERN_DEBUG
+			"Stack rlimit set too low for huge page backed stack.\n");
+		return NULL;
+	}
+
+	hugefile = hugetlb_file_setup(HUGETLB_STACK_FILE,
+					HPAGE_SIZE * stack_hpages,
+					HUGETLB_PRIVATE_INODE);
+	if (unlikely(IS_ERR(hugefile))) {
+		/*
+		 * If huge pages are not available for this stack fall
+		 * back to normal pages for execution instead of
+		 * failing.
+		 */
+		printk(KERN_DEBUG
+			"Huge page backed stack unavailable for process %lu.\n",
+			(unsigned long)current->pid);
+		set_personality(current->personality & (~HUGETLB_STACK));
+		return NULL;
+	}
+	return hugefile;
+}
+
+static int move_to_huge_pages(struct linux_binprm *bprm,
+		struct vm_area_struct *vma, unsigned long shift)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *new_vma;
+	unsigned long old_end = vma->vm_end;
+	unsigned long old_start = vma->vm_start;
+	unsigned long new_end = old_end - shift;
+	unsigned long new_start, length;
+	unsigned long arg_size = new_end - bprm->p;
+	unsigned long flags = vma->vm_flags;
+	struct file *hugefile = NULL;
+	unsigned int stack_hpages = 0;
+	struct page **from_pages = NULL;
+	struct page **to_pages = NULL;
+	unsigned long num_pages = (arg_size / PAGE_SIZE) + 1;
+	int ret;
+	int i;
+
+#ifdef
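The sizing rule described in the changelog (fit the stack to the ulimit, capped at 256 MB when the ulimit is unlimited, and pass a whole number of huge pages to hugetlb_file_setup) can be modelled in userspace C. This is a sketch under assumptions: the helper name is ours, and truncating division matches the "rlimit too low" case that hugetlb_stack_file() rejects when the count is zero.

```c
#include <assert.h>

#define HUGE_STACK_MAX (256UL*1024*1024)	/* cap from the patch */

/* Model of the stack sizing policy: cap the stack-size rlimit at
 * HUGE_STACK_MAX (also used when the rlimit is RLIM_INFINITY) and
 * express it as a count of huge pages.  A return of 0 corresponds to
 * the "Stack rlimit set too low" fallback in hugetlb_stack_file(). */
static unsigned int stack_hpage_count(unsigned long rlim_cur,
				      unsigned long hpage_size)
{
	unsigned long size = rlim_cur;

	if (size == (unsigned long)-1 || size > HUGE_STACK_MAX)
		size = HUGE_STACK_MAX;	/* unlimited or oversized rlimit */
	return (unsigned int)(size / hpage_size);	/* whole huge pages only */
}
```

With the 16MB huge pages this series targets on POWER, an unlimited rlimit yields a 16-page (256MB) stack file, while an 8MB rlimit yields zero pages and falls back to a normal stack.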
[PATCH 1/5 V2] Align stack boundaries based on personality
This patch adds a personality flag that requests hugetlb pages be used for a process's stack. It adds a helper function that chooses the proper ALIGN macro based on the process personality, and calls this function from setup_arg_pages when aligning the stack address.

Signed-off-by: Andy Whitcroft [EMAIL PROTECTED]
Signed-off-by: Eric Munson [EMAIL PROTECTED]
---
Based on 2.6.26-rc8-mm1

Changes from V1:
 Rebase to 2.6.26-rc8-mm1

 fs/exec.c                   |   15 ++-
 include/linux/hugetlb.h     |    3 +++
 include/linux/personality.h |    3 +++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index af9b29c..c99ba24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/audit.h>
+#include <linux/hugetlb.h>

 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -155,6 +156,18 @@ exit:
 	goto out;
 }

+static unsigned long personality_page_align(unsigned long addr)
+{
+	if (current->personality & HUGETLB_STACK)
+#ifdef CONFIG_STACK_GROWSUP
+		return HPAGE_ALIGN(addr);
+#else
+		return addr & HPAGE_MASK;
+#endif
+
+	return PAGE_ALIGN(addr);
+}
+
 #ifdef CONFIG_MMU

 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
@@ -596,7 +609,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	bprm->p = vma->vm_end - stack_shift;
 #else
 	stack_top = arch_align_stack(stack_top);
-	stack_top = PAGE_ALIGN(stack_top);
+	stack_top = personality_page_align(stack_top);
 	stack_shift = vma->vm_end - stack_top;

 	bprm->p -= stack_shift;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9a71d4c..eed37d7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -95,6 +95,9 @@ static inline unsigned long hugetlb_total_pages(void)
 #ifndef HPAGE_MASK
 #define HPAGE_MASK	PAGE_MASK	/* Keep the compiler happy */
 #define HPAGE_SIZE	PAGE_SIZE
+
+/* to align the pointer to the (next) huge page boundary */
+#define HPAGE_ALIGN(addr)	ALIGN(addr, HPAGE_SIZE)
 #endif

 #endif /* !CONFIG_HUGETLB_PAGE */
diff --git a/include/linux/personality.h b/include/linux/personality.h
index a84e9ff..2bb0f95 100644
--- a/include/linux/personality.h
+++ b/include/linux/personality.h
@@ -22,6 +22,9 @@ extern int	__set_personality(unsigned long);
  * These occupy the top three bytes.
  */
 enum {
+	HUGETLB_STACK =		0x0020000,	/* Attempt to use hugetlb pages
+						 * for the process stack
+						 */
 	ADDR_NO_RANDOMIZE =	0x0040000,	/* disable randomization of VA space */
 	FDPIC_FUNCPTRS =	0x0080000,	/* userspace function ptrs point to
 						 * descriptors (signal handling)
-- 
1.5.6.1
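The two alignment cases in personality_page_align() can be modelled in userspace C. This is a sketch under assumptions: a 16MB huge page (the POWER HPAGE_SIZE this series targets) and helper names of our own. A grows-down stack top is rounded down to a huge page boundary (addr & HPAGE_MASK); a grows-up stack is rounded up (HPAGE_ALIGN).

```c
#include <assert.h>

#define HPAGE_SHIFT	24
#define HPAGE_SIZE	(1UL << HPAGE_SHIFT)	/* 16MB, assumed for POWER */
#define HPAGE_MASK	(~(HPAGE_SIZE - 1))

/* Grows-down case (CONFIG_STACK_GROWSUP not set): round the stack top
 * DOWN so the whole region below it is huge-page aligned. */
static unsigned long hpage_align_down(unsigned long addr)
{
	return addr & HPAGE_MASK;
}

/* Grows-up case: round UP to the next boundary, as HPAGE_ALIGN does. */
static unsigned long hpage_align_up(unsigned long addr)
{
	return (addr + HPAGE_SIZE - 1) & HPAGE_MASK;
}
```

Rounding the grows-down stack top down (rather than up) matters: rounding up would place the stack top above the arch-chosen address, outside the region the kernel set aside.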
[PATCH 3/5] Split boundary checking from body of do_munmap
Currently do_munmap pre-checks the unmapped address range against the valid address range for the process. However, during initial setup the stack may actually be outside this range; in particular it may initially be placed at the 64 bit stack address and later moved to the normal 32 bit stack location. In a later patch we will want to unmap the stack as part of relocating it into huge pages.

This patch moves the bulk of do_munmap into __do_munmap, which is not protected by the boundary checking. When an area that would normally fail these checks needs to be unmapped (e.g. unmapping a stack that was set up at the 64 bit TASK_SIZE for a 32 bit process), __do_munmap should be called directly. do_munmap continues to do the boundary checking and calls __do_munmap as appropriate.

Signed-off-by: Eric Munson [EMAIL PROTECTED]
---
Based on 2.6.26-rc8-mm1

 include/linux/mm.h |    1 +
 mm/mmap.c          |   11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4eeb3c..59c6f89 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,6 +1152,7 @@ out:
 	return ret;
 }

+extern int __do_munmap(struct mm_struct *, unsigned long, size_t);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 extern unsigned long do_brk(unsigned long, unsigned long);

diff --git a/mm/mmap.c b/mm/mmap.c
index 5b62e5d..4e56369 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1881,17 +1881,24 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	return 0;
 }

+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+	if (start > TASK_SIZE || len > TASK_SIZE-start)
+		return -EINVAL;
+	return __do_munmap(mm, start, len);
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge [EMAIL PROTECTED]
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;

-	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+	if (start & ~PAGE_MASK)
 		return -EINVAL;

 	if ((len = PAGE_ALIGN(len)) == 0)
-- 
1.5.6.1
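The boundary check that stays in do_munmap() is worth a second look: testing `len > TASK_SIZE - start` only after `start > TASK_SIZE` has been ruled out keeps the unsigned subtraction from wrapping. A small userspace model (the TASK_SIZE value here is illustrative only):

```c
#include <assert.h>
#include <stddef.h>

#define TASK_SIZE (1UL << 31)	/* illustrative placeholder value */

/* Userspace model of the range check kept in do_munmap() after the
 * split.  Returns 1 if the range is valid, 0 where the kernel would
 * return -EINVAL.  Because start <= TASK_SIZE is established first,
 * the subtraction TASK_SIZE - start cannot wrap around. */
static int munmap_range_ok(unsigned long start, size_t len)
{
	if (start > TASK_SIZE || len > TASK_SIZE - start)
		return 0;
	return 1;
}
```

__do_munmap() skips exactly this check, which is what lets the later stack-relocation patch unmap a stack placed above the 32-bit TASK_SIZE.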
[PATCH 2/5 V2] Add shared and reservation control to hugetlb_file_setup
There are two kinds of shared hugetlbfs mappings:
1. using the internal vfsmount: ipc/shm.c and shmctl()
2. mmap() of a /hugetlbfs/file with MAP_SHARED

There is one kind of private: mmap() of a /hugetlbfs/file with MAP_PRIVATE.

This patch adds a second class of private hugetlb-backed mapping, but does it by sharing code with the ipc shm. This is mostly because we need to do our stack setup at execve() time and can't go opening files from hugetlbfs; the kernel-internal vfsmount for shm lets us get around this. We truly want anonymous memory, but MAP_PRIVATE is close enough for now.

Currently, if the mapping on an internal mount is larger than a single huge page, one page is allocated, one is reserved, and the rest are faulted as needed. For hugetlb backed stacks we do not want any reserved pages. This patch gives the caller of hugetlb_file_setup the ability to control this behavior by specifying flags for private inodes and page reservations.

Signed-off-by: Eric Munson [EMAIL PROTECTED]
---
Based on 2.6.26-rc8-mm1

Changes from V1:
 Add creat_flags to struct hugetlbfs_inode_info
 Check if space should be reserved in hugetlbfs_file_mmap
 Rebase to 2.6.26-rc8-mm1

 fs/hugetlbfs/inode.c    |   52 ++++++++++++++++++++++++++++-------------
 include/linux/hugetlb.h |   18 ++++++++++---
 ipc/shm.c               |    2 +-
 3 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dbd01d2..2e960d6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -92,7 +92,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	 * way when do_mmap_pgoff unwinds (may be important on powerpc
 	 * and ia64).
 	 */
-	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
+	vma->vm_flags |= VM_HUGETLB;
 	vma->vm_ops = &hugetlb_vm_ops;

 	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
@@ -106,10 +106,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	ret = -ENOMEM;
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

-	if (hugetlb_reserve_pages(inode,
+	if (HUGETLBFS_I(inode)->creat_flags & HUGETLB_RESERVE) {
+		vma->vm_flags |= VM_RESERVED;
+		if (hugetlb_reserve_pages(inode,
 			vma->vm_pgoff >> huge_page_order(h),
 			len >> huge_page_shift(h), vma))
-		goto out;
+			goto out;
+	}

 	ret = 0;
 	hugetlb_prefault_arch_hook(vma->vm_mm);
@@ -496,7 +499,8 @@ out:
 }

 static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
-					gid_t gid, int mode, dev_t dev)
+					gid_t gid, int mode, dev_t dev,
+					unsigned long creat_flags)
 {
 	struct inode *inode;

@@ -512,7 +516,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy, NULL);
+		info->creat_flags = creat_flags;
+		if (!(creat_flags & HUGETLB_PRIVATE_INODE))
+			mpol_shared_policy_init(&info->policy, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -553,7 +559,8 @@ static int hugetlbfs_mknod(struct inode *dir,
 	} else {
 		gid = current->fsgid;
 	}
-	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev);
+	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid, gid, mode, dev,
+					HUGETLB_RESERVE);
 	if (inode) {
 		dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 		d_instantiate(dentry, inode);
@@ -589,7 +596,8 @@ static int hugetlbfs_symlink(struct inode *dir,
 		gid = current->fsgid;

 	inode = hugetlbfs_get_inode(dir->i_sb, current->fsuid,
-					gid, S_IFLNK|S_IRWXUGO, 0);
+					gid, S_IFLNK|S_IRWXUGO, 0,
+					HUGETLB_RESERVE);
 	if (inode) {
 		int l = strlen(symname)+1;
 		error = page_symlink(inode, symname, l);
@@ -693,7 +701,8 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	if (!(HUGETLBFS_I(inode)->creat_flags & HUGETLB_PRIVATE_INODE))
+		mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I
[PATCH 5/5 V2] [PPC] Setup stack memory segment for hugetlb pages
Currently the memory slice that holds the process stack is always initialized to hold small pages. This patch defines the weak function declared in the previous patch to convert the stack slice to hugetlb pages.

Signed-off-by: Eric Munson [EMAIL PROTECTED]
---
Based on 2.6.26-rc8-mm1

Changes from V1:
 Instead of setting the mm-wide page size to huge pages, set only the
  relevant slice psize using an arch defined weak function.

 arch/powerpc/mm/hugetlbpage.c |    6 ++
 arch/powerpc/mm/slice.c       |   11 +++
 include/asm-powerpc/hugetlb.h |    3 +++
 3 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fb42c4d..bd7f777 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -152,6 +152,12 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 }
 #endif

+void hugetlb_mm_setup(struct mm_struct *mm, unsigned long addr,
+			unsigned long len)
+{
+	slice_convert_address(mm, addr, len, shift_to_mmu_psize(HPAGE_SHIFT));
+}
+
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
  */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 583be67..d984733 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -30,6 +30,7 @@
 #include <linux/err.h>
 #include <linux/spinlock.h>
 #include <linux/module.h>
+#include <linux/hugetlb.h>
 #include <asm/mman.h>
 #include <asm/mmu.h>
 #include <asm/spu.h>
@@ -397,6 +398,16 @@ static unsigned long slice_find_area(struct mm_struct *mm, unsigned long len,
 #define MMU_PAGE_BASE	MMU_PAGE_4K
 #endif

+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+			unsigned long len, unsigned int psize)
+{
+	struct slice_mask mask;
+
+	mask = slice_range_to_mask(addr, len);
+	slice_convert(mm, mask, psize);
+	slice_flush_segments(mm);
+}
+
 unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 				      unsigned long flags, unsigned int psize,
 				      int topdown, int use_cache)
diff --git a/include/asm-powerpc/hugetlb.h b/include/asm-powerpc/hugetlb.h
index 26f0d0a..10ef089 100644
--- a/include/asm-powerpc/hugetlb.h
+++ b/include/asm-powerpc/hugetlb.h
@@ -17,6 +17,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep);

+void slice_convert_address(struct mm_struct *mm, unsigned long addr,
+			unsigned long len, unsigned int psize);
+
 /*
  * If the arch doesn't supply something else, assume that hugepage
  * size aligned regions are ok without further preparation.
-- 
1.5.6.1
[PATCH V3] Keep 3 high personality bytes across exec
Currently when a 32 bit process is exec'd on a powerpc 64 bit host the value in the top three bytes of the personality is clobbered. This patch adds a check in the SET_PERSONALITY macro that will carry all the values in the top three bytes across the exec.

These three bytes currently carry flags to disable address randomisation, limit the address space, force zeroing of an mmapped page, etc. Should an application set any of these bits they will be maintained and honoured on a homogeneous environment but discarded and ignored on a heterogeneous environment. So if an application requires all mmapped pages to be initialised to zero and a wrapper is used to set up the personality and exec the target, these flags will remain set on an all 32 or all 64 bit environment, but they will be lost in the exec on a mixed 32/64 bit environment. Losing these bits means that the same application would behave differently in different environments.

Tested on a POWER5+ machine with 64bit kernel and a mixed 64/32 bit user space.

Signed-off-by: Eric B Munson [EMAIL PROTECTED]
---
V3 Based on 2.6.26-rc8

Changes from V2:
 Use ~PER_MASK instead of PER_INHERIT
 Remove PER_INHERIT
 Rebase to 2.6.26-rc8

Changes from V1:
 Updated changelog with a better description of why this change is useful

 include/asm-powerpc/elf.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/asm-powerpc/elf.h b/include/asm-powerpc/elf.h
index 9080d85..5eee73e 100644
--- a/include/asm-powerpc/elf.h
+++ b/include/asm-powerpc/elf.h
@@ -257,7 +257,8 @@ do {	\
 	else							\
 		clear_thread_flag(TIF_ABI_PENDING);		\
 	if (personality(current->personality) != PER_LINUX32)	\
-		set_personality(PER_LINUX);			\
+		set_personality(PER_LINUX |			\
+				(current->personality & (~PER_MASK))); \
 } while (0)
 /*
  * An executable for which elf_read_implies_exec() returns TRUE will
-- 
1.5.6.1
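The effect of the SET_PERSONALITY change can be modelled in userspace C. The constants below match include/linux/personality.h (PER_MASK selects the low "base personality" byte); the helper name is ours.

```c
#include <assert.h>

#define PER_LINUX	0x0000
#define PER_LINUX32	0x0008
#define PER_MASK	0x00ff	/* low byte = base personality */

/* Model of the patched SET_PERSONALITY: leave a PER_LINUX32 persona
 * untouched, otherwise reset the base personality to PER_LINUX while
 * carrying the flag bits in the top three bytes (ADDR_NO_RANDOMIZE
 * and friends) across the exec.  Before this patch, the whole value
 * was replaced by PER_LINUX, clobbering those flags. */
static unsigned long exec_personality(unsigned long old)
{
	if ((old & PER_MASK) == PER_LINUX32)
		return old;
	return PER_LINUX | (old & ~PER_MASK);
}
```

With ADDR_NO_RANDOMIZE (0x0040000) set in the old persona, the flag now survives the exec even though the base personality is reset.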