Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
> And in almost all invoking cases, order is 0, so it returns immediately.

You can't be sure of that.

Okay. Let's not invoke it when order is 0.

Let's not ruin git blame.

Hmm... when I do git blame, I can't find anything related to this.

I mean that if we merge a pointless patch, it will show up first in git blame instead of the meaningful patch, which makes it bothersome when we look for the patch to blame.

Oh... I see. Thanks for the comments.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/7 David Rientjes rient...@google.com:
> On Sat, 7 Jul 2012, Joonsoo Kim wrote:
>> __alloc_pages_direct_compact has many arguments, so invoking it is very costly. And in almost all invoking cases, order is 0, so it returns immediately.
>
> If zero cost is very costly, then this might make sense. __alloc_pages_direct_compact() is inlined by gcc.

In my kernel image, __alloc_pages_direct_compact() is not inlined by gcc, so I sent this patch. But currently I think it is not useful, so let's drop it. Thanks for the comments.
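For readers following the thread, the "zero cost" David refers to is the order-0 guard at the top of __alloc_pages_direct_compact() in kernels of this era. The sketch below is purely illustrative: the argument list is heavily abbreviated and is not the real signature.

```c
/*
 * Sketch only: the real function takes many more parameters
 * (zonelist, nodemask, migratetype, ...).  The point is the order-0
 * early return, which costs nothing once the call is inlined into
 * its caller.
 */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
			     unsigned long *did_some_progress)
{
	if (!order)
		return NULL;	/* compaction cannot help order-0 requests */

	/* ... run memory compaction, then retry the allocation ... */
	return NULL;
}
```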
Re: WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow)
2012/7/8 Fengguang Wu fengguang...@intel.com:
> Hi Vegard,
>
> This warning code is triggered for the attached config:
>
> __lockdep_trace_alloc():
>	/*
>	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
>	 */
>	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
>		return;
>
> Where the irq is possibly disabled at the beginning of __slab_alloc():
>
>	local_irq_save(flags);

Currently, in the slub code, kmemcheck_alloc_shadow() is always invoked with irqs disabled. I think that something like the following is needed.

diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..5d41cad 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1324,8 +1324,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 			!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
 		int pages = 1 << oo_order(oo);
 
+		if (flags & __GFP_WAIT)
+			local_irq_enable();
+
 		kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
 
+		if (flags & __GFP_WAIT)
+			local_irq_disable();
+
 		/*
 		 * Objects from caches that have a constructor don't get
 		 * cleared when they're allocated, so we need to do it here.
-- 
1.7.0.4
Re: [PATCH 3/3] slub: release a lock if freeing object with a lock is failed in __slab_free()
2012/7/7 Christoph Lameter c...@linux.com:
> On Fri, 6 Jul 2012, JoonSoo Kim wrote:
>> At CPU2, we don't need the lock anymore, because this slab is already in the partial list.
>
> For that scenario we could also simply do a trylock there and redo the loop if we fail. But still, what guarantees that another process will not modify the page struct between fetching the data and a successful trylock?

I'm not familiar with English, so please take my ability to understand into consideration. I have a hard time understanding what you want to accomplish here. We don't need a guarantee that another process will not modify the page struct between fetching the data and a successful trylock. No, we do not need that, since the cmpxchg will then fail.

> Maybe it would be useful to split this patch into two? One where you introduce the dropping of the lock and the other where you get rid of certain code paths?

Dropping the lock is needed for getting rid of certain code paths, so I can't split this patch into two. Sorry for confusing everyone; I think I didn't explain my purpose well. I will prepare a new version in which I explain the purpose of the patch better. Thanks for the kind review.
Re: WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow)
2012/7/9 David Rientjes rient...@google.com:
> On Mon, 9 Jul 2012, JoonSoo Kim wrote:
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 8c691fa..5d41cad 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1324,8 +1324,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>>  			!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
>>  		int pages = 1 << oo_order(oo);
>>  
>> +		if (flags & __GFP_WAIT)
>> +			local_irq_enable();
>> +
>>  		kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
>>  
>> +		if (flags & __GFP_WAIT)
>> +			local_irq_disable();
>> +
>>  		/*
>>  		 * Objects from caches that have a constructor don't get
>>  		 * cleared when they're allocated, so we need to do it here.
>
> This patch is suboptimal when the branch is taken since you just disabled irqs and now are immediately reenabling them and then disabling them again. (And your patch is also whitespace damaged, has no changelog, and isn't signed off, so it can't be applied.)

My intent was just to provide a reference, because there had been no reply to this thread when I saw it.

> The correct fix is what I proposed at http://marc.info/?l=linux-kernel&m=133754837703630 and was awaiting testing. If Rus, Steven, or Fengguang could test this then we could add it as a stable backport as well.

Your patch looks good to me.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/9 David Rientjes rient...@google.com:
> On Sun, 8 Jul 2012, JoonSoo Kim wrote:
>> __alloc_pages_direct_compact has many arguments, so invoking it is very costly. And in almost all invoking cases, order is 0, so it returns immediately.
>
> If zero cost is very costly, then this might make sense. __alloc_pages_direct_compact() is inlined by gcc.

In my kernel image, __alloc_pages_direct_compact() is not inlined by gcc.

> Adding Andrew and Mel to the thread since this would require that we revert 11e33f6a55ed ("page allocator: break up the allocator entry point into fast and slow paths"), which would obviously not be a clean revert since there have been several changes to these functions over the past three years.

Only __alloc_pages_direct_compact() is not inlined. All the others (__alloc_pages_high_priority, __alloc_pages_direct_reclaim, ...) are inlined correctly, so a revert is not needed. I think __alloc_pages_direct_compact() can't be inlined by gcc because it is so big and is invoked two times in __alloc_pages_nodemask().

> I'm stunned (and skeptical) that __alloc_pages_direct_compact() is not inlined by your gcc, especially since the kernel must be compiled with optimization (either -O1 or -O2, which causes these functions to be inlined). What version of gcc are you using and on what architecture? Please do "make mm/page_alloc.s" and send it to me privately, and I'll file this and fix it up on gcc-bugs.

I will send the result of "make mm/page_alloc.s" to you privately. My environment is x86_64, GNU C version 4.6.3.

> I'll definitely be following up on this.

Thanks for the comments.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/10 Mel Gorman mgor...@suse.de:
> You say that invoking the function is very costly. I agree that a function call with that many parameters is hefty, but it is also in the slow path of the allocator. For order-0 allocations we are about to enter direct reclaim, where I would expect the cost far exceeds the cost of a function call.

Yes, I agree.

> If the cost is indeed high and you have seen this in profiles, then I suggest you create a forced inline function alloc_pages_direct_compact that does this:
>
>	if (order != 0)
>		__alloc_pages_direct_compact(...)
>
> and then call alloc_pages_direct_compact instead of __alloc_pages_direct_compact. After that, recheck the profiles (although I expect the difference to be marginal) and the size of vmlinux (if it gets bigger, it's probably not worth it). That would be functionally similar to your patch but it will preserve git blame, churn less code and be harder to make mistakes with in the unlikely event a third call to alloc_pages_direct_compact is ever added.

Your suggestion looks good. But the size of page_alloc.o becomes larger than before. I tested 3 approaches: vanilla, always_inline, and wrapping (alloc_pages_direct_compact, which is your suggestion). In my environment (v3.5-rc5, gcc 4.6.3, x86_64), page_alloc.o shows the numbers below.

                         total    .text    .text.unlikely
 page_alloc_vanilla.o:   93432    0x510a   0x243
 page_alloc_inline.o:    93336    0x52ca   0xa4
 page_alloc_wrapping.o:  93528    0x515a   0x238

Andrew said that inlining adds only 26 bytes to the .text of page_alloc.o, but on my system it needs more than that. Currently, I think this patch doesn't have an obvious benefit, so I want to drop it. Any objections?

Thanks for the good comments.
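A minimal sketch of the wrapper Mel describes follows. The name mirrors his suggestion, but the reduced argument list is purely illustrative (the real __alloc_pages_direct_compact() takes far more parameters); this shows the idea, not the code that was merged.

```c
/*
 * Forced-inline guard: order-0 callers never pay for the out-of-line
 * call, while the existing function body (and its git history) is
 * left untouched.
 */
static __always_inline struct page *
alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
			   unsigned long *did_some_progress)
{
	if (!order)
		return NULL;

	return __alloc_pages_direct_compact(gfp_mask, order,
					    did_some_progress);
}
```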
Re: [Patch 4/7] softirq: Use hotplug thread infrastructure
2012/7/16 Thomas Gleixner t...@linutronix.de:
> -		static const struct sched_param param = {
> -			.sched_priority = MAX_RT_PRIO-1
> -		};
> -
> -		p = per_cpu(ksoftirqd, hotcpu);
> -		per_cpu(ksoftirqd, hotcpu) = NULL;
> -		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> -		kthread_stop(p);
> +		int hotcpu = (unsigned long)hcpu;
> +		takeover_tasklets(hotcpu);
>  		break;

The new "int hotcpu = (unsigned long)hcpu;" looks somewhat strange here. takeover_tasklets((unsigned long)hcpu) would be sufficient for me.
Re: [Patch 0/7] Per cpu thread hotplug infrastructure - V3
2012/7/16 Thomas Gleixner t...@linutronix.de:
> The following series implements the infrastructure for parking and unparking kernel threads to avoid the full teardown and fork on cpu hotplug operations, along with management infrastructure for hotplug and users.
>
> Changes vs. V2:
>   Use callbacks for all functionality. Thanks to Rusty for pointing that out. It makes the use sites nice and simple and keeps all the code which would be duplicated otherwise on the core.
>
> Thanks, tglx

I'm just a kernel newbie, so I have a dumb question. Why doesn't this patchset include a patch for the migration threads in kernel/stop_machine.c? Is there any reason for that?
[RFC PATCH 0/8] remove vm_struct list management
This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).

And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.

I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.

And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.

This series is based on v3.7-rc7 and on top of a submitted patchset for ARM, 'introduce static_vm for ARM-specific static mapped area' (https://lkml.org/lkml/2012/11/27/356), but it runs properly on x86 without the ARM patchset.

Joonsoo Kim (8):
  mm, vmalloc: change iterating a vmlist to find_vm_area()
  mm, vmalloc: move get_vmalloc_info() to vmalloc.c
  mm, vmalloc: protect va->vm by vmap_area_lock
  mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()
  mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()
  mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()
  mm, vmalloc: makes vmlist only for kexec
  mm, vmalloc: remove list management operation after initializing vmalloc

 arch/tile/mm/pgtable.c      |   7 +-
 arch/unicore32/mm/ioremap.c |  17 +--
 arch/x86/mm/ioremap.c       |   7 +-
 fs/proc/Makefile            |   2 +-
 fs/proc/internal.h          |  18 ---
 fs/proc/meminfo.c           |   1 +
 fs/proc/mmu.c               |  60 --
 include/linux/vmalloc.h     |  19 +++-
 mm/vmalloc.c                | 258 +--
 9 files changed, 204 insertions(+), 185 deletions(-)
 delete mode 100644 fs/proc/mmu.c

-- 
1.7.9.5
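To make the complexity argument concrete, the snippet below is roughly the linear walk that linking a new area into the address-sorted vmlist costs today; it is simplified from insert_vmalloc_vmlist() in mm/vmalloc.c of this era (see patch 8/8, where it is removed). The vmap_area rbtree has already found a vacant range in O(logN) before this O(N) insertion runs.

```c
/* Simplified sketch of the O(N) insertion into the address-sorted,
 * singly linked vmlist that this series removes. */
struct vm_struct *tmp, **p;

write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
	if (tmp->addr >= vm->addr)
		break;
}
vm->next = *p;
*p = vm;
write_unlock(&vmlist_lock);
```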
[RFC PATCH 2/8] mm, vmalloc: move get_vmalloc_info() to vmalloc.c
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this code must be here and it's implementation needs vmlist_lock and it iterate a vmlist which may be internal data structure for vmalloc. It is preferable that vmlist_lock and vmlist is only used in vmalloc.c for maintainability. So move the code to vmalloc.c Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/fs/proc/Makefile b/fs/proc/Makefile index 99349ef..88092c1 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -5,7 +5,7 @@ obj-y += proc.o proc-y := nommu.o task_nommu.o -proc-$(CONFIG_MMU) := mmu.o task_mmu.o +proc-$(CONFIG_MMU) := task_mmu.o proc-y += inode.o root.o base.o generic.o array.o \ proc_tty.o fd.o diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 43973b0..5a1eda2 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -28,24 +28,6 @@ extern int proc_net_init(void); static inline int proc_net_init(void) { return 0; } #endif -struct vmalloc_info { - unsigned long used; - unsigned long largest_chunk; -}; - -#ifdef CONFIG_MMU -#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) -extern void get_vmalloc_info(struct vmalloc_info *vmi); -#else - -#define VMALLOC_TOTAL 0UL -#define get_vmalloc_info(vmi) \ -do { \ - (vmi)-used = 0;\ - (vmi)-largest_chunk = 0; \ -} while(0) -#endif - extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task); extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 80e4645..c594dfb 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -11,6 +11,7 @@ #include linux/swap.h #include linux/vmstat.h #include linux/atomic.h +#include linux/vmalloc.h #include asm/page.h #include asm/pgtable.h #include internal.h diff --git a/fs/proc/mmu.c b/fs/proc/mmu.c deleted file mode 100644 index 8ae221d..000 --- a/fs/proc/mmu.c +++ /dev/null @@ -1,60 +0,0 @@ -/* mmu.c: mmu memory info files - * - * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved. - * Written by David Howells (dhowe...@redhat.com) - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ -#include linux/spinlock.h -#include linux/vmalloc.h -#include linux/highmem.h -#include asm/pgtable.h -#include internal.h - -void get_vmalloc_info(struct vmalloc_info *vmi) -{ - struct vm_struct *vma; - unsigned long free_area_size; - unsigned long prev_end; - - vmi-used = 0; - - if (!vmlist) { - vmi-largest_chunk = VMALLOC_TOTAL; - } - else { - vmi-largest_chunk = 0; - - prev_end = VMALLOC_START; - - read_lock(vmlist_lock); - - for (vma = vmlist; vma; vma = vma-next) { - unsigned long addr = (unsigned long) vma-addr; - - /* -* Some archs keep another range for modules in vmlist -*/ - if (addr VMALLOC_START) - continue; - if (addr = VMALLOC_END) - break; - - vmi-used += vma-size; - - free_area_size = addr - prev_end; - if (vmi-largest_chunk free_area_size) - vmi-largest_chunk = free_area_size; - - prev_end = vma-size + addr; - } - - if (VMALLOC_END - prev_end vmi-largest_chunk) - vmi-largest_chunk = VMALLOC_END - prev_end; - - read_unlock(vmlist_lock); - } -} diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 6071e91..698b1e5 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -158,4 +158,22 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) # endif #endif +struct vmalloc_info { + unsigned long used; + unsigned long largest_chunk; +}; + +#ifdef CONFIG_MMU +#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) +extern void get_vmalloc_info(struct vmalloc_info *vmi); +#else + +#define VMALLOC_TOTAL 0UL +#define get_vmalloc_info(vmi) \ +do { \ + (vmi)-used = 0;\ + (vmi)-largest_chunk = 0; \ +} while (0) +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 78e0830..16147bc 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2642,5 +2642,49 @@ static int __init proc_vmalloc_init(void
[RFC PATCH 5/8] mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()
This patch is preparing step for removing vmlist entirely. For above purpose, we change iterating a vmap_list codes to iterating a vmap_area_list. It is somewhat trivial change, but just one thing should be noticed. vmlist is lack of information about some areas in vmalloc address space. For example, vm_map_ram() allocate area in vmalloc address space, but it doesn't make a link with vmlist. To provide full information about vmalloc address space is better idea, so we don't use va-vm and use vmap_area directly. This makes get_vmalloc_info() more precise. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d21167f..f7f4a35 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2668,46 +2668,47 @@ module_init(proc_vmalloc_init); void get_vmalloc_info(struct vmalloc_info *vmi) { - struct vm_struct *vma; + struct vmap_area *va; unsigned long free_area_size; unsigned long prev_end; vmi-used = 0; + vmi-largest_chunk = 0; - if (!vmlist) { - vmi-largest_chunk = VMALLOC_TOTAL; - } else { - vmi-largest_chunk = 0; + prev_end = VMALLOC_START; - prev_end = VMALLOC_START; + spin_lock(vmap_area_lock); - read_lock(vmlist_lock); + if (list_empty(vmap_area_list)) { + vmi-largest_chunk = VMALLOC_TOTAL; + goto out; + } - for (vma = vmlist; vma; vma = vma-next) { - unsigned long addr = (unsigned long) vma-addr; + list_for_each_entry(va, vmap_area_list, list) { + unsigned long addr = va-va_start; - /* -* Some archs keep another range for modules in vmlist -*/ - if (addr VMALLOC_START) - continue; - if (addr = VMALLOC_END) - break; + /* +* Some archs keep another range for modules in vmalloc space +*/ + if (addr VMALLOC_START) + continue; + if (addr = VMALLOC_END) + break; - vmi-used += vma-size; + vmi-used += (va-va_end - va-va_start); - free_area_size = addr - prev_end; - if (vmi-largest_chunk free_area_size) - vmi-largest_chunk = free_area_size; + free_area_size = addr - prev_end; + if (vmi-largest_chunk free_area_size) + vmi-largest_chunk = free_area_size; - prev_end = vma-size + addr; - } + prev_end = va-va_end; + } - if (VMALLOC_END - prev_end vmi-largest_chunk) - vmi-largest_chunk = VMALLOC_END - prev_end; + if (VMALLOC_END - prev_end vmi-largest_chunk) + vmi-largest_chunk = VMALLOC_END - prev_end; - read_unlock(vmlist_lock); - } +out: + spin_unlock(vmap_area_lock); } #endif -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 7/8] mm, vmalloc: makes vmlist only for kexec
Although our intention remove vmlist entirely, but there is one exception. kexec use vmlist symbol, and we can't remove it, because it is related to userspace program. When kexec dumps system information, it write vmlist address and vm_struct's address offset. In userspace program, these information is used for getting first address in vmalloc space. Then it dumps memory content in vmalloc space which is higher than this address. For supporting this optimization, we should maintain a vmlist. But this doesn't means that we should maintain full vmlist. Just one vm_struct for vmlist is sufficient. So use vmlist_early for full chain of vm_struct and assign a dummy_vm to vmlist for supporting kexec. Cc: Eric Biederman ebied...@xmission.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f134950..8a1b959 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -272,6 +272,27 @@ static unsigned long cached_align; static unsigned long vmap_area_pcpu_hole; +/*** Old vmalloc interfaces ***/ +DEFINE_RWLOCK(vmlist_lock); +/* vmlist is only for kexec */ +struct vm_struct *vmlist; +static struct vm_struct dummy_vm; + +/* This is only for kexec. + * It wants to know first vmalloc address for optimization */ +static void setup_vmlist(void) +{ + struct vmap_area *va; + + if (list_empty(vmap_area_list)) { + vmlist = NULL; + } else { + va = list_entry((vmap_area_list)-next, typeof(*va), list); + dummy_vm.addr = (void *)va-va_start; + vmlist = dummy_vm; + } +} + static struct vmap_area *__find_vmap_area(unsigned long addr) { struct rb_node *n = vmap_area_root.rb_node; @@ -313,7 +334,7 @@ static void __insert_vmap_area(struct vmap_area *va) rb_link_node(va-rb_node, parent, p); rb_insert_color(va-rb_node, vmap_area_root); - /* address-sort this list so it is usable like the vmlist */ + /* address-sort this list so it is usable like the vmlist_early */ tmp = rb_prev(va-rb_node); if (tmp) { struct vmap_area *prev; @@ -321,6 +342,8 @@ static void __insert_vmap_area(struct vmap_area *va) list_add_rcu(va-list, prev-list); } else list_add_rcu(va-list, vmap_area_list); + + setup_vmlist(); } static void purge_vmap_area_lazy(void); @@ -485,6 +508,8 @@ static void __free_vmap_area(struct vmap_area *va) vmap_area_pcpu_hole = max(vmap_area_pcpu_hole, va-va_end); kfree_rcu(va, rcu_head); + + setup_vmlist(); } /* @@ -1125,11 +1150,13 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro } EXPORT_SYMBOL(vm_map_ram); +static struct vm_struct *vmlist_early; + /** * vm_area_add_early - add vmap area early during boot * @vm: vm_struct to add * - * This function is used to add fixed kernel vm area to vmlist before + * This function is used to add fixed kernel vm area to vmlist_early before * vmalloc_init() is called. @vm-addr, @vm-size, and @vm-flags * should contain proper values and the other fields should be zero. * @@ -1140,7 +1167,7 @@ void __init vm_area_add_early(struct vm_struct *vm) struct vm_struct *tmp, **p; BUG_ON(vmap_initialized); - for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { + for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) { BUG_ON(tmp-addr vm-addr + vm-size); break; @@ -1190,8 +1217,8 @@ void __init vmalloc_init(void) INIT_LIST_HEAD(vbq-free); } - /* Import existing vmlist entries. */ - for (tmp = vmlist; tmp; tmp = tmp-next) { + /* Import existing vmlist_early entries. 
*/ + for (tmp = vmlist_early; tmp; tmp = tmp-next) { va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT); va-flags = VM_VM_AREA; va-va_start = (unsigned long)tmp-addr; @@ -1283,10 +1310,6 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) } EXPORT_SYMBOL_GPL(map_vm_area); -/*** Old vmalloc interfaces ***/ -DEFINE_RWLOCK(vmlist_lock); -struct vm_struct *vmlist; - static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { @@ -1313,7 +1336,7 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) vm-flags = ~VM_UNLIST; write_lock(vmlist_lock); - for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { + for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) break; } @@ -1369,7 +1392,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, /* * When this function is called from __vmalloc_node_range, -* we do not add vm_struct to vmlist here to avoid
[RFC PATCH 8/8] mm, vmalloc: remove list management operation after initializing vmalloc
Now, there is no need to maintain vmlist_early after initializing vmalloc. So remove related code and data structure. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 698b1e5..10d19c9 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -130,7 +130,6 @@ extern long vwrite(char *buf, char *addr, unsigned long count); /* * Internals. Dont't use.. */ -extern rwlock_t vmlist_lock; extern struct vm_struct *vmlist; extern __init void vm_area_add_early(struct vm_struct *vm); extern __init void vm_area_register_early(struct vm_struct *vm, size_t align); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 8a1b959..957a098 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -272,8 +272,6 @@ static unsigned long cached_align; static unsigned long vmap_area_pcpu_hole; -/*** Old vmalloc interfaces ***/ -DEFINE_RWLOCK(vmlist_lock); /* vmlist is only for kexec */ struct vm_struct *vmlist; static struct vm_struct dummy_vm; @@ -334,7 +332,7 @@ static void __insert_vmap_area(struct vmap_area *va) rb_link_node(va-rb_node, parent, p); rb_insert_color(va-rb_node, vmap_area_root); - /* address-sort this list so it is usable like the vmlist_early */ + /* address-sort this list */ tmp = rb_prev(va-rb_node); if (tmp) { struct vmap_area *prev; @@ -1150,7 +1148,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro } EXPORT_SYMBOL(vm_map_ram); -static struct vm_struct *vmlist_early; +static struct vm_struct *vmlist_early __initdata; /** * vm_area_add_early - add vmap area early during boot @@ -1323,7 +1321,7 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, spin_unlock(vmap_area_lock); } -static void insert_vmalloc_vmlist(struct vm_struct *vm) +static void remove_vm_unlist(struct vm_struct *vm) { struct vm_struct *tmp, **p; @@ -1334,22 +1332,13 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) */ smp_wmb(); vm-flags = ~VM_UNLIST; - - write_lock(vmlist_lock); - for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { - if (tmp-addr = vm-addr) - break; - } - vm-next = *p; - *p = vm; - write_unlock(vmlist_lock); } static void insert_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { setup_vmalloc_vm(vm, va, flags, caller); - insert_vmalloc_vmlist(vm); + remove_vm_unlist(vm); } static struct vm_struct *__get_vm_area_node(unsigned long size, @@ -1392,10 +1381,9 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, /* * When this function is called from __vmalloc_node_range, -* we do not add vm_struct to vmlist_early here to avoid -* accessing uninitialized members of vm_struct such as -* pages and nr_pages fields. They will be set later. -* To distinguish it from others, we use a VM_UNLIST flag. +* we add VM_UNLIST flag to avoid accessing uninitialized +* members of vm_struct such as pages and nr_pages fields. +* They will be set later. */ if (flags VM_UNLIST) setup_vmalloc_vm(area, va, flags, caller); @@ -1483,21 +1471,6 @@ struct vm_struct *remove_vm_area(const void *addr) va-flags = ~VM_VM_AREA; spin_unlock(vmap_area_lock); - if (!(vm-flags VM_UNLIST)) { - struct vm_struct *tmp, **p; - /* -* remove from list and disallow access to -* this vm_struct before unmap. (address range -* confliction is maintained by vmap.) 
-*/ - write_lock(vmlist_lock); - for (p = vmlist_early; (tmp = *p) != vm; - p = tmp-next) - ; - *p = tmp-next; - write_unlock(vmlist_lock); - } - vmap_debug_free_range(va-va_start, va-va_end); free_unmap_vmap_area(va); vm-size -= PAGE_SIZE; @@ -1717,10 +1690,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, return NULL; /* -* In this function, newly allocated vm_struct is not added -* to vmlist_early at __get_vm_area_node(). so, it is added here. +* In this function, newly allocated vm_struct has VM_UNLIST flag. +* It means that vm_struct is not fully initialized. +* Now, it is fully initialized, so remove this flag here. */ - insert_vmalloc_vmlist(area); + remove_vm_unlist(area); /* * A ref_count = 3
[RFC PATCH 6/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()
This patch is preparing step for removing vmlist entirely. For above purpose, we change iterating a vmap_list codes to iterating a vmap_area_list. It is somewhat trivial change, but just one thing should be noticed. Using vmap_area_list in vmallocinfo() introduce ordering problem in SMP system. In s_show(), we retrieve some values from vm_struct. vm_struct's values is not fully setup when va-vm is assigned. Full setup is notified by removing VM_UNLIST flag without holding a lock. When we see that VM_UNLIST is removed, it is not ensured that vm_struct has proper values in view of other CPUs. So we need smp_[rw]mb for ensuring that proper values is assigned when we see that VM_UNLIST is removed. Therefore, this patch not only change a iteration list, but also add a appropriate smp_[rw]mb to right places. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f7f4a35..f134950 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1304,7 +1304,14 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) { struct vm_struct *tmp, **p; + /* +* Before removing VM_UNLIST, +* we should make sure that vm has proper values. +* Pair with smp_rmb() in show_numa_info(). +*/ + smp_wmb(); vm-flags = ~VM_UNLIST; + write_lock(vmlist_lock); for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) @@ -2539,19 +2546,19 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) - __acquires(vmlist_lock) + __acquires(vmap_area_lock) { loff_t n = *pos; - struct vm_struct *v; + struct vmap_area *va; - read_lock(vmlist_lock); - v = vmlist; - while (n 0 v) { + spin_lock(vmap_area_lock); + va = list_entry((vmap_area_list)-next, typeof(*va), list); + while (n 0 va-list != vmap_area_list) { n--; - v = v-next; + va = list_entry(va-list.next, typeof(*va), list); } - if (!n) - return v; + if (!n va-list != vmap_area_list) + return va; return NULL; @@ -2559,16 +2566,20 @@ static void *s_start(struct seq_file *m, loff_t *pos) static void *s_next(struct seq_file *m, void *p, loff_t *pos) { - struct vm_struct *v = p; + struct vmap_area *va = p, *next; ++*pos; - return v-next; + next = list_entry(va-list.next, typeof(*va), list); + if (next-list != vmap_area_list) + return next; + + return NULL; } static void s_stop(struct seq_file *m, void *p) - __releases(vmlist_lock) + __releases(vmap_area_lock) { - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); } static void show_numa_info(struct seq_file *m, struct vm_struct *v) @@ -2579,6 +2590,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) if (!counters) return; + /* Pair with smp_wmb() in insert_vmalloc_vmlist() */ + smp_rmb(); + if (v-flags VM_UNLIST) + return; + memset(counters, 0, nr_node_ids * sizeof(unsigned int)); for (nr = 0; nr v-nr_pages; nr++) @@ -2592,36 +2608,50 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) static int s_show(struct seq_file *m, void *p) { - struct vm_struct *v = p; + struct vmap_area *va = p; + struct vm_struct *vm; + + if (!(va-flags VM_VM_AREA)) { + seq_printf(m, 0x%pK-0x%pK %7ld, + (void *)va-va_start, (void *)va-va_end, + va-va_end - va-va_start); + if (va-flags (VM_LAZY_FREE | VM_LAZY_FREEING)) + seq_printf(m, (freeing)); + + seq_putc(m, '\n'); + return 0; + } + + vm = va-vm; seq_printf(m, 0x%pK-0x%pK %7ld, - v-addr, v-addr + v-size, v-size); + vm-addr, vm-addr + vm-size, vm-size); - if (v-caller) - seq_printf(m, %pS, v-caller); + if (vm-caller) + seq_printf(m, 
%pS, vm-caller); - if (v-nr_pages) - seq_printf(m, pages=%d, v-nr_pages); + if (vm-nr_pages) + seq_printf(m, pages=%d, vm-nr_pages); - if (v-phys_addr) - seq_printf(m, phys=%llx, (unsigned long long)v-phys_addr); + if (vm-phys_addr) + seq_printf(m, phys=%llx, (unsigned long long)vm-phys_addr); - if (v-flags VM_IOREMAP) + if (vm-flags VM_IOREMAP) seq_printf(m, ioremap); - if (v-flags VM_ALLOC) + if (vm-flags VM_ALLOC) seq_printf(m, vmalloc); - if (v-flags VM_MAP) + if (vm-flags VM_MAP
[RFC PATCH 4/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()
Now, when we hold a vmap_area_lock, va-vm can't be discarded. So we can safely access to va-vm when iterating a vmap_area_list with holding a vmap_area_lock. With this property, change iterating vmlist codes in vread/vwrite() to iterating vmap_area_list. There is a little difference relate to lock, because vmlist_lock is mutex, but, vmap_area_lock is spin_lock. It may introduce a spinning overhead during vread/vwrite() is executing. But, these are debug-oriented functions, so this overhead is not real problem for common case. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a0b85a6..d21167f 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2009,7 +2009,8 @@ static int aligned_vwrite(char *buf, char *addr, unsigned long count) long vread(char *buf, char *addr, unsigned long count) { - struct vm_struct *tmp; + struct vmap_area *va; + struct vm_struct *vm; char *vaddr, *buf_start = buf; unsigned long buflen = count; unsigned long n; @@ -2018,10 +2019,17 @@ long vread(char *buf, char *addr, unsigned long count) if ((unsigned long) addr + count count) count = -(unsigned long) addr; - read_lock(vmlist_lock); - for (tmp = vmlist; count tmp; tmp = tmp-next) { - vaddr = (char *) tmp-addr; - if (addr = vaddr + tmp-size - PAGE_SIZE) + spin_lock(vmap_area_lock); + list_for_each_entry(va, vmap_area_list, list) { + if (!count) + break; + + if (!(va-flags VM_VM_AREA)) + continue; + + vm = va-vm; + vaddr = (char *) vm-addr; + if (addr = vaddr + vm-size - PAGE_SIZE) continue; while (addr vaddr) { if (count == 0) @@ -2031,10 +2039,10 @@ long vread(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + tmp-size - PAGE_SIZE - addr; + n = vaddr + vm-size - PAGE_SIZE - addr; if (n count) n = count; - if (!(tmp-flags VM_IOREMAP)) + if (!(vm-flags VM_IOREMAP)) aligned_vread(buf, addr, n); else /* IOREMAP area is treated as memory hole */ memset(buf, 0, n); @@ -2043,7 +2051,7 @@ long vread(char *buf, char *addr, unsigned long count) count -= n; } finished: - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); if (buf == buf_start) return 0; @@ -2082,7 +2090,8 @@ finished: long vwrite(char *buf, char *addr, unsigned long count) { - struct vm_struct *tmp; + struct vmap_area *va; + struct vm_struct *vm; char *vaddr; unsigned long n, buflen; int copied = 0; @@ -2092,10 +2101,17 @@ long vwrite(char *buf, char *addr, unsigned long count) count = -(unsigned long) addr; buflen = count; - read_lock(vmlist_lock); - for (tmp = vmlist; count tmp; tmp = tmp-next) { - vaddr = (char *) tmp-addr; - if (addr = vaddr + tmp-size - PAGE_SIZE) + spin_lock(vmap_area_lock); + list_for_each_entry(va, vmap_area_list, list) { + if (!count) + break; + + if (!(va-flags VM_VM_AREA)) + continue; + + vm = va-vm; + vaddr = (char *) vm-addr; + if (addr = vaddr + vm-size - PAGE_SIZE) continue; while (addr vaddr) { if (count == 0) @@ -2104,10 +2120,10 @@ long vwrite(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + tmp-size - PAGE_SIZE - addr; + n = vaddr + vm-size - PAGE_SIZE - addr; if (n count) n = count; - if (!(tmp-flags VM_IOREMAP)) { + if (!(vm-flags VM_IOREMAP)) { aligned_vwrite(buf, addr, n); copied++; } @@ -2116,7 +2132,7 @@ long vwrite(char *buf, char *addr, unsigned long count) count -= n; } finished: - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); if (!copied) return 0; return buflen; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More 
majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/8] mm, vmalloc: change iterating a vmlist to find_vm_area()
The purpose of iterating a vmlist is finding vm area with specific virtual address. find_vm_area() is provided for this purpose and more efficient, because it uses a rbtree. So change it. Cc: Chris Metcalf cmetc...@tilera.com Cc: Guan Xuetao g...@mprc.pku.edu.cn Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c index de0de0c..862782d 100644 --- a/arch/tile/mm/pgtable.c +++ b/arch/tile/mm/pgtable.c @@ -592,12 +592,7 @@ void iounmap(volatile void __iomem *addr_in) in parallel. Reuse of the virtual address is prevented by leaving it in the global lists until we're done with it. cpa takes care of the direct mappings. */ - read_lock(vmlist_lock); - for (p = vmlist; p; p = p-next) { - if (p-addr == addr) - break; - } - read_unlock(vmlist_lock); + p = find_vm_area((void *)addr); if (!p) { pr_err(iounmap: bad address %p\n, addr); diff --git a/arch/unicore32/mm/ioremap.c b/arch/unicore32/mm/ioremap.c index b7a6055..13068ee 100644 --- a/arch/unicore32/mm/ioremap.c +++ b/arch/unicore32/mm/ioremap.c @@ -235,7 +235,7 @@ EXPORT_SYMBOL(__uc32_ioremap_cached); void __uc32_iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct **p, *tmp; + struct vm_struct *vm; /* * If this is a section based mapping we need to handle it @@ -244,17 +244,10 @@ void __uc32_iounmap(volatile void __iomem *io_addr) * all the mappings before the area can be reclaimed * by someone else. */ - write_lock(vmlist_lock); - for (p = vmlist ; (tmp = *p) ; p = tmp-next) { - if ((tmp-flags VM_IOREMAP) (tmp-addr == addr)) { - if (tmp-flags VM_UNICORE_SECTION_MAPPING) { - unmap_area_sections((unsigned long)tmp-addr, - tmp-size); - } - break; - } - } - write_unlock(vmlist_lock); + vm = find_vm_area(addr); + if (vm (vm-flags VM_IOREMAP) + (vm-flags VM_UNICORE_SECTION_MAPPING)) + unmap_area_sections((unsigned long)vm-addr, vm-size); vunmap(addr); } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 78fe3f1..9a1e658 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -282,12 +282,7 @@ void iounmap(volatile void __iomem *addr) in parallel. Reuse of the virtual address is prevented by leaving it in the global lists until we're done with it. cpa takes care of the direct mappings. */ - read_lock(vmlist_lock); - for (p = vmlist; p; p = p-next) { - if (p-addr == (void __force *)addr) - break; - } - read_unlock(vmlist_lock); + p = find_vm_area((void __force *)addr); if (!p) { printk(KERN_ERR iounmap: bad address %p\n, addr); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 3/8] mm, vmalloc: protect va-vm by vmap_area_lock
Inserting and removing an entry to vmlist is linear time complexity, so it is inefficient. Following patches will try to remove vmlist entirely. This patch is preparing step for it. For removing vmlist, iterating vmlist codes should be changed to iterating a vmap_area_list. Before implementing that, we should make sure that when we iterate a vmap_area_list, accessing to va-vm doesn't cause a race condition. This patch ensure that when iterating a vmap_area_list, there is no race condition for accessing to vm_struct. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 16147bc..a0b85a6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1290,12 +1290,14 @@ struct vm_struct *vmlist; static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { + spin_lock(vmap_area_lock); vm-flags = flags; vm-addr = (void *)va-va_start; vm-size = va-va_end - va-va_start; vm-caller = caller; va-vm = vm; va-flags |= VM_VM_AREA; + spin_unlock(vmap_area_lock); } static void insert_vmalloc_vmlist(struct vm_struct *vm) @@ -1446,6 +1448,11 @@ struct vm_struct *remove_vm_area(const void *addr) if (va va-flags VM_VM_AREA) { struct vm_struct *vm = va-vm; + spin_lock(vmap_area_lock); + va-vm = NULL; + va-flags = ~VM_VM_AREA; + spin_unlock(vmap_area_lock); + if (!(vm-flags VM_UNLIST)) { struct vm_struct *tmp, **p; /* -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
2012/11/28 Joonsoo Kim js1...@gmail.com:
> In the current implementation, we use an ARM-specific flag, VM_ARM_STATIC_MAPPING, for distinguishing the ARM-specific static mapped area. The purpose of the static mapped area is to re-use it when the entire physical address range of an ioremap request can be covered by this area.
>
> This implementation causes needless overhead in some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas to decide whether each one matches or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap we currently check 300 areas.
>
> If we construct an extra list for static mapped areas, we can eliminate the above-mentioned overhead. With an extra list, if there is one static mapped area, we just check that one area and proceed to the next operation quickly.
>
> In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is a better idea.
>
> Another reason for doing this work is to remove the architecture dependency on the vmalloc layer. I think that vmlist and vmlist_lock are internal data structures of the vmalloc layer. Some code for debugging and statistics inevitably uses vmlist and vmlist_lock, but it is preferable that they are used as little as possible outside of vmalloc.c.
>
> Changelog
> v1->v2: [2/3]: patch description is improved.
>         Rebased on v3.7-rc7
>
> Joonsoo Kim (3):
>   ARM: vmregion: remove vmregion code entirely
>   ARM: static_vm: introduce an infrastructure for static mapped area
>   ARM: mm: use static_vm for managing static mapped areas
>
>  arch/arm/include/asm/mach/static_vm.h |  51
>  arch/arm/mm/Makefile                  |   2 +-
>  arch/arm/mm/ioremap.c                 |  69 ---
>  arch/arm/mm/mm.h                      |  10 --
>  arch/arm/mm/mmu.c                     |  55 +
>  arch/arm/mm/static_vm.c               |  97
>  arch/arm/mm/vmregion.c                | 205 -
>  arch/arm/mm/vmregion.h                |  31 -
>  8 files changed, 208 insertions(+), 312 deletions(-)
>  create mode 100644 arch/arm/include/asm/mach/static_vm.h
>  create mode 100644 arch/arm/mm/static_vm.c
>  delete mode 100644 arch/arm/mm/vmregion.c
>  delete mode 100644 arch/arm/mm/vmregion.h
>
> -- 
> 1.7.9.5

Hello, Russell.

Could you review this patchset, please? I sent another patchset to the mm community on top of this one. That patchset is related to this one, so I would like to get a review of this patchset first :)

Thanks.
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Andrew.

2012/12/7 Andrew Morton a...@linux-foundation.org:
> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>> This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).
>>
>> And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.
>>
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.
>>
>> And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.
>>
>> This series is based on v3.7-rc7 and on top of a submitted patchset for ARM, 'introduce static_vm for ARM-specific static mapped area' (https://lkml.org/lkml/2012/11/27/356), but it runs properly on x86 without the ARM patchset.
>
> This all looks rather nice, but not mergeable into anything at this stage in the release cycle.
>
> What are the implications of "on top of submitted patchset for ARM"? Does it depend on the ARM patches in any way, or is it independently mergeable and testable?

Yes, it depends on the ARM patches. There is code that manipulates the vmlist in ARM, so without the ARM patches applied, this patchset causes a compile error for ARM. But the build for x86 works fine with this patchset :)

In the ARM patches, the method used for removing vmlist-related code is the same as in 1/8 of this patchset. But they include some ARM-specific optimization, so I sent them separately. If they can't be accepted, I can rework the ARM patches to be like 1/8 of this patchset.
Re: [RFC PATCH 0/8] remove vm_struct list management
2012/12/7 Andrew Morton a...@linux-foundation.org:
> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure. And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'.
>
> Yes, this is irritating. Perhaps Vivek or one of the other kexec people could take a look at this please - it would obviously be much better if we can avoid merging [patch 7/8] at all.

I'm not sure, but I'm almost sure that [patch 7/8] has no problem. In kexec.c, we write the address of vmlist and the offset of vm_struct's address field. This implies that the user of this information doesn't have any other information about vm_struct and cannot use other fields of vm_struct; they can use *only* the address field. So keeping just one vm_struct in vmlist, representing the first area of the vmalloc layer, should be safe. But the kexec people would be very helpful in validating this patch.

Thanks.
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Bob.

2012/12/7 Bob Liu lliu...@gmail.com:
> Hi Joonsoo,
>
> On Fri, Dec 7, 2012 at 12:09 AM, Joonsoo Kim js1...@gmail.com wrote:
>> This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).
>>
>> And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.
>>
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.
>>
>> And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.
>
> Nice work! I also thought about this several weeks ago, but I think the efficiency may be a problem. As you know, two locks (vmap_area_lock and vmlist_lock) are used currently so that some work may be done in parallel (not proved). If vmlist is removed, I'm afraid vmap_area_lock will become a bottleneck, which will reduce the efficiency.

Thanks for the comment!

Yes, there are some places where work may be done in parallel. For example, accesses to '/proc/meminfo', '/proc/vmallocinfo' and '/proc/kcore' may be done in parallel. But access to these is not the main functionality of the vmalloc layer. Optimizing the main functions like vmalloc and vfree is preferable to optimizing the above, and this patchset optimizes the main functions by removing the vmlist iteration.

Thanks.
Re: [RFC PATCH 1/8] mm, vmalloc: change iterating a vmlist to find_vm_area()
Hello, Pekka.

2012/12/7 Pekka Enberg penb...@kernel.org:
> On Thu, Dec 6, 2012 at 6:09 PM, Joonsoo Kim js1...@gmail.com wrote:
>> The purpose of iterating a vmlist is finding the vm area with a specific virtual address. find_vm_area() is provided for this purpose and is more efficient, because it uses an rbtree. So change it.
>
> You no longer take the 'vmlist_lock'. This is safe, because...?

As Bob mentioned, find_vm_area() holds 'vmap_area_lock' while searching for an area. While we hold 'vmap_area_lock', the area can't be removed, so this change is safe.

Thanks.
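For reference, a simplified sketch of what find_vm_area() looks like in mm/vmalloc.c of this era is shown below (not a verbatim copy); the rbtree walk inside find_vmap_area() runs entirely under vmap_area_lock, which is why the callers no longer need vmlist_lock.

```c
/* Simplified sketch of the lookup that replaces the vmlist walk. */
struct vm_struct *find_vm_area(const void *addr)
{
	struct vmap_area *va;

	/* find_vmap_area() takes vmap_area_lock around the rbtree lookup */
	va = find_vmap_area((unsigned long)addr);
	if (va && va->flags & VM_VM_AREA)
		return va->vm;

	return NULL;
}
```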
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Vivek.

2012/12/7 Vivek Goyal vgo...@redhat.com:
> On Fri, Dec 07, 2012 at 10:16:55PM +0900, JoonSoo Kim wrote:
>> 2012/12/7 Andrew Morton a...@linux-foundation.org:
>>> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>>>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure. And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'.
>>>
>>> Yes, this is irritating. Perhaps Vivek or one of the other kexec people could take a look at this please - it would obviously be much better if we can avoid merging [patch 7/8] at all.
>>
>> I'm not sure, but I'm almost sure that [patch 7/8] has no problem. In kexec.c, we write the address of vmlist and the offset of vm_struct's address field. This implies that the user of this information doesn't have any other information about vm_struct and cannot use other fields of vm_struct; they can use *only* the address field. So keeping just one vm_struct in vmlist, representing the first area of the vmalloc layer, should be safe.
>
> I browsed through the makedumpfile source quickly. So yes, it does look like we look at the first vmlist element's ->addr field to figure out where the vmalloc area starts.
>
> Can we get the same information from this rb-tree of vmap_area? Is the ->va_start field communicating the same information that vmlist was communicating? What's the difference between vmap_area_root and vmlist?

Thanks for the comment.

Yes, vmap_area's va_start field represents the same information as vm_struct's addr. vmap_area_root is a data structure for fast searching of an area. vmap_area_list is an address-sorted list, so we can use it like the vmlist.

There is a little difference between vmap_area_list and vmlist. vmlist lacks information about some areas in the vmalloc address space. For example, vm_map_ram() allocates an area in the vmalloc address space, but it doesn't make a link with vmlist. To provide full information about the vmalloc address space, using vmap_area_list is more adequate.

> So, without knowing the details of both data structures, I think that if vmlist is going away, then user space tools should be able to traverse the vmap_area_root rb tree. I am assuming it is sorted using the ->addr field and we should be able to get the vmalloc area start from there. It will just be a matter of exporting the right fields to user space (instead of vmlist).

There is an address-sorted list of vmap_area, vmap_area_list. So we can use it for traversing vmalloc areas if that is necessary. But, as I mentioned before, kexec writes *just* the address of vmlist and the offset of vm_struct's address field. This implies that userspace doesn't traverse vmlist: vm_struct's next field, which would be needed for traversing, is not exported, and without it there is no way to traverse. So, IMHO, assigning a dummy vm_struct to vmlist, as implemented in [7/8], is a safe way to maintain compatibility with the userspace tool. :)

Thanks.
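The kexec-side export being discussed is, to the best of my reading, the pair of lines below from crash_save_vmcoreinfo_init() in kernel/kexec.c of this era (shown as a sketch, not verified against a specific tree). Only the symbol address of vmlist and the offset of vm_struct.addr are exported, which is why a single dummy vm_struct whose addr points at the first vmalloc area keeps makedumpfile working.

```c
/* Sketch of the existing vmcoreinfo export that makedumpfile consumes. */
VMCOREINFO_SYMBOL(vmlist);		/* address of the vmlist head */
VMCOREINFO_OFFSET(vm_struct, addr);	/* offset of vm_struct.addr */
```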
[PATCH v3 1/2] scripts/tags.sh: Support subarch for ARM
Current tags.sh doesn't handle subarch for ARM. There are too many subarch on ARM, it is hard that we locate some functions which are defined in every subarch with tags util family. Therefore support subarch for removing this unconvenience. We can use ARM subarch functionality like below. make cscope O=. SRCARCH=arm SUBARCH=xxx Signed-off-by: Joonsoo Kim js1...@gmail.com --- v2: change bash specific '[[]]' to 'case in' statement. v3: quote the patterns. diff --git a/scripts/tags.sh b/scripts/tags.sh index 79fdafb..8fb18d1 100755 --- a/scripts/tags.sh +++ b/scripts/tags.sh @@ -48,13 +48,14 @@ find_arch_sources() for i in $archincludedir; do prune=$prune -wholename $i -prune -o done - find ${tree}arch/$1 $ignore $prune -name $2 -print; + find ${tree}arch/$1 $ignore $subarchprune $prune -name $2 -print; } # find sources in arch/$1/include find_arch_include_sources() { - include=$(find ${tree}arch/$1/ -name include -type d); + include=$(find ${tree}arch/$1/ $subarchprune \ + -name include -type d -print); if [ -n $include ]; then archincludedir=$archincludedir $include find $include $ignore -name $2 -print; @@ -234,6 +235,21 @@ if [ ${ARCH} = um ]; then else archinclude=${SUBARCH} fi +elif [ ${SRCARCH} = arm -a ${SUBARCH} != ]; then + subarchdir=$(find ${tree}arch/$SRCARCH/ -name mach-* -type d -o \ + -name plat-* -type d); + for i in $subarchdir; do + case $i in + *mach-${SUBARCH}) + ;; + *plat-${SUBARCH}) + ;; + *) + subarchprune=$subarchprune \ + -wholename $i -prune -o + ;; + esac + done fi remove_structs= -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 2/2] scripts/tags.sh: Support compiled source
We usually have interst in compiled files only, because they are strongly related to individual's work. Current tags.sh can't select compiled files, so support it. We can use this functionality like below. make cscope O=. SRCARCH= COMPILED_SOURCE=compiled It must be executed after building the kernel. Signed-off-by: Joonsoo Kim js1...@gmail.com --- v2: change bash specific '[[]]' to 'case in' statement. use COMPILED_SOURCE env var, instead of abusing SUBARCH v3: change [ $COMPILED_SOURCE=compiled ] to [ -n $COMPILED_SOURCE ] diff --git a/scripts/tags.sh b/scripts/tags.sh index 8fb18d1..08f06c0 100755 --- a/scripts/tags.sh +++ b/scripts/tags.sh @@ -96,6 +96,32 @@ all_sources() find_other_sources '*.[chS]' } +all_compiled_sources() +{ + for i in $(all_sources); do + case $i in + *.[cS]) + j=${i/\.[cS]/\.o} + if [ -e $j ]; then + echo $i + fi + ;; + *) + echo $i + ;; + esac + done +} + +all_target_sources() +{ + if [ -n $COMPILED_SOURCE ]; then + all_compiled_sources + else + all_sources + fi +} + all_kconfigs() { for arch in $ALLSOURCE_ARCHS; do @@ -111,18 +137,18 @@ all_defconfigs() docscope() { - (echo \-k; echo \-q; all_sources) cscope.files + (echo \-k; echo \-q; all_target_sources) cscope.files cscope -b -f cscope.out } dogtags() { - all_sources | gtags -i -f - + all_target_sources | gtags -i -f - } exuberant() { - all_sources | xargs $1 -a \ + all_target_sources | xargs $1 -a\ -I __initdata,__exitdata,__acquires,__releases \ -I __read_mostly,cacheline_aligned \ -I cacheline_aligned_in_smp \ @@ -174,7 +200,7 @@ exuberant() emacs() { - all_sources | xargs $1 -a \ + all_target_sources | xargs $1 -a\ --regex='/^(ENTRY|_GLOBAL)(\([^)]*\)).*/\2/'\ --regex='/^SYSCALL_DEFINE[0-9]?(\([^,)]*\).*/sys_\1/' \ --regex='/^TRACE_EVENT(\([^,)]*\).*/trace_\1/' \ @@ -221,11 +247,10 @@ xtags() elif $1 --version 21 | grep -iq emacs; then emacs $1 else - all_sources | xargs $1 -a + all_target_sources | xargs $1 -a fi } - # Support um (which uses SUBARCH) if [ ${ARCH} = um ]; then if [ $SUBARCH = i386 ]; then -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
Hello, Andrew. 2012/11/20 Minchan Kim minc...@kernel.org: Hi Joonsoo, Sorry for the delay. On Thu, Nov 15, 2012 at 02:09:04AM +0900, JoonSoo Kim wrote: Hi, Minchan. 2012/11/14 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote: 2012/11/13 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote: 2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. 
Assume last_pkmap = 20 and indexes 1-9 and 12-19 are kmapped, while 10 and 11 are kunmapped. do kmap_flush_unused() => flush index 10, 11 => last_pkmap = 10; do kunmap() with index 17; do kmap_flush_unused() => flush index 17 => last_pkmap = 17? In this case, an unconditional last_pkmap_nr update skips one kunmapped index. So, a conditional update is needed. Thanks for pointing out, Joonsoo. You're right. I misunderstood your flush_all_zero_pkmaps change. With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index. What's the benefit of returning the first flushed free slot index rather than the free slot index? If flush_all_zero_pkmaps
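For readability, since the quoting above drops some operators, the conditional update being discussed amounts to the following sketch (lock_kmap()/unlock_kmap() are the existing helpers in mm/highmem.c and PKMAP_INVALID_INDEX is the sentinel added by the patch):

void kmap_flush_unused(void)
{
	unsigned int index;

	lock_kmap();
	index = flush_all_zero_pkmaps();
	/* Only move last_pkmap_nr backwards: a flushed index above
	 * last_pkmap_nr says nothing about slots below it, as the
	 * 20-slot example above shows. */
	if (index != PKMAP_INVALID_INDEX && index < last_pkmap_nr)
		last_pkmap_nr = index;
	unlock_kmap();
}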
[PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Changelog v1-v2: [2/3]: patch description is improved. Rebased on v3.7-rc7 Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/3] ARM: vmregion: remove vmregion code entirely
Now, there is no user for vmregion. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 8a9c4cb..4e333fa 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o vmregion.o + mmap.o pgd.o mmu.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c deleted file mode 100644 index a631016..000 --- a/arch/arm/mm/vmregion.c +++ /dev/null @@ -1,205 +0,0 @@ -#include linux/fs.h -#include linux/spinlock.h -#include linux/list.h -#include linux/proc_fs.h -#include linux/seq_file.h -#include linux/slab.h - -#include vmregion.h - -/* - * VM region handling support. - * - * This should become something generic, handling VM region allocations for - * vmalloc and similar (ioremap, module space, etc). - * - * I envisage vmalloc()'s supporting vm_struct becoming: - * - * struct vm_struct { - *struct vmregion region; - *unsigned longflags; - *struct page **pages; - *unsigned int nr_pages; - *unsigned longphys_addr; - * }; - * - * get_vm_area() would then call vmregion_alloc with an appropriate - * struct vmregion head (eg): - * - * struct vmregion vmalloc_head = { - * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list), - * .vm_start = VMALLOC_START, - * .vm_end = VMALLOC_END, - * }; - * - * However, vmalloc_head.vm_start is variable (typically, it is dependent on - * the amount of RAM found at boot time.) I would imagine that get_vm_area() - * would have to initialise this each time prior to calling vmregion_alloc(). - */ - -struct arm_vmregion * -arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align, - size_t size, gfp_t gfp, const void *caller) -{ - unsigned long start = head-vm_start, addr = head-vm_end; - unsigned long flags; - struct arm_vmregion *c, *new; - - if (head-vm_end - head-vm_start size) { - printk(KERN_WARNING %s: allocation too big (requested %#x)\n, - __func__, size); - goto out; - } - - new = kmalloc(sizeof(struct arm_vmregion), gfp); - if (!new) - goto out; - - new-caller = caller; - - spin_lock_irqsave(head-vm_lock, flags); - - addr = rounddown(addr - size, align); - list_for_each_entry_reverse(c, head-vm_list, vm_list) { - if (addr = c-vm_end) - goto found; - addr = rounddown(c-vm_start - size, align); - if (addr start) - goto nospc; - } - - found: - /* -* Insert this entry after the one we found. 
-*/ - list_add(new-vm_list, c-vm_list); - new-vm_start = addr; - new-vm_end = addr + size; - new-vm_active = 1; - - spin_unlock_irqrestore(head-vm_lock, flags); - return new; - - nospc: - spin_unlock_irqrestore(head-vm_lock, flags); - kfree(new); - out: - return NULL; -} - -static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - - list_for_each_entry(c, head-vm_list, vm_list) { - if (c-vm_active c-vm_start == addr) - goto out; - } - c = NULL; - out: - return c; -} - -struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - if (c) - c-vm_active = 0; - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c) -{ - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - list_del(c-vm_list); - spin_unlock_irqrestore(head-vm_lock, flags); - - kfree(c); -} - -#ifdef CONFIG_PROC_FS -static int arm_vmregion_show(struct seq_file *m, void *p) -{ - struct arm_vmregion *c = list_entry(p, struct arm_vmregion, vm_list); - - seq_printf(m, 0x%08lx-0x%08lx %7lu, c-vm_start, c-vm_end, - c-vm_end - c-vm_start); - if (c-caller
[PATCH v2 2/3] ARM: static_vm: introduce an infrastructure for static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Now, I introduce an ARM-specific infrastructure for static mapped area. In the following patch, we will use this and resolve above mentioned problem. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h new file mode 100644 index 000..1bb6604 --- /dev/null +++ b/arch/arm/include/asm/mach/static_vm.h @@ -0,0 +1,45 @@ +/* + * arch/arm/include/asm/mach/static_vm.h + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_MACH_STATIC_VM_H +#define _ASM_MACH_STATIC_VM_H + +#include linux/types.h +#include linux/vmalloc.h + +struct static_vm { + struct static_vm*next; + void*vaddr; + unsigned long size; + unsigned long flags; + phys_addr_t paddr; + const void *caller; +}; + +extern struct static_vm *static_vmlist; +extern spinlock_t static_vmlist_lock; + +extern struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags); +extern struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags); +extern void init_static_vm(struct static_vm *static_vm, + struct vm_struct *vm, unsigned long flags); +extern void insert_static_vm(struct static_vm *vm); + +#endif /* _ASM_MACH_STATIC_VM_H */ diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 4e333fa..57b329a 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o + mmap.o pgd.o mmu.o static_vm.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/static_vm.c b/arch/arm/mm/static_vm.c new file mode 100644 index 000..d7677cf --- /dev/null +++ b/arch/arm/mm/static_vm.c @@ -0,0 +1,97 @@ +/* + * arch/arm/mm/static_vm.c + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/spinlock.h + +#include asm/mach/static_vm.h + +struct static_vm *static_vmlist
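The body of static_vm.c is cut off above. As a rough sketch only (reconstructed from the header declarations, not the author's exact code), the vaddr lookup helper would walk the singly linked static_vmlist under static_vmlist_lock:

struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags)
{
	struct static_vm *area;
	unsigned long addr = (unsigned long)vaddr;
	unsigned long start;

	spin_lock(&static_vmlist_lock);
	for (area = static_vmlist; area; area = area->next) {
		if ((area->flags & flags) != flags)
			continue;

		start = (unsigned long)area->vaddr;
		/* static mappings do not overlap, so the first hit wins */
		if (addr >= start && addr < start + area->size) {
			spin_unlock(&static_vmlist_lock);
			return area;
		}
	}
	spin_unlock(&static_vmlist_lock);

	return NULL;
}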
[PATCH v2 3/3] ARM: mm: use static_vm for managing static mapped areas
A static mapped area is ARM-specific, so it is better not to use generic vmalloc data structure, that is, vmlist and vmlist_lock for managing static mapped area. And it causes some needless overhead and reducing this overhead is better idea. Now, we have newly introduced static_vm infrastructure. With it, we don't need to iterate all mapped areas. Instead, we just iterate static mapped areas. It helps to reduce an overhead of finding matched area. And architecture dependency on vmalloc layer is removed, so it will help to maintainability for vmalloc layer. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h index 1bb6604..0d9c685 100644 --- a/arch/arm/include/asm/mach/static_vm.h +++ b/arch/arm/include/asm/mach/static_vm.h @@ -32,6 +32,12 @@ struct static_vm { const void *caller; }; +#define STATIC_VM_MEM 0x0001 +#define STATIC_VM_EMPTY0x0002 + +/* mtype should be less than 28 */ +#define STATIC_VM_MTYPE(mt)(1UL ((mt) + 4)) + extern struct static_vm *static_vmlist; extern spinlock_t static_vmlist_lock; diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c index 5dcc2fd..b7f3c27 100644 --- a/arch/arm/mm/ioremap.c +++ b/arch/arm/mm/ioremap.c @@ -36,6 +36,7 @@ #include asm/system_info.h #include asm/mach/map.h +#include asm/mach/static_vm.h #include asm/mach/pci.h #include mm.h @@ -197,7 +198,8 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, const struct mem_type *type; int err; unsigned long addr; - struct vm_struct * area; + struct vm_struct *area; + phys_addr_t paddr = __pfn_to_phys(pfn); #ifndef CONFIG_ARM_LPAE /* @@ -219,24 +221,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, /* * Try to reuse one of the static mapping whenever possible. 
*/ - read_lock(vmlist_lock); - for (area = vmlist; area; area = area-next) { - if (!size || (sizeof(phys_addr_t) == 4 pfn = 0x10)) - break; - if (!(area-flags VM_ARM_STATIC_MAPPING)) - continue; - if ((area-flags VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype)) - continue; - if (__phys_to_pfn(area-phys_addr) pfn || - __pfn_to_phys(pfn) + size-1 area-phys_addr + area-size-1) - continue; - /* we can drop the lock here as we know *area is static */ - read_unlock(vmlist_lock); - addr = (unsigned long)area-addr; - addr += __pfn_to_phys(pfn) - area-phys_addr; - return (void __iomem *) (offset + addr); + if (size !((sizeof(phys_addr_t) == 4 pfn = 0x10))) { + struct static_vm *static_vm; + + static_vm = find_static_vm_paddr(__pfn_to_phys(pfn), size, + STATIC_VM_MEM | STATIC_VM_MTYPE(mtype)); + if (static_vm) { + addr = (unsigned long)static_vm-vaddr; + addr += paddr - static_vm-paddr; + return (void __iomem *) (offset + addr); + } } - read_unlock(vmlist_lock); /* * Don't allow RAM to be mapped - this causes problems with ARMv6+ @@ -248,7 +243,7 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, if (!area) return NULL; addr = (unsigned long)area-addr; - area-phys_addr = __pfn_to_phys(pfn); + area-phys_addr = paddr; #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) if (DOMAIN_IO == 0 @@ -346,34 +341,20 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, bool cached) void __iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct *vm; - - read_lock(vmlist_lock); - for (vm = vmlist; vm; vm = vm-next) { - if (vm-addr addr) - break; - if (!(vm-flags VM_IOREMAP)) - continue; - /* If this is a static mapping we must leave it alone */ - if ((vm-flags VM_ARM_STATIC_MAPPING) - (vm-addr = addr) (vm-addr + vm-size addr)) { - read_unlock(vmlist_lock); - return; - } + struct static_vm *static_vm; + + static_vm = find_static_vm_vaddr(addr, STATIC_VM_MEM); + if (static_vm) + return; + #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) - /* -* If this is a section based mapping we need to handle it -* specially as the VM subsystem does not know how to handle -* such a beast. -*/ - if ((vm-addr == addr
[PATCH] slub: assign refcount for kmalloc_caches
commit cce89f4f6911286500cf7be0363f46c9b0a12ce0('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes errornous situation. Paul Hargrove report that when he create a 8-byte kmem_cache and destory it, he encounter below message. 'Objects remaining in kmalloc-8 on kmem_cache_close()' 8-byte kmem_cache merge with 8-byte kmalloc cache and refcount is increased by one. So, resulting refcount is 1. When destory it, it hit refcount = 0, then kmem_cache_close() is executed and error message is printed. This patch assign initial refcount 1 to kmalloc_caches, so fix this errornous situtation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s-refcount = 1; list_add(s-list, slab_caches); return s; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
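A minimal reproduction of the reported scenario, e.g. from a test module (illustrative only; the cache name is arbitrary): the 8-byte cache is merged with kmalloc-8, whose refcount was left at 0, so destroying the alias reaches refcount == 0 and closes kmalloc-8.

static int __init repro_init(void)
{
	struct kmem_cache *s;

	s = kmem_cache_create("repro-8", 8, 0, 0, NULL);
	if (!s)
		return -ENOMEM;
	/* with the bug, this prints "Objects remaining in kmalloc-8 ..." */
	kmem_cache_destroy(s);
	return 0;
}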
Re: [PATCH] slub: assign refcount for kmalloc_caches
2012/12/26 Joonsoo Kim js1...@gmail.com: commit cce89f4f6911286500cf7be0363f46c9b0a12ce0('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes errornous situation. Paul Hargrove report that when he create a 8-byte kmem_cache and destory it, he encounter below message. 'Objects remaining in kmalloc-8 on kmem_cache_close()' 8-byte kmem_cache merge with 8-byte kmalloc cache and refcount is increased by one. So, resulting refcount is 1. When destory it, it hit refcount = 0, then kmem_cache_close() is executed and error message is printed. This patch assign initial refcount 1 to kmalloc_caches, so fix this errornous situtation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s-refcount = 1; list_add(s-list, slab_caches); return s; -- 1.7.9.5 I missed some explanation. In v3.8-rc1, this problem is already solved. See create_kmalloc_cache() in mm/slab_common.c. So this patch is just for v3.7 stable. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] mm, bootmem: panic in bootmem alloc functions even if slab is available
Hello, Sasha. 2012/12/28 Sasha Levin sasha.le...@oracle.com: On 12/27/2012 06:04 PM, David Rientjes wrote: On Thu, 27 Dec 2012, Sasha Levin wrote: That's exactly what happens with the patch. Note that in the current upstream version there are several slab checks scattered all over. In this case for example, I'm removing it from __alloc_bootmem_node(), but the first code line of__alloc_bootmem_node_nopanic() is: if (WARN_ON_ONCE(slab_is_available())) return kzalloc(size, GFP_NOWAIT); You're only talking about mm/bootmem.c and not mm/nobootmem.c, and notice that __alloc_bootmem_node() does not call __alloc_bootmem_node_nopanic(), it calls ___alloc_bootmem_node_nopanic(). Holy cow, this is an underscore hell. Thanks, Sasha I have a different idea. How about removing fallback allocation in bootmem.c completely? I don't know why it is there exactly. But, warning for 'slab_is_available()' is there for a long time. So, most people who misuse fallback allocation change their code adequately. I think that removing fallback at this time is valid. Isn't it? Fallback allocation may cause possible bug. If someone free a memory from fallback allocation, it can't be handled properly. So, IMHO, at this time, we should remove fallback allocation in bootmem.c entirely. Please let me know what I misunderstand. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
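To illustrate the mismatch described above (a sketch only; the call sequence is hypothetical, the function names are the real bootmem APIs): when the nopanic variants silently fall back to the slab allocator, a caller that later releases the memory through the bootmem interface frees memory the slab allocator actually owns.

	void *p = __alloc_bootmem_node_nopanic(pgdat, size, align, goal);

	/* ... later, during teardown ... */
	free_bootmem(__pa(p), size);	/* wrong if p actually came from kzalloc() */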
[PATCH] x86, reboot: skip reboot_fixups in early boot phase
During the early boot phase, the PCI bus subsystem is not yet initialized. If a panic occurs in the early boot phase and panic_timeout is set, the code flow goes into emergency_restart() and hits mach_reboot_fixups(), then encounters another panic. On the second panic, we can't hold the panic_lock, so the code flow goes into panic_smp_self_stop(), which prevents the system from restarting. To avoid the second panic, skip reboot_fixups in the early boot phase. This makes panic_timeout work in the early boot phase. Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/x86/kernel/reboot_fixups_32.c b/arch/x86/kernel/reboot_fixups_32.c index c8e41e9..b9b8ec9 100644 --- a/arch/x86/kernel/reboot_fixups_32.c +++ b/arch/x86/kernel/reboot_fixups_32.c @@ -89,6 +89,10 @@ void mach_reboot_fixups(void) if (in_interrupt()) return; + /* During early boot phase, PCI is not yet initialized */ + if (system_state == SYSTEM_BOOTING) + return; + for (i=0; i < ARRAY_SIZE(fixups_table); i++) { cur = &(fixups_table[i]); dev = pci_get_device(cur->vendor, cur->device, NULL); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Atsushi. 2012/12/12 Atsushi Kumagai kumagai-atsu...@mxc.nes.nec.co.jp: Hello, On Tue, 11 Dec 2012 17:17:05 -0500 (EST) Dave Anderson ander...@redhat.com wrote: - Original Message - On Mon, Dec 10, 2012 at 11:40:47PM +0900, JoonSoo Kim wrote: [..] So without knowing details of both the data structures, I think if vmlist is going away, then user space tools should be able to traverse vmap_area_root rb tree. I am assuming it is sorted using -addr field and we should be able to get vmalloc area start from there. It will just be a matter of exporting right fields to user space (instead of vmlist). There is address sorted list of vmap_area, vmap_area_list. So we can use it for traversing vmalloc areas if it is necessary. But, as I mentioned before, kexec write *just* address of vmlist and offset of vm_struct's address field. It imply that they don't traverse vmlist, because they didn't write vm_struct's next field which is needed for traversing. Without vm_struct's next field, they have no method for traversing. So, IMHO, assigning dummy vm_struct to vmlist which is implemented by [7/8] is a safe way to maintain a compatibility of userspace tool. :) Actually the design of makedumpfile and crash tool is that they know about kernel data structures and they adopt to changes. So for major changes they keep track of kernel version numbers and if access the data structures accordingly. Currently we access first element of vmlist to determine start of vmalloc address. True we don't have to traverse the list. But as you mentioned we should be able to get same information by traversing to left most element of vmap_area_list rb tree. So I think instead of trying to retain vmlist first element just for backward compatibility, I will rather prefer get rid of that code completely from kernel and let user space tool traverse rbtree. Just export minimum needed info for traversal in user space. There's no need to traverse the rbtree. There is a vmap_area_list linked list of vmap_area structures that is also sorted by virtual address. All that makedumpfile would have to do is to access the first vmap_area in the vmap_area_list -- as opposed to the way that it does now, which is by accessing the first vm_struct in the to-be-obsoleted vmlist list. So it seems silly to keep the dummy vmlist around. I think so, I will modify makedumpfile to get the start address of vmalloc with vmap_area_list if the related symbols are provided as VMCOREINFO like vmlist. BTW, have we to consider other tools ? If it is clear, I think we can get rid of the dummy vmlist. Good! In next spin, I will remove dummy vmlist and export vmap_area_list symbol for makedumpfile. I don't know any other tools. If anyone knows it, please let me know. Thanks! Atsushi, Dave and Vivek. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
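As a sketch of the follow-up mentioned here (the exact symbols and offsets are assumptions about the next spin, not a merged patch), the VMCOREINFO note in kernel/kexec.c could carry enough for makedumpfile to find the first vmap_area:

	VMCOREINFO_SYMBOL(vmap_area_list);
	VMCOREINFO_STRUCT_SIZE(vmap_area);
	VMCOREINFO_OFFSET(vmap_area, va_start);
	VMCOREINFO_OFFSET(vmap_area, list);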
Re: [PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
2012/12/7 JoonSoo Kim js1...@gmail.com: 2012/11/28 Joonsoo Kim js1...@gmail.com: In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Changelog v1-v2: [2/3]: patch description is improved. Rebased on v3.7-rc7 Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 Hello, Russell. Could you review this patchset, please? I send another patchset to mm community on top of this. That one is related to this patchset, so I want to get a review about this patchset :) Thanks. Hello. Is there anyone to review this patchset? Please let me know what I should do in order to take a review :) Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] mm: introduce numa_zero_pfn
Currently, we use just *one* zero page regardless of user process' node. When user process read zero page, at first, cpu should load this to cpu cache. If node of cpu is not same as node of zero page, loading takes long time. If we make zero pages for each nodes and use them adequetly, we can reduce this overhead. This patch implement basic infrastructure for numa_zero_pfn. It is default disabled, because it doesn't provide page coloring and some architecture use page coloring for zero page. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/Kconfig b/mm/Kconfig index a3f8ddd..de0ab65 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -412,3 +412,8 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. + +config NUMA_ZERO_PFN + bool Enable NUMA-aware zero page handling + depends on NUMA + default n diff --git a/mm/memory.c b/mm/memory.c index 221fc9f..e7d3969 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -112,12 +112,43 @@ __setup(norandmaps, disable_randmaps); unsigned long zero_pfn __read_mostly; unsigned long highest_memmap_pfn __read_mostly; +#ifdef CONFIG_NUMA_ZERO_PFN +unsigned long node_to_zero_pfn[MAX_NUMNODES] __read_mostly; + +/* Should be called after zero_pfn initialization */ +static void __init init_numa_zero_pfn(void) +{ + unsigned int node; + + if (nr_node_ids == 1) + return; + + for_each_node_state(node, N_POSSIBLE) { + node_to_zero_pfn[node] = zero_pfn; + } + + for_each_node_state(node, N_HIGH_MEMORY) { + struct page *page; + page = alloc_pages_exact_node(node, + GFP_HIGHUSER | __GFP_ZERO, 0); + if (!page) + continue; + + node_to_zero_pfn[node] = page_to_pfn(page); + } +} +#else +static inline void __init init_numa_zero_pfn(void) {} +#endif + /* * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init() */ static int __init init_zero_pfn(void) { zero_pfn = page_to_pfn(ZERO_PAGE(0)); + init_numa_zero_pfn(); + return 0; } core_initcall(init_zero_pfn); @@ -717,6 +748,24 @@ static inline bool is_cow_mapping(vm_flags_t flags) return (flags (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; } +#ifdef CONFIG_NUMA_ZERO_PFN +static inline int is_numa_zero_pfn(unsigned long pfn) +{ + return zero_pfn == pfn || node_to_zero_pfn[pfn_to_nid(pfn)] == pfn; +} + +static inline unsigned long my_numa_zero_pfn(unsigned long addr) +{ + if (nr_node_ids == 1) + return zero_pfn; + + return node_to_zero_pfn[numa_node_id()]; +} + +#define is_zero_pfn is_numa_zero_pfn +#define my_zero_pfn my_numa_zero_pfn +#endif /* CONFIG_NUMA_ZERO_PFN */ + #ifndef is_zero_pfn static inline int is_zero_pfn(unsigned long pfn) { -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
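For context on why overriding my_zero_pfn() is sufficient: the anonymous read-fault path already goes through it, so with CONFIG_NUMA_ZERO_PFN the node-local zero page is installed transparently. The snippet below is the existing consumer in do_anonymous_page() (mm/memory.c), quoted from memory for illustration rather than taken from this patch:

	if (!(flags & FAULT_FLAG_WRITE)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
						vma->vm_page_prot));
		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (!pte_none(*page_table))
			goto unlock;
		goto setpte;
	}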
Re: [PATCH] mm: introduce numa_zero_pfn
2012/12/13 Andi Kleen a...@firstfloor.org: I would expect a processor to fetch the zero page cachelines from the l3 cache from other sockets avoiding memory transactions altogether. The zero page is likely in use somewhere so typically no memory accesses should occur in a system. It depends on how effectively the workload uses the caches. If something is a cache pig of the L3 cache, then even shareable cache lines may need to be refetched regularly. But if your workload spends a significant part of its time reading from zero page read only data there is something wrong with the workload. I would do some data profiling first to really prove that is the case. Okay. I didn't know about the L3 cache before. Now, I think that I need some data profiling! Thanks for the comment. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 0/3] introduce static_vm for ARM-specific static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. We unnecessarily iterate vmlist for finding matched area even if there is no static mapped area. And if there are some static mapped areas, iterating whole vmlist is not preferable. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as least as possible. In the near future, I will try to remove other architecture dependency on vmalloc layer. This is just RFC patch and I did compile-test only. If you have any good suggestion, please let me know. These are based on v3.7-rc5. Thanks. Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/3] ARM: vmregion: remove vmregion code entirely
Now, there is no user for vmregion. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 8a9c4cb..4e333fa 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o vmregion.o + mmap.o pgd.o mmu.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c deleted file mode 100644 index a631016..000 --- a/arch/arm/mm/vmregion.c +++ /dev/null @@ -1,205 +0,0 @@ -#include linux/fs.h -#include linux/spinlock.h -#include linux/list.h -#include linux/proc_fs.h -#include linux/seq_file.h -#include linux/slab.h - -#include vmregion.h - -/* - * VM region handling support. - * - * This should become something generic, handling VM region allocations for - * vmalloc and similar (ioremap, module space, etc). - * - * I envisage vmalloc()'s supporting vm_struct becoming: - * - * struct vm_struct { - *struct vmregion region; - *unsigned longflags; - *struct page **pages; - *unsigned int nr_pages; - *unsigned longphys_addr; - * }; - * - * get_vm_area() would then call vmregion_alloc with an appropriate - * struct vmregion head (eg): - * - * struct vmregion vmalloc_head = { - * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list), - * .vm_start = VMALLOC_START, - * .vm_end = VMALLOC_END, - * }; - * - * However, vmalloc_head.vm_start is variable (typically, it is dependent on - * the amount of RAM found at boot time.) I would imagine that get_vm_area() - * would have to initialise this each time prior to calling vmregion_alloc(). - */ - -struct arm_vmregion * -arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align, - size_t size, gfp_t gfp, const void *caller) -{ - unsigned long start = head-vm_start, addr = head-vm_end; - unsigned long flags; - struct arm_vmregion *c, *new; - - if (head-vm_end - head-vm_start size) { - printk(KERN_WARNING %s: allocation too big (requested %#x)\n, - __func__, size); - goto out; - } - - new = kmalloc(sizeof(struct arm_vmregion), gfp); - if (!new) - goto out; - - new-caller = caller; - - spin_lock_irqsave(head-vm_lock, flags); - - addr = rounddown(addr - size, align); - list_for_each_entry_reverse(c, head-vm_list, vm_list) { - if (addr = c-vm_end) - goto found; - addr = rounddown(c-vm_start - size, align); - if (addr start) - goto nospc; - } - - found: - /* -* Insert this entry after the one we found. 
-*/ - list_add(new-vm_list, c-vm_list); - new-vm_start = addr; - new-vm_end = addr + size; - new-vm_active = 1; - - spin_unlock_irqrestore(head-vm_lock, flags); - return new; - - nospc: - spin_unlock_irqrestore(head-vm_lock, flags); - kfree(new); - out: - return NULL; -} - -static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - - list_for_each_entry(c, head-vm_list, vm_list) { - if (c-vm_active c-vm_start == addr) - goto out; - } - c = NULL; - out: - return c; -} - -struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - if (c) - c-vm_active = 0; - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c) -{ - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - list_del(c-vm_list); - spin_unlock_irqrestore(head-vm_lock, flags); - - kfree(c); -} - -#ifdef CONFIG_PROC_FS -static int arm_vmregion_show(struct seq_file *m, void *p) -{ - struct arm_vmregion *c = list_entry(p, struct arm_vmregion, vm_list); - - seq_printf(m, 0x%08lx-0x%08lx %7lu, c-vm_start, c-vm_end, - c-vm_end - c-vm_start); - if (c-caller
[RFC PATCH 2/3] ARM: static_vm: introduce an infrastructure for static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. We unnecessarily iterate vmlist for finding matched area even if there is no static mapped area. And if there are some static mapped areas, iterating whole vmlist is not preferable. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as least as possible. Now, I introduce an ARM-specific infrastructure for static mapped area. In the following patch, we will use this and resolve above mentioned problem. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h new file mode 100644 index 000..1bb6604 --- /dev/null +++ b/arch/arm/include/asm/mach/static_vm.h @@ -0,0 +1,45 @@ +/* + * arch/arm/include/asm/mach/static_vm.h + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_MACH_STATIC_VM_H +#define _ASM_MACH_STATIC_VM_H + +#include linux/types.h +#include linux/vmalloc.h + +struct static_vm { + struct static_vm*next; + void*vaddr; + unsigned long size; + unsigned long flags; + phys_addr_t paddr; + const void *caller; +}; + +extern struct static_vm *static_vmlist; +extern spinlock_t static_vmlist_lock; + +extern struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags); +extern struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags); +extern void init_static_vm(struct static_vm *static_vm, + struct vm_struct *vm, unsigned long flags); +extern void insert_static_vm(struct static_vm *vm); + +#endif /* _ASM_MACH_STATIC_VM_H */ diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 4e333fa..57b329a 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o + mmap.o pgd.o mmu.o static_vm.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/static_vm.c b/arch/arm/mm/static_vm.c new file mode 100644 index 000..d7677cf --- /dev/null +++ b/arch/arm/mm/static_vm.c @@ -0,0 +1,97 @@ +/* + * arch/arm/mm/static_vm.c + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/spinlock.h + +#include asm/mach/static_vm.h + +struct static_vm *static_vmlist; +DEFINE_SPINLOCK(static_vmlist_lock); + +struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags) +{ + struct static_vm *area; + + spin_lock(static_vmlist_lock); + for (area = static_vmlist; area; area = area-next) { + if ((area-flags flags) != flags
[RFC PATCH 3/3] ARM: mm: use static_vm for managing static mapped areas
A static mapped area is ARM-specific, so it is better not to use generic vmalloc data structure, that is, vmlist and vmlist_lock for managing static mapped area. And it causes some needless overhead and reducing this overhead is better idea. Now, we have newly introduced static_vm infrastructure. With it, we don't need to iterate all mapped areas. Instead, we just iterate static mapped areas. It helps to reduce an overhead of finding matched area. And architecture dependency on vmalloc layer is removed, so it will help to maintainability for vmalloc layer. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h index 1bb6604..0d9c685 100644 --- a/arch/arm/include/asm/mach/static_vm.h +++ b/arch/arm/include/asm/mach/static_vm.h @@ -32,6 +32,12 @@ struct static_vm { const void *caller; }; +#define STATIC_VM_MEM 0x0001 +#define STATIC_VM_EMPTY0x0002 + +/* mtype should be less than 28 */ +#define STATIC_VM_MTYPE(mt)(1UL ((mt) + 4)) + extern struct static_vm *static_vmlist; extern spinlock_t static_vmlist_lock; diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c index 5dcc2fd..b7f3c27 100644 --- a/arch/arm/mm/ioremap.c +++ b/arch/arm/mm/ioremap.c @@ -36,6 +36,7 @@ #include asm/system_info.h #include asm/mach/map.h +#include asm/mach/static_vm.h #include asm/mach/pci.h #include mm.h @@ -197,7 +198,8 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, const struct mem_type *type; int err; unsigned long addr; - struct vm_struct * area; + struct vm_struct *area; + phys_addr_t paddr = __pfn_to_phys(pfn); #ifndef CONFIG_ARM_LPAE /* @@ -219,24 +221,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, /* * Try to reuse one of the static mapping whenever possible. 
*/ - read_lock(vmlist_lock); - for (area = vmlist; area; area = area-next) { - if (!size || (sizeof(phys_addr_t) == 4 pfn = 0x10)) - break; - if (!(area-flags VM_ARM_STATIC_MAPPING)) - continue; - if ((area-flags VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype)) - continue; - if (__phys_to_pfn(area-phys_addr) pfn || - __pfn_to_phys(pfn) + size-1 area-phys_addr + area-size-1) - continue; - /* we can drop the lock here as we know *area is static */ - read_unlock(vmlist_lock); - addr = (unsigned long)area-addr; - addr += __pfn_to_phys(pfn) - area-phys_addr; - return (void __iomem *) (offset + addr); + if (size !((sizeof(phys_addr_t) == 4 pfn = 0x10))) { + struct static_vm *static_vm; + + static_vm = find_static_vm_paddr(__pfn_to_phys(pfn), size, + STATIC_VM_MEM | STATIC_VM_MTYPE(mtype)); + if (static_vm) { + addr = (unsigned long)static_vm-vaddr; + addr += paddr - static_vm-paddr; + return (void __iomem *) (offset + addr); + } } - read_unlock(vmlist_lock); /* * Don't allow RAM to be mapped - this causes problems with ARMv6+ @@ -248,7 +243,7 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, if (!area) return NULL; addr = (unsigned long)area-addr; - area-phys_addr = __pfn_to_phys(pfn); + area-phys_addr = paddr; #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) if (DOMAIN_IO == 0 @@ -346,34 +341,20 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, bool cached) void __iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct *vm; - - read_lock(vmlist_lock); - for (vm = vmlist; vm; vm = vm-next) { - if (vm-addr addr) - break; - if (!(vm-flags VM_IOREMAP)) - continue; - /* If this is a static mapping we must leave it alone */ - if ((vm-flags VM_ARM_STATIC_MAPPING) - (vm-addr = addr) (vm-addr + vm-size addr)) { - read_unlock(vmlist_lock); - return; - } + struct static_vm *static_vm; + + static_vm = find_static_vm_vaddr(addr, STATIC_VM_MEM); + if (static_vm) + return; + #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) - /* -* If this is a section based mapping we need to handle it -* specially as the VM subsystem does not know how to handle -* such a beast. -*/ - if ((vm-addr == addr
Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
Hi, Minchan. 2012/11/14 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote: 2012/11/13 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote: 2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped. 
do kmap_flush_unused() => flush index 10, 11 => last_pkmap = 10; do kunmap() with index 17; do kmap_flush_unused() => flush index 17 => last_pkmap = 17? In this case, an unconditional last_pkmap_nr update skips one kunmapped index. So, a conditional update is needed. Thanks for pointing out, Joonsoo. You're right. I misunderstood your flush_all_zero_pkmaps change. With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index. What's the benefit of returning the first flushed free slot index rather than the free slot index? If flush_all_zero_pkmaps() returned the free slot index rather than the first flushed free slot, we would need another comparison like 'if pkmap_count[i] == 0' and another local variable for determining whether a flush occurred or not. I want to minimize this overhead and churning of the code, although they are negligible. I think
[RFC PATCH] mm: WARN_ON_ONCE if f_op->mmap() changes vma's start address
While reviewing the source code, I found a comment which mentions that after f_op->mmap(), vma's start address can be changed. I didn't verify that it is really possible, because there are so many f_op->mmap() implementations. But if there is some mmap() which changes vma's start address, it is a possible error situation, because we have already prepared prev vma, rb_link and rb_parent, and these are related to the original address. So add WARN_ON_ONCE to find out whether this situation really happens. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/mmap.c b/mm/mmap.c index 2d94235..36567b7 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1333,7 +1333,11 @@ munmap_back: * * Answer: Yes, several device drivers can do it in their * f_op->mmap method. -DaveM + * Bug: If addr is changed, prev, rb_link, rb_parent should + * be updated for vma_link() */ + WARN_ON_ONCE(addr != vma->vm_start); + addr = vma->vm_start; pgoff = vma->vm_pgoff; vm_flags = vma->vm_flags; -- 1.7.9.5
Re: [RFC PATCH 0/3] introduce static_vm for ARM-specific static mapped area
Hello, Russell. Thanks for the review. 2012/11/15 Russell King - ARM Linux li...@arm.linux.org.uk: On Thu, Nov 15, 2012 at 01:55:51AM +0900, Joonsoo Kim wrote: In the current implementation, we use an ARM-specific flag, VM_ARM_STATIC_MAPPING, to distinguish the ARM-specific static mapped areas. The purpose of a static mapped area is to re-use it when the entire physical address range of an ioremap request can be covered by that area. This implementation causes needless overhead in some cases. In what cases? For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas to decide whether one matches or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap we still check all 300 areas. We unnecessarily iterate vmlist to find a matching area even if there is no static mapped area. And if there are some static mapped areas, iterating the whole vmlist is not preferable. Why not? Please put some explanation into your message rather than just statements making unexplained assertions. If we construct an extra list for static mapped areas, we can eliminate the above-mentioned overhead. With an extra list, if there is one static mapped area, we check only that one area and proceed to the next operation quickly. Another reason for doing this work is to remove the architecture dependency on the vmalloc layer. I think that vmlist and vmlist_lock are internal data structures of the vmalloc layer. Some code for debugging and statistics inevitably uses vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as little as possible. The vmalloc layer is also made available for ioremap use, and it is intended that architectures hook into this for ioremap support. Yes. But I think that it is preferable to use a well-defined vmalloc API rather than directly manipulating a low-level data structure. IMHO, if there is no suitable vmalloc API, making a new one is better than directly manipulating the low-level data structure. It makes the vmalloc code more maintainable. I'm not an expert, so please let me know what I missed. Thanks.
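To illustrate the idea of keeping the static mappings on a dedicated, short list instead of scanning the whole vmlist on every ioremap, here is a rough user-space sketch; the structure layout and function names are purely illustrative and are not the actual patch:

#include <stdio.h>

/* a dedicated list of boot-time static mappings, kept separate from
 * the (potentially hundreds of) dynamically created vm areas */
struct static_vm {
	unsigned long phys_start;
	unsigned long phys_end;
	void *virt;
	struct static_vm *next;
};

static struct static_vm *static_vmlist;	/* usually only a few entries */

/* try to re-use a static mapping that covers [paddr, paddr + size) */
static void *find_static_mapping(unsigned long paddr, unsigned long size)
{
	struct static_vm *svm;

	for (svm = static_vmlist; svm; svm = svm->next) {
		if (paddr >= svm->phys_start && paddr + size <= svm->phys_end)
			return (char *)svm->virt + (paddr - svm->phys_start);
	}
	return NULL;	/* caller falls back to creating a new mapping */
}

int main(void)
{
	static struct static_vm uart = {
		.phys_start = 0x10000000, .phys_end = 0x10100000,
		.virt = (void *)0xf8000000UL,
	};

	static_vmlist = &uart;
	printf("covered request:   %p\n", find_static_mapping(0x10001000, 0x1000));
	printf("uncovered request: %p\n", find_static_mapping(0x20000000, 0x1000));
	return 0;
}

With a list like this, ioremap only walks the handful of static entries before deciding to create a new mapping, instead of iterating every vm area.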
Re: [PATCH] slub: assign refcount for kmalloc_caches
On Thu, Jan 10, 2013 at 08:47:39PM -0800, Paul Hargrove wrote: I just had a look at patch-3.7.2-rc1, and this change doesn't appear to have made it in yet. Am I missing something? -Paul I will check it. Ccing Greg. Hello, Pekka and Greg. In v3.8-rcX this is already fixed by other changes, but it is not a simple change. So I made a new patch and sent it. How does this kind of patch (only for stable v3.7) go into the stable tree? Through Pekka's slab tree, or should I send it to Greg directly? I don't know exactly how to submit this kind of patch to the stable tree. Could anyone help me? Thanks. On Tue, Dec 25, 2012 at 7:30 AM, JoonSoo Kim js1...@gmail.com wrote: 2012/12/26 Joonsoo Kim js1...@gmail.com: commit cce89f4f6911286500cf7be0363f46c9b0a12ce0 ('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed the refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes an erroneous situation. Paul Hargrove reported that when he creates an 8-byte kmem_cache and destroys it, he encounters the message below. 'Objects remaining in kmalloc-8 on kmem_cache_close()' The 8-byte kmem_cache is merged with the 8-byte kmalloc cache and the refcount is increased by one. So, the resulting refcount is 1. When destroying it, the refcount hits 0, then kmem_cache_close() is executed and the error message is printed. This patch assigns an initial refcount of 1 to kmalloc_caches, fixing this erroneous situation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s->refcount = 1; list_add(&s->list, &slab_caches); return s; -- 1.7.9.5 I missed some explanation. In v3.8-rc1, this problem is already solved. See create_kmalloc_cache() in mm/slab_common.c. So this patch is just for v3.7 stable. -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
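As a toy model of the refcount sequence described above (pure illustration, not the slub code):

#include <stdio.h>

int main(void)
{
	int refcount_before_fix = 0;	/* create_kmalloc_cache() left it unset */
	int refcount_after_fix = 1;	/* s->refcount = 1 with the patch */

	/* a user's 8-byte kmem_cache is merged into kmalloc-8 */
	refcount_before_fix++;
	refcount_after_fix++;

	/* that merged cache is destroyed again */
	refcount_before_fix--;
	refcount_after_fix--;

	printf("before fix: refcount %d, kmem_cache_close() tears down kmalloc-8\n",
	       refcount_before_fix);
	printf("after fix:  refcount %d, kmalloc-8 stays alive\n",
	       refcount_after_fix);
	return 0;
}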
[PATCH 1/3] slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects. After acquire_slab() is executed at first, page-inuse is same as page-objects, then, available is always 0. So, we always go next iteration. Correct it to stop a iteration when we find sufficient objects at first iteration. After that, we don't need return value of put_cpu_partial(). So remove it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com --- These are based on v3.8-rc3 and there is no dependency between each other. If rebase is needed, please notify me. diff --git a/mm/slub.c b/mm/slub.c index ba2ca53..abef30e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1493,7 +1493,7 @@ static inline void remove_partial(struct kmem_cache_node *n, */ static inline void *acquire_slab(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page, - int mode) + int mode, int *objects) { void *freelist; unsigned long counters; @@ -1506,6 +1506,7 @@ static inline void *acquire_slab(struct kmem_cache *s, */ freelist = page-freelist; counters = page-counters; + *objects = page-objects - page-inuse; new.counters = counters; if (mode) { new.inuse = page-objects; @@ -1528,7 +1529,7 @@ static inline void *acquire_slab(struct kmem_cache *s, return freelist; } -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags); /* @@ -1539,6 +1540,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, { struct page *page, *page2; void *object = NULL; + int available = 0; + int objects; /* * Racy check. If we mistakenly see no partial slabs then we @@ -1552,22 +1555,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, spin_lock(n-list_lock); list_for_each_entry_safe(page, page2, n-partial, lru) { void *t; - int available; if (!pfmemalloc_match(page, flags)) continue; - t = acquire_slab(s, n, page, object == NULL); + t = acquire_slab(s, n, page, object == NULL, objects); if (!t) break; + available += objects; if (!object) { c-page = page; stat(s, ALLOC_FROM_PARTIAL); object = t; - available = page-objects - page-inuse; } else { - available = put_cpu_partial(s, page, 0); + put_cpu_partial(s, page, 0); stat(s, CPU_PARTIAL_NODE); } if (kmem_cache_debug(s) || available s-cpu_partial / 2) @@ -1946,7 +1948,7 @@ static void unfreeze_partials(struct kmem_cache *s, * If we did not find a slot then simply move all the partials to the * per node partial list. */ -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) { struct page *oldpage; int pages; @@ -1984,7 +1986,6 @@ static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) page-next = oldpage; } while (this_cpu_cmpxchg(s-cpu_slab-partial, oldpage, page) != oldpage); - return pobjects; } static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
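The ordering problem is easier to see in isolation. A toy illustration (not the slub code itself) of why reading page->inuse after acquire_slab() has frozen the slab always yields zero availability:

#include <stdio.h>

struct toy_page { int objects; int inuse; };

/* old order: freeze the slab first, then compute availability */
static int acquire_then_count(struct toy_page *p)
{
	p->inuse = p->objects;		/* what acquire_slab() does for the first slab */
	return p->objects - p->inuse;	/* always 0 */
}

/* fixed order: record availability from the snapshot, then freeze */
static int count_then_acquire(struct toy_page *p)
{
	int available = p->objects - p->inuse;

	p->inuse = p->objects;
	return available;
}

int main(void)
{
	struct toy_page a = { .objects = 16, .inuse = 5 };
	struct toy_page b = { .objects = 16, .inuse = 5 };

	printf("old order:   %d objects available\n", acquire_then_count(&a));
	printf("fixed order: %d objects available\n", count_then_acquire(&b));
	return 0;
}

Because the old code always saw 0 available objects, the 'available > s->cpu_partial / 2' cut-off never fired on the first pass, which is exactly the behaviour the patch corrects.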
[PATCH 3/3] slub: add 'likely' macro to inc_slabs_node()
After the boot phase, 'n' always exists. So add the 'likely' macro to help the compiler. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index 830348b..6f82070 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1005,7 +1005,7 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects) * dilemma by deferring the increment of the count during * bootstrap (see early_kmem_cache_node_alloc). */ - if (n) { + if (likely(n)) { atomic_long_inc(&n->nr_slabs); atomic_long_add(objects, &n->total_objects); } -- 1.7.9.5
[PATCH 2/3] slub: correct bootstrap() for kmem_cache, kmem_cache_node
Current implementation of bootstrap() is not sufficient for kmem_cache and kmem_cache_node. First, for kmem_cache. bootstrap() call kmem_cache_zalloc() at first. When kmem_cache_zalloc() is called, kmem_cache's slab is moved to cpu slab for satisfying kmem_cache allocation request. In current implementation, we only consider n-partial slabs, so, we miss this cpu slab for kmem_cache. Second, for kmem_cache_node. When slab_state = PARTIAL, create_boot_cache() is called. And then, kmem_cache_node's slab is moved to cpu slab for satisfying kmem_cache_node allocation request. So, we also miss this slab. These didn't make any error previously, because we normally don't free objects which comes from kmem_cache's first slab and kmem_cache_node's. Problem will be solved if we consider a cpu slab in bootstrap(). This patch implement it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index abef30e..830348b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3613,11 +3613,22 @@ static int slab_memory_callback(struct notifier_block *self, static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache) { + int cpu; int node; struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT); memcpy(s, static_cache, kmem_cache-object_size); + for_each_possible_cpu(cpu) { + struct kmem_cache_cpu *c; + struct page *p; + + c = per_cpu_ptr(s-cpu_slab, cpu); + p = c-page; + if (p) + p-slab_cache = s; + } + for_each_node_state(node, N_NORMAL_MEMORY) { struct kmem_cache_node *n = get_node(s, node); struct page *p; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] tools, vm: add .gitignore to ignore built binaries
There is no .gitignore in tools/vm, so 'git status' always shows the built binaries. To ignore them, add a .gitignore. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/tools/vm/.gitignore b/tools/vm/.gitignore new file mode 100644 index 000..44f095f --- /dev/null +++ b/tools/vm/.gitignore @@ -0,0 +1,2 @@ +slabinfo +page-types -- 1.7.9.5
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On Fri, Aug 02, 2013 at 12:20:40PM +0200, Peter Zijlstra wrote: On Fri, Aug 02, 2013 at 06:05:51PM +0900, Joonsoo Kim wrote: What is with you people; have you never learned to trim emails? Seriously, I'm going to write a script which tests for too many quoted lines, too many nested quotes and other such nonsense and drops your emails into /dev/null. Hello, Peter. Sorry about that. I will be more careful about this from now on. Thanks.
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On Mon, Aug 05, 2013 at 09:52:28AM +0530, Preeti U Murthy wrote: On 08/02/2013 04:02 PM, Peter Zijlstra wrote: On Fri, Aug 02, 2013 at 02:56:14PM +0530, Preeti U Murthy wrote: You need to iterate over all the groups of the sched domain env-sd and not just the first group of env-sd like you are doing above. This is to I don't think so. IIRC, env-sd-groups always means local group, so we don't need to find our group by iterating over all the groups. Take a look at update_sd_lb_stats(). That should clarify this. There is an exclusive local_group check there. sd-groups points to the first group in the list of groups under this sd. Take a look at: 88b8dac0a Ah ok! Thanks for this pointer. Apologies for having overlooked the fact that the sd-groups always points to the group to which the balance_cpu belongs. And subsequent dst_cpus for retries of load balancing also belong to the same group as the balance_cpu. This patch thus looks fine to me. Okay! Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/3] sched: clean-up struct sd_lb_stat
+ if (busiest-group_imb) { + busiest-sum_weighted_load = + min(busiest-sum_weighted_load, sds-sd_avg_load); Right here we get confused as to why the total load is being compared against load per task (although you are changing it to load per task above). Yes, you are right. I will add load_per_task to struct sg_lb_stats. @@ -4771,12 +4763,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * Be careful of negative numbers as they'll appear as very large values * with unsigned longs. */ - max_pull = min(sds-max_load - sds-avg_load, load_above_capacity); + max_pull = min(busiest-avg_load - sds-sd_avg_load, + load_above_capacity); This is ok, but readability wise max_load is much better. max_load signifies the maximum load per task on a group in this sd, and avg_load signifies the total load per task across the sd. You are checking if there is imbalance in the total load in the sd and try to even it out across the sd. Here busiest does not convey this immediately. You may be confused. max_load doesn't means the maximum load per task. It means that the busiest group's load per cpu power. And here max doesn't mean maximum load in this sd, instead, load of busiest sg. IMO, busiest-avg_load convey proper meaning. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
Any mapping that doesn't use the reserved pool, not just MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping, then fork()s, and then the mapping is instantiated in the child, that will not draw from the reserved pool. Should we guarantee that they can allocate the last hugepage? They map a region with MAP_NORESERVE, so they shouldn't assume that their requests always succeed. If the pages are available, people get cranky if it fails for no apparent reason, MAP_NORESERVE or not. They get especially cranky if it sometimes fails and sometimes doesn't due to a race condition. Hello, Hmm... Okay. I will try to implement another way to protect against the race condition. Maybe it is best to use a mutex table :) Anyway, please give me some time, guys. Thank you very much for pointing that out.
Re: [PATCH 2/4] mm, migrate: allocate new page lazily in unmap_and_move()
get_new_page() sets up result to communicate error codes from the following checks. While the existing ones (page freed and thp split failed) don't change rc, somebody else might add a condition whose error code should be propagated back into *result but miss it. Please leave get_new_page() where it is. The win from this change is not big enough to risk these problems. Hello, Johannes. Okay. You are right. I will omit this patch next time. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/4] mm, page_alloc: add likely macro to help compiler optimization
Hello, Michal. On Fri, Aug 02, 2013 at 11:36:07PM +0200, Michal Hocko wrote: On Fri 02-08-13 16:47:10, Johannes Weiner wrote: On Fri, Aug 02, 2013 at 06:27:22PM +0200, Michal Hocko wrote: On Fri 02-08-13 11:07:56, Joonsoo Kim wrote: We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow path. For making fast path more faster, add likely macro to help compiler optimization. The code is different in mmotm tree (see mm: page_alloc: rearrange watermark checking in get_page_from_freelist) Yes, please rebase this on top. Besides that, make sure you provide numbers which prove your claims about performance optimizations. Isn't that a bit overkill? We know it's a likely path (we would deadlock constantly if a sizable portion of allocations were to ignore the watermarks). Does he have to justify that likely in general makes sense? That was more a generic comment. If there is a claim that something would be faster it would be nice to back that claim by some numbers (e.g. smaller hot path). In this particular case, unlikely(alloc_flags ALLOC_NO_WATERMARKS) doesn't make any change to the generated code with gcc 4.8.1 resp. 4.3.4 I have here. Maybe other versions of gcc would benefit from the hint but changelog didn't tell us. I wouldn't add the anotation if it doesn't make any difference for the resulting code. Hmm, Is there no change with gcc 4.8.1 and 4.3.4? I found a change with gcc 4.6.3 and v3.10 kernel. textdata bss dec hex filename 35683 1461 644 37788939c page_alloc_base.o 35715 1461 644 3782093bc page_alloc_patch.o Slightly larger (32 bytes) than before. And assembly code looks different as I expected. * Original code 17126 .LVL1518: 17127 .loc 2 1904 0 is_stmt 1 17128 testb $4, -116(%rbp) #, %sfp 17129 je .L866 #, (snip) 17974 .L866: 17975 .LBE6053: 17976 .LBE6052: 17977 .LBE6051: 17978 .LBE6073: 17979 .LBE6093: 17980 .LBB6094: 17981 .loc 2 1908 0 17982 movl-116(%rbp), %r14d # %sfp, D.42080 17983 .loc 2 1909 0 17984 movl-116(%rbp), %r8d# %sfp, 17985 movq%rbx, %rdi # prephitmp.1723, 17986 movl-212(%rbp), %ecx# %sfp, 17987 movl-80(%rbp), %esi # %sfp, 17988 .loc 2 1908 0 17989 andl$3, %r14d #, D.42080 17990 movslq %r14d, %rax # D.42080, D.42080 17991 movq(%rbx,%rax,8), %r13 # prephitmp.1723_268-watermark, mark 17992 .LVL1591: 17993 .loc 2 1909 0 17994 movq%r13, %rdx # mark, 17995 callzone_watermark_ok # On 17129 line, we check ALLOC_NO_WATERMARKS and if not matched, then jump to L866. L866 is on 17981 line. * Patched code 17122 .L807: 17123 .LVL1513: 17124 .loc 2 1904 0 is_stmt 1 17125 testb $4, -88(%rbp) #, %sfp 17126 jne .L811 #, 17127 .LBB6092: 17128 .loc 2 1908 0 17129 movl-88(%rbp), %r13d# %sfp, D.42082 17130 .loc 2 1909 0 17131 movl-88(%rbp), %r8d # %sfp, 17132 movq%rbx, %rdi # prephitmp.1723, 17133 movl-160(%rbp), %ecx# %sfp, 17134 movl-80(%rbp), %esi # %sfp, 17135 .loc 2 1908 0 17136 andl$3, %r13d #, D.42082 17137 movslq %r13d, %rax # D.42082, D.42082 17138 movq(%rbx,%rax,8), %r12 # prephitmp.1723_270-watermark, mark 17139 .LVL1514: 17140 .loc 2 1909 0 17141 movq%r12, %rdx # mark, 17142 callzone_watermark_ok # On 17124 line, we check ALLOC_NO_WATERMARKS (0x4) and if not matched, execute following code without jumping. This is effect of 'likely' macro. Isn't it reasonable? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
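For reference, likely() and unlikely() are thin wrappers around gcc's __builtin_expect(); the definitions below match the usual ones in include/linux/compiler.h. A minimal user-space file (the flag value and helper names are made up) that can be compiled with different gcc versions and inspected with objdump, the same way as the assembly above, to see whether the hint changes the branch layout:

/* likely.c: gcc -O2 -c likely.c && objdump -d likely.o */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

extern int common_path(int flags);	/* stands in for the watermark check */
extern int rare_path(int flags);	/* stands in for ALLOC_NO_WATERMARKS handling */

int without_hint(int flags)
{
	if (flags & 0x4)
		return rare_path(flags);
	return common_path(flags);
}

int with_hint(int flags)
{
	if (unlikely(flags & 0x4))
		return rare_path(flags);
	return common_path(flags);	/* the hint asks gcc to keep this on the fall-through path */
}

Whether the two functions end up with different code depends on the compiler version, which is exactly the point being debated here.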
Re: [PATCH 1/4] mm, page_alloc: add likely macro to help compiler optimization
On Mon, Aug 05, 2013 at 05:10:08PM +0900, Joonsoo Kim wrote: Hello, Michal. On Fri, Aug 02, 2013 at 11:36:07PM +0200, Michal Hocko wrote: On Fri 02-08-13 16:47:10, Johannes Weiner wrote: On Fri, Aug 02, 2013 at 06:27:22PM +0200, Michal Hocko wrote: On Fri 02-08-13 11:07:56, Joonsoo Kim wrote: We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow path. For making fast path more faster, add likely macro to help compiler optimization. The code is different in mmotm tree (see mm: page_alloc: rearrange watermark checking in get_page_from_freelist) Yes, please rebase this on top. Besides that, make sure you provide numbers which prove your claims about performance optimizations. Isn't that a bit overkill? We know it's a likely path (we would deadlock constantly if a sizable portion of allocations were to ignore the watermarks). Does he have to justify that likely in general makes sense? That was more a generic comment. If there is a claim that something would be faster it would be nice to back that claim by some numbers (e.g. smaller hot path). In this particular case, unlikely(alloc_flags ALLOC_NO_WATERMARKS) doesn't make any change to the generated code with gcc 4.8.1 resp. 4.3.4 I have here. Maybe other versions of gcc would benefit from the hint but changelog didn't tell us. I wouldn't add the anotation if it doesn't make any difference for the resulting code. Hmm, Is there no change with gcc 4.8.1 and 4.3.4? I found a change with gcc 4.6.3 and v3.10 kernel. Ah... I did a test on mmotm (Johannes's git) and found that this patch doesn't make any effect. I guess, a change from Johannes ('rearrange watermark checking in get_page_from_freelist') already makes better code for !ALLOC_NO_WATERMARKS case. IMHO, although there is no effect, it is better to add likely macro, because arrangement can be changed from time to time without any consideration of assembly code generation. How about your opinion, Johannes and Michal? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/3] optimization, clean-up about fair.c
Patch 1 removes one division operation by replacing it with a multiplication. Patches 2 and 3 are clean-ups related to load_balance(); they reduce code size and may also improve readability. Feel free to comment on this patchset. It is tested on v3.10. On v3.11-rc3, it compiles properly. * Changes from v2 [2/3]: set initial value 1 for should_balance in rebalance_domains() as Vincent commented [3/3]: do not overwrite sum_weighted_load to represent load_per_task. Instead, use a new load_per_task field as Preeti commented * Changes from v1 Remove 2 patches, because I'm not sure they are right. Joonsoo Kim (3): sched: remove one division operation in find_busiest_queue() sched: factor out code to should_we_balance() sched: clean-up struct sd_lb_stat kernel/sched/fair.c | 326 +-- 1 file changed, 162 insertions(+), 164 deletions(-) -- 1.7.9.5
[PATCH v3 3/3] sched: clean-up struct sd_lb_stat
There is no reason to maintain separate variables for this_group and busiest_group in sd_lb_stat, except saving some space. But this structure is always allocated in stack, so this saving isn't really benificial. This patch unify these variables, so IMO, readability may be improved. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6732d2..f8a9660 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4232,36 +4232,6 @@ static unsigned long task_h_load(struct task_struct *p) /** Helpers for find_busiest_group / /* - * sd_lb_stats - Structure to store the statistics of a sched_domain - * during load balancing. - */ -struct sd_lb_stats { - struct sched_group *busiest; /* Busiest group in this sd */ - struct sched_group *this; /* Local group in this sd */ - unsigned long total_load; /* Total load of all groups in sd */ - unsigned long total_pwr; /* Total power of all groups in sd */ - unsigned long avg_load;/* Average load across all groups in sd */ - - /** Statistics of this group */ - unsigned long this_load; - unsigned long this_load_per_task; - unsigned long this_nr_running; - unsigned long this_has_capacity; - unsigned int this_idle_cpus; - - /* Statistics of the busiest group */ - unsigned int busiest_idle_cpus; - unsigned long max_load; - unsigned long busiest_load_per_task; - unsigned long busiest_nr_running; - unsigned long busiest_group_capacity; - unsigned long busiest_has_capacity; - unsigned int busiest_group_weight; - - int group_imb; /* Is there imbalance in this sd */ -}; - -/* * sg_lb_stats - stats of a sched_group required for load_balancing */ struct sg_lb_stats { @@ -4269,6 +4239,7 @@ struct sg_lb_stats { unsigned long group_load; /* Total load over the CPUs of the group */ unsigned long sum_nr_running; /* Nr tasks running in the group */ unsigned long sum_weighted_load; /* Weighted load of group's tasks */ + unsigned long load_per_task; unsigned long group_capacity; unsigned long idle_cpus; unsigned long group_weight; @@ -4276,6 +4247,24 @@ struct sg_lb_stats { int group_has_capacity; /* Is there extra capacity in the group? */ }; +/* + * sd_lb_stats - Structure to store the statistics of a sched_domain + * during load balancing. + */ +struct sd_lb_stats { + struct sched_group *busiest; /* Busiest group in this sd */ + struct sched_group *this; /* Local group in this sd */ + unsigned long total_load; /* Total load of all groups in sd */ + unsigned long total_pwr; /* Total power of all groups in sd */ + unsigned long sd_avg_load; /* Average load across all groups in sd */ + + /* Statistics of this group */ + struct sg_lb_stats this_stat; + + /* Statistics of the busiest group */ + struct sg_lb_stats busiest_stat; +}; + /** * get_sd_load_idx - Obtain the load index for a given sched domain. * @sd: The sched_domain whose load_idx is to be obtained. 
@@ -4556,7 +4545,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, struct sched_group *sg, struct sg_lb_stats *sgs) { - if (sgs-avg_load = sds-max_load) + if (sgs-avg_load = sds-busiest_stat.avg_load) return false; if (sgs-sum_nr_running sgs-group_capacity) @@ -4593,7 +4582,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, { struct sched_domain *child = env-sd-child; struct sched_group *sg = env-sd-groups; - struct sg_lb_stats sgs; + struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0; if (child child-flags SD_PREFER_SIBLING) @@ -4603,13 +4592,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, do { int local_group; + struct sg_lb_stats *sgs; local_group = cpumask_test_cpu(env-dst_cpu, sched_group_cpus(sg)); - memset(sgs, 0, sizeof(sgs)); - update_sg_lb_stats(env, sg, load_idx, local_group, sgs); - - sds-total_load += sgs.group_load; - sds-total_pwr += sg-sgp-power; + if (local_group) + sgs = sds-this_stat; + else { + sgs = tmp_sgs; + memset(sgs, 0, sizeof(*sgs)); + } + update_sg_lb_stats(env, sg, load_idx, local_group, sgs); /* * In case the child domain prefers tasks go to siblings @@ -4621,26 +4613,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, * heaviest group when it is already under-utilized (possible * with a large weight task
[PATCH v2 mmotm 3/3] swap: clean-up #ifdef in page_mapping()
PageSwapCache() is always false when !CONFIG_SWAP, so compiler properly discard related code. Therefore, we don't need #ifdef explicitly. Acked-by: Johannes Weiner han...@cmpxchg.org Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/include/linux/swap.h b/include/linux/swap.h index 24db914..c03c139 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -447,6 +447,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) #else /* CONFIG_SWAP */ +#define swap_address_space(entry) (NULL) #define get_nr_swap_pages()0L #define total_swap_pages 0L #define total_swapcache_pages()0UL diff --git a/mm/util.c b/mm/util.c index 7441c41..eaf63fc2 100644 --- a/mm/util.c +++ b/mm/util.c @@ -388,15 +388,12 @@ struct address_space *page_mapping(struct page *page) struct address_space *mapping = page-mapping; VM_BUG_ON(PageSlab(page)); -#ifdef CONFIG_SWAP if (unlikely(PageSwapCache(page))) { swp_entry_t entry; entry.val = page_private(page); mapping = swap_address_space(entry); - } else -#endif - if ((unsigned long)mapping PAGE_MAPPING_ANON) + } else if ((unsigned long)mapping PAGE_MAPPING_ANON) mapping = NULL; return mapping; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 mmotm 1/3] mm, page_alloc: add unlikely macro to help compiler optimization
We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the slow path. To help compiler optimization, add the unlikely macro to the ALLOC_NO_WATERMARKS check. This patch doesn't have any effect now, because gcc already optimizes this properly. But we cannot assume that gcc always gets it right, and nobody re-evaluates whether gcc still optimizes properly after their changes; for example, it is not optimized properly on v3.10. So adding a compiler hint here is reasonable. Acked-by: Johannes Weiner han...@cmpxchg.org Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f5c549c..04bec49 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1901,7 +1901,7 @@ zonelist_scan: !cpuset_zone_allowed_softwall(zone, gfp_mask)) continue; BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); - if (alloc_flags & ALLOC_NO_WATERMARKS) + if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* * Distribute pages in proportion to the individual -- 1.7.9.5
[PATCH v3 2/3] sched: factor out code to should_we_balance()
Now checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking at update_sg_lb_stats(), except saving one iteration for sched_group_cpus. In this patch, I factor out this checking to should_we_balance() function. And before doing actual work for load_balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quit the work immediately. With this change, we can save two memset cost and can expect better compiler optimization. Below is result of this patch. * Vanilla * textdata bss dec hex filename 344991136 116 357518ba7 kernel/sched/fair.o * Patched * textdata bss dec hex filename 342431136 116 354958aa7 kernel/sched/fair.o In addition, rename @balance to @should_balance in order to represent its purpose more clearly. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52898dc..c6732d2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4463,22 +4463,17 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) * @group: sched_group whose statistics are to be updated. * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. - * @balance: Should we balance. * @sgs: variable to hold the statistics for this group. */ static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, - int local_group, int *balance, struct sg_lb_stats *sgs) + int local_group, struct sg_lb_stats *sgs) { unsigned long nr_running, max_nr_running, min_nr_running; unsigned long load, max_cpu_load, min_cpu_load; - unsigned int balance_cpu = -1, first_idle_cpu = 0; unsigned long avg_load_per_task = 0; int i; - if (local_group) - balance_cpu = group_balance_cpu(group); - /* Tally up the load of all CPUs in the group */ max_cpu_load = 0; min_cpu_load = ~0UL; @@ -4491,15 +4486,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, nr_running = rq-nr_running; /* Bias balancing toward cpus of our domain */ - if (local_group) { - if (idle_cpu(i) !first_idle_cpu - cpumask_test_cpu(i, sched_group_mask(group))) { - first_idle_cpu = 1; - balance_cpu = i; - } - + if (local_group) load = target_load(i, load_idx); - } else { + else { load = source_load(i, load_idx); if (load max_cpu_load) max_cpu_load = load; @@ -4519,22 +4508,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs-idle_cpus++; } - /* -* First idle cpu or the first cpu(busiest) in this sched group -* is eligible for doing load balancing at this and above -* domains. In the newly idle case, we will allow all the cpu's -* to do the newly idle load balance. -*/ - if (local_group) { - if (env-idle != CPU_NEWLY_IDLE) { - if (balance_cpu != env-dst_cpu) { - *balance = 0; - return; - } - update_group_power(env-sd, env-dst_cpu); - } else if (time_after_eq(jiffies, group-sgp-next_update)) - update_group_power(env-sd, env-dst_cpu); - } + if (local_group (env-idle != CPU_NEWLY_IDLE || + time_after_eq(jiffies, group-sgp-next_update))) + update_group_power(env-sd, env-dst_cpu); /* Adjust by relative CPU power of the group */ sgs-avg_load = (sgs-group_load*SCHED_POWER_SCALE) / group-sgp-power; @@ -4613,7 +4589,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * @sds: variable to hold the statistics for this sched_domain. 
*/ static inline void update_sd_lb_stats(struct lb_env *env, - int *balance, struct sd_lb_stats *sds) + struct sd_lb_stats *sds) { struct sched_domain *child = env-sd-child; struct sched_group *sg = env-sd-groups; @@ -4630,10 +4606,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, local_group = cpumask_test_cpu(env-dst_cpu, sched_group_cpus(sg)); memset(sgs, 0, sizeof(sgs)); - update_sg_lb_stats(env, sg
[PATCH v2 mmotm 2/3] mm: move pgtable related functions to right place
pgtable related functions are mostly in pgtable-generic.c. So move remaining functions from memory.c to pgtable-generic.c. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/memory.c b/mm/memory.c index f2ab2a8..8fd4d42 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -374,30 +374,6 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ /* - * If a p?d_bad entry is found while walking page tables, report - * the error, before resetting entry to p?d_none. Usually (but - * very seldom) called out from the p?d_none_or_clear_bad macros. - */ - -void pgd_clear_bad(pgd_t *pgd) -{ - pgd_ERROR(*pgd); - pgd_clear(pgd); -} - -void pud_clear_bad(pud_t *pud) -{ - pud_ERROR(*pud); - pud_clear(pud); -} - -void pmd_clear_bad(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - pmd_clear(pmd); -} - -/* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. */ diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index e1a6e4f..3929a40 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -10,6 +10,30 @@ #include asm/tlb.h #include asm-generic/pgtable.h +/* + * If a p?d_bad entry is found while walking page tables, report + * the error, before resetting entry to p?d_none. Usually (but + * very seldom) called out from the p?d_none_or_clear_bad macros. + */ + +void pgd_clear_bad(pgd_t *pgd) +{ + pgd_ERROR(*pgd); + pgd_clear(pgd); +} + +void pud_clear_bad(pud_t *pud) +{ + pud_ERROR(*pud); + pud_clear(pud); +} + +void pmd_clear_bad(pmd_t *pmd) +{ + pmd_ERROR(*pmd); + pmd_clear(pmd); +} + #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Only sets the access flags (dirty, accessed), as well as write -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/3] sched: remove one division operation in find_busiest_queue()
Remove one division operation in find_busiest_queue(). Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9565645..52898dc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4968,7 +4968,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, struct sched_group *group) { struct rq *busiest = NULL, *rq; - unsigned long max_load = 0; + unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE; int i; for_each_cpu(i, sched_group_cpus(group)) { @@ -4999,10 +4999,9 @@ static struct rq *find_busiest_queue(struct lb_env *env, * the load can be moved away from the cpu that is potentially * running at a lower capacity. */ - wl = (wl * SCHED_POWER_SCALE) / power; - - if (wl > max_load) { - max_load = wl; + if (wl * busiest_power > busiest_load * power) { + busiest_load = wl; + busiest_power = power; busiest = rq; } } -- 1.7.9.5
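The change above trades one division per candidate CPU for a cross-multiplied comparison. A small user-space check (the loads and powers are made-up numbers) showing that the two forms pick the same CPU here, while the cross-multiplied form also avoids the truncation of the integer division:

#include <stdio.h>

#define SCHED_POWER_SCALE 1024UL

int main(void)
{
	unsigned long wl[] = { 3000, 2900, 3100 };	/* weighted cpu loads */
	unsigned long power[] = { 1024, 900, 1100 };	/* cpu powers */
	unsigned long max_load = 0;
	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
	int old_pick = -1, new_pick = -1, i;

	for (i = 0; i < 3; i++) {
		/* old form: scale with a division, then compare */
		unsigned long scaled = wl[i] * SCHED_POWER_SCALE / power[i];

		if (scaled > max_load) {
			max_load = scaled;
			old_pick = i;
		}

		/* new form: cross-multiply, no division needed */
		if (wl[i] * busiest_power > busiest_load * power[i]) {
			busiest_load = wl[i];
			busiest_power = power[i];
			new_pick = i;
		}
	}
	printf("old form picks cpu %d, new form picks cpu %d\n", old_pick, new_pick);
	return 0;
}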
[PATCH] mm, page_alloc: optimize batch count in free_pcppages_bulk()
If we use a division operation, we can compute a batch count more closed to ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iteration. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 26ab229..7f145cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -634,53 +634,76 @@ static inline int free_pages_check(struct page *page) static void free_pcppages_bulk(struct zone *zone, int count, struct per_cpu_pages *pcp) { - int migratetype = 0; - int batch_free = 0; - int to_free = count; + struct list_head *list; + int batch_free; + int mt; + int nr_list; + bool all = false; + + if (pcp-count == count) + all = true; spin_lock(zone-lock); zone-all_unreclaimable = 0; zone-pages_scanned = 0; - while (to_free) { - struct page *page; - struct list_head *list; +redo: + /* Count non-empty list */ + nr_list = 0; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + list = pcp-lists[mt]; + if (!list_empty(list)) + nr_list++; + } - /* -* Remove pages from lists in a round-robin fashion. A -* batch_free count is maintained that is incremented when an -* empty list is encountered. This is so more pages are freed -* off fuller lists instead of spinning excessively around empty -* lists -*/ - do { - batch_free++; - if (++migratetype == MIGRATE_PCPTYPES) - migratetype = 0; - list = pcp-lists[migratetype]; - } while (list_empty(list)); + /* +* If there is only one non-empty list, free them all. +* Otherwise, remove pages from lists in a round-robin fashion. +* batch_free is set to remove at least one list. +*/ + if (all || nr_list == 1) + batch_free = count; + else if (count = nr_list) + batch_free = 1; + else + batch_free = count / nr_list; - /* This is the only non-empty list. Free them all. */ - if (batch_free == MIGRATE_PCPTYPES) - batch_free = to_free; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + struct page *page; + int i, page_mt; - do { - int mt; /* migratetype of the to-be-freed page */ + list = pcp-lists[mt]; + for (i = 0; i batch_free; i++) { + if (list_empty(list)) + break; + + count--; page = list_entry(list-prev, struct page, lru); + /* must delete as __free_one_page list manipulates */ list_del(page-lru); - mt = get_freepage_migratetype(page); + page_mt = get_freepage_migratetype(page); /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ - __free_one_page(page, zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); - if (likely(!is_migrate_isolate_page(page))) { - __mod_zone_page_state(zone, NR_FREE_PAGES, 1); - if (is_migrate_cma(mt)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1); - } - } while (--to_free --batch_free !list_empty(list)); + __free_one_page(page, zone, 0, page_mt); + trace_mm_page_pcpu_drain(page, 0, page_mt); + + if (unlikely(is_migrate_isolate_page(page))) + continue; + + __mod_zone_page_state(zone, NR_FREE_PAGES, 1); + if (is_migrate_cma(page_mt)) + __mod_zone_page_state(zone, + NR_FREE_CMA_PAGES, 1); + } + + if (!count) + break; } + + if (count) + goto redo; + spin_unlock(zone-lock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
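As a worked example of the batch computation, here is a tiny user-space sketch (the list lengths are made up; MIGRATE_PCPTYPES is 3 on the kernels discussed here). With batch_free = count / nr_list, the whole request is satisfied within a few passes over the lists:

#include <stdio.h>

#define MIGRATE_PCPTYPES 3

int main(void)
{
	int list_len[MIGRATE_PCPTYPES] = { 12, 1, 18 };	/* pages on each pcp list */
	int count = 20;					/* pages we want to free */
	int passes = 0;

	while (count) {
		int mt, nr_list = 0, batch_free;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (list_len[mt])
				nr_list++;

		if (nr_list == 1)
			batch_free = count;		/* only one non-empty list: free them all */
		else if (count <= nr_list)
			batch_free = 1;
		else
			batch_free = count / nr_list;	/* one division per pass */

		for (mt = 0; mt < MIGRATE_PCPTYPES && count; mt++) {
			int take = batch_free;

			if (take > list_len[mt])
				take = list_len[mt];
			if (take > count)
				take = count;
			list_len[mt] -= take;
			count -= take;
		}
		passes++;
	}
	printf("freed everything in %d pass(es)\n", passes);
	return 0;
}

The existing round-robin scheme frees only a few pages from each list per round, so it typically needs more iterations to drain the same number of pages.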
[PATCH 4/4] mm, page_alloc: optimize batch count in free_pcppages_bulk()
If we use a division operation, we can compute a batch count more closed to ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iteration. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 26ab229..7f145cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -634,53 +634,76 @@ static inline int free_pages_check(struct page *page) static void free_pcppages_bulk(struct zone *zone, int count, struct per_cpu_pages *pcp) { - int migratetype = 0; - int batch_free = 0; - int to_free = count; + struct list_head *list; + int batch_free; + int mt; + int nr_list; + bool all = false; + + if (pcp-count == count) + all = true; spin_lock(zone-lock); zone-all_unreclaimable = 0; zone-pages_scanned = 0; - while (to_free) { - struct page *page; - struct list_head *list; +redo: + /* Count non-empty list */ + nr_list = 0; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + list = pcp-lists[mt]; + if (!list_empty(list)) + nr_list++; + } - /* -* Remove pages from lists in a round-robin fashion. A -* batch_free count is maintained that is incremented when an -* empty list is encountered. This is so more pages are freed -* off fuller lists instead of spinning excessively around empty -* lists -*/ - do { - batch_free++; - if (++migratetype == MIGRATE_PCPTYPES) - migratetype = 0; - list = pcp-lists[migratetype]; - } while (list_empty(list)); + /* +* If there is only one non-empty list, free them all. +* Otherwise, remove pages from lists in a round-robin fashion. +* batch_free is set to remove at least one list. +*/ + if (all || nr_list == 1) + batch_free = count; + else if (count = nr_list) + batch_free = 1; + else + batch_free = count / nr_list; - /* This is the only non-empty list. Free them all. */ - if (batch_free == MIGRATE_PCPTYPES) - batch_free = to_free; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + struct page *page; + int i, page_mt; - do { - int mt; /* migratetype of the to-be-freed page */ + list = pcp-lists[mt]; + for (i = 0; i batch_free; i++) { + if (list_empty(list)) + break; + + count--; page = list_entry(list-prev, struct page, lru); + /* must delete as __free_one_page list manipulates */ list_del(page-lru); - mt = get_freepage_migratetype(page); + page_mt = get_freepage_migratetype(page); /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ - __free_one_page(page, zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); - if (likely(!is_migrate_isolate_page(page))) { - __mod_zone_page_state(zone, NR_FREE_PAGES, 1); - if (is_migrate_cma(mt)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1); - } - } while (--to_free --batch_free !list_empty(list)); + __free_one_page(page, zone, 0, page_mt); + trace_mm_page_pcpu_drain(page, 0, page_mt); + + if (unlikely(is_migrate_isolate_page(page))) + continue; + + __mod_zone_page_state(zone, NR_FREE_PAGES, 1); + if (is_migrate_cma(page_mt)) + __mod_zone_page_state(zone, + NR_FREE_CMA_PAGES, 1); + } + + if (!count) + break; } + + if (count) + goto redo; + spin_unlock(zone-lock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] mm, rmap: minimize lock hold when unlink_anon_vmas
Currently, we free the avc objects with holding a lock. To minimize lock hold time, we just move the avc objects to another list with holding a lock. Then, iterate them and free objects without holding a lock. This makes lock hold time minimized. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index 1603f64..9cfb282 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -330,6 +330,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) { struct anon_vma_chain *avc, *next; struct anon_vma *root = NULL; + LIST_HEAD(avc_list); /* * Unlink each anon_vma chained to the VMA. This list is ordered @@ -348,10 +349,14 @@ void unlink_anon_vmas(struct vm_area_struct *vma) if (RB_EMPTY_ROOT(anon_vma-rb_root)) continue; + list_move(avc-same_vma, avc_list); + } + unlock_anon_vma_root(root); + + list_for_each_entry_safe(avc, next, avc_list, same_vma) { list_del(avc-same_vma); anon_vma_chain_free(avc); } - unlock_anon_vma_root(root); /* * Iterate the list once more, it now only contains empty and unlinked @@ -363,7 +368,6 @@ void unlink_anon_vmas(struct vm_area_struct *vma) put_anon_vma(anon_vma); - list_del(avc-same_vma); anon_vma_chain_free(avc); } } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] mm, rmap: do easy-job first in anon_vma_fork
If we fail due to some errorous situation, it is better to quit without doing heavy work. So changing order of execution. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index a149e3a..c2f51cb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -278,19 +278,19 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) if (!pvma-anon_vma) return 0; + /* First, allocate required objects */ + avc = anon_vma_chain_alloc(GFP_KERNEL); + if (!avc) + goto out_error; + anon_vma = anon_vma_alloc(); + if (!anon_vma) + goto out_error_free_avc; + /* -* First, attach the new VMA to the parent VMA's anon_vmas, +* Then attach the new VMA to the parent VMA's anon_vmas, * so rmap can find non-COWed pages in child processes. */ if (anon_vma_clone(vma, pvma)) - return -ENOMEM; - - /* Then add our own anon_vma. */ - anon_vma = anon_vma_alloc(); - if (!anon_vma) - goto out_error; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) goto out_error_free_anon_vma; /* @@ -312,10 +312,11 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) return 0; - out_error_free_anon_vma: +out_error_free_anon_vma: put_anon_vma(anon_vma); - out_error: - unlink_anon_vmas(vma); +out_error_free_avc: + anon_vma_chain_free(avc); +out_error: return -ENOMEM; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] mm, rmap: allocate anon_vma_chain before starting to link anon_vma_chain
If we allocate anon_vma_chain before starting to link, we can reduce the lock hold time. This patch implement it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index c2f51cb..1603f64 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -240,18 +240,21 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; struct anon_vma *root = NULL; + LIST_HEAD(avc_list); + + list_for_each_entry(pavc, src-anon_vma_chain, same_vma) { + avc = anon_vma_chain_alloc(GFP_KERNEL); + if (unlikely(!avc)) + goto enomem_failure; + + list_add_tail(avc-same_vma, avc_list); + } list_for_each_entry_reverse(pavc, src-anon_vma_chain, same_vma) { struct anon_vma *anon_vma; - avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN); - if (unlikely(!avc)) { - unlock_anon_vma_root(root); - root = NULL; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) - goto enomem_failure; - } + avc = list_entry((avc_list)-next, typeof(*avc), same_vma); + list_del(avc-same_vma); anon_vma = pavc-anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); @@ -259,8 +262,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) unlock_anon_vma_root(root); return 0; - enomem_failure: - unlink_anon_vmas(dst); +enomem_failure: + list_for_each_entry_safe(avc, pavc, avc_list, same_vma) { + list_del(avc-same_vma); + anon_vma_chain_free(avc); + } return -ENOMEM; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
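The underlying pattern, allocate every chain object up front and keep only the linking inside the locked region, is a general one. A small pthread-based sketch of the same idea (the names are illustrative; this is not the kernel code):

/* build with: gcc -O2 -pthread prealloc.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	int val;
	struct node *next;
};

static struct node *shared_list;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static int add_many(int n)
{
	struct node *batch = NULL, *cur;
	int i;

	/* allocate everything before taking the lock, so the critical
	 * section below only does cheap pointer manipulation */
	for (i = 0; i < n; i++) {
		cur = malloc(sizeof(*cur));
		if (!cur)
			goto enomem;	/* nothing published yet, unwinding is easy */
		cur->val = i;
		cur->next = batch;
		batch = cur;
	}

	pthread_mutex_lock(&shared_lock);
	while (batch) {
		cur = batch;
		batch = batch->next;
		cur->next = shared_list;
		shared_list = cur;	/* only the linking happens under the lock */
	}
	pthread_mutex_unlock(&shared_lock);
	return 0;

enomem:
	while (batch) {
		cur = batch;
		batch = batch->next;
		free(cur);
	}
	return -1;
}

int main(void)
{
	if (add_many(5) == 0)
		printf("added 5 nodes with one short critical section\n");
	return 0;
}

The error path also becomes simpler: as in the patch, a failed allocation only has to free the not-yet-linked objects instead of unwinding partially linked state.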
Re: [PATCH 4/4] mm, page_alloc: optimize batch count in free_pcppages_bulk()
On Tue, Aug 06, 2013 at 05:43:40PM +0900, Joonsoo Kim wrote: If we use a division operation, we can compute a batch count closer to the ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iterations. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com Oops... Sorry. Please ignore this one. This patch was already submitted a few seconds ago :) Thanks.
Re: [PATCH 1/4] mm, rmap: do easy-job first in anon_vma_fork
Hello, Johannes. On Tue, Aug 06, 2013 at 08:58:54AM -0400, Johannes Weiner wrote: if (anon_vma_clone(vma, pvma)) - return -ENOMEM; - - /* Then add our own anon_vma. */ - anon_vma = anon_vma_alloc(); - if (!anon_vma) - goto out_error; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) goto out_error_free_anon_vma; Which heavy work? anon_vma_clone() is anon_vma_chain_alloc() in a loop. Optimizing error paths only makes sense if they are common and you actually could save something by reordering. This matches neither. Yes, you are right. I drop this one. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] mm, rmap: allocate anon_vma_chain before starting to link anon_vma_chain
On Wed, Aug 07, 2013 at 02:08:03AM -0400, Johannes Weiner wrote: list_for_each_entry_reverse(pavc, src-anon_vma_chain, same_vma) { struct anon_vma *anon_vma; - avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN); - if (unlikely(!avc)) { - unlock_anon_vma_root(root); - root = NULL; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) - goto enomem_failure; - } + avc = list_entry((avc_list)-next, typeof(*avc), same_vma); list_first_entry() please Okay. I will send next version soon. + list_del(avc-same_vma); anon_vma = pavc-anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); @@ -259,8 +262,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) unlock_anon_vma_root(root); return 0; - enomem_failure: - unlink_anon_vmas(dst); +enomem_failure: + list_for_each_entry_safe(avc, pavc, avc_list, same_vma) { + list_del(avc-same_vma); + anon_vma_chain_free(avc); + } return -ENOMEM; } Otherwise, looks good. Thank you! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] mm, rmap: minimize lock hold when unlink_anon_vmas
On Wed, Aug 07, 2013 at 02:11:38AM -0400, Johannes Weiner wrote: On Tue, Aug 06, 2013 at 05:43:39PM +0900, Joonsoo Kim wrote: Currently, we free the avc objects with holding a lock. To minimize lock hold time, we just move the avc objects to another list with holding a lock. Then, iterate them and free objects without holding a lock. This makes lock hold time minimized. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com I expect kfree() to be fast enough that we don't think we need to bother batching this outside the lock. Okay! Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
On Tue, Aug 06, 2013 at 06:38:49PM -0700, Davidlohr Bueso wrote: On Wed, 2013-08-07 at 11:03 +1000, David Gibson wrote: On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote: On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote: Any mapping that doesn't use the reserved pool, not just MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping, then fork()s, and then the mapping is instantiated in the child, that will not draw from the reserved pool. Should we guarantee that they can allocate the last hugepage? They map a region with MAP_NORESERVE, so they shouldn't assume that their requests always succeed. If the pages are available, people get cranky if it fails for no apparent reason, MAP_NORESERVE or not. They get especially cranky if it sometimes fails and sometimes doesn't due to a race condition. Hello, Hmm... Okay. I will try to implement another way to protect against the race condition. Maybe it is best to use a mutex table :) Anyway, please give me some time, guys. Thank you very much for pointing that out. So another option is to take the mutex table patchset for now as it *does* improve things a great deal, then, when ready, get rid of the instantiation lock all together. We still don't have a solid proposal for doing that. Joonsoo Kim's patchset misses cases (non reserved mappings). I'm also not certain there aren't a few edge cases which can lead to even reserved mappings failing, and if that happens the patches will lead to livelock. Exactly, which is why I suggest minimizing the lock contention until we do have such a proposal. Okay. My proposal is not complete and maybe much time is needed. And I'm not sure that my *retry* approach can eventually cover all the race situations at the moment. If you are in a hurry, I don't have a strong objection to your patches, but, IMHO, we should go slowly, because it is not just a trivial change. Hugetlb code is too subtle, so it is hard to confirm its solidness. The following is the race problem I found with those patches. I assume that nr_free_hugepage is 2. 1. parent process maps a 1-hugepage-sized region with MAP_PRIVATE 2. parent process writes something to this region, so a fault occurs. 3. fault handling. 4. fork 5. parent process writes something to this hugepage, so a cow-fault occurs. 6. while the parent allocates a new page and does copy_user_huge_page() in the fault handler, the child process writes something to this hugepage, so a cow-fault occurs. This access is not protected by the table mutex, because the mm is different. 7. child process dies, because there is no free hugepage. If we had no race, the child process would not die, because all we need is only 2 hugepages, one for the parent, and the other for the child. Thanks.
Re: [PATCH 0/2] hugepage: optimize page fault path locking
On Tue, Aug 06, 2013 at 05:08:04PM -0700, Davidlohr Bueso wrote: On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote: On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote: This patchset attempts to reduce the amount of contention we impose on the hugetlb_instantiation_mutex by replacing the global mutex with a table of mutexes, selected based on a hash. The original discussion can be found here: http://lkml.org/lkml/2013/7/12/428 Hello, Davidlohr. I recently sent a patchset which removes the hugetlb_instantiation_mutex entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex'). This patchset can be found here: https://lkml.org/lkml/2013/7/29/54 If possible, could you review it and test whether your problem disappears with it or not? This patchset applies on top of https://lkml.org/lkml/2013/7/22/96 [PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix, right? AFAIK those changes are the ones Andrew picked up a few weeks ago and are now in linux-next, right? I was able to apply those just fine, but couldn't apply your 'remove a hugetlb_instantiation_mutex' series (IIRC patch 1/18 failed). I guess you'll send out a v2 anyway, so I'll wait until then. In any case I'm not seeing an actual performance issue with the hugetlb_instantiation_mutex; all I noticed was that under large DB workloads that make use of hugepages, such as Oracle, this lock becomes quite hot during the first few minutes of startup, which makes sense since it is contended in the fault path. So I'll try out your patches, but, in this particular case, I just cannot compare the with-lock vs without-lock situations. Okay. I just want to know whether lock contention is reduced by my patches in the first few minutes of startup. I will send v2 soon. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
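For readers following along, the "table of mutexes, selected based on a hash" idea mentioned above looks roughly like the sketch below. This is an illustration under assumptions, not the actual patchset: the table size, the jhash2-based hash over the mapping pointer and page index, and the helper names are all invented here.

#include <linux/init.h>
#include <linux/jhash.h>
#include <linux/mutex.h>
#include <linux/pagemap.h>

#define NR_FAULT_MUTEXES 64			/* illustrative table size */

static struct mutex fault_mutex_table[NR_FAULT_MUTEXES];

static void __init fault_mutex_table_init(void)
{
	int i;

	for (i = 0; i < NR_FAULT_MUTEXES; i++)
		mutex_init(&fault_mutex_table[i]);
}

/* Faults on the same page hash to the same mutex; others can proceed. */
static u32 fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
{
	unsigned long key[2] = { (unsigned long)mapping, idx };

	return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) %
		NR_FAULT_MUTEXES;
}

static int handle_fault_serialized(struct address_space *mapping, pgoff_t idx)
{
	u32 hash = fault_mutex_hash(mapping, idx);
	int ret = 0;

	mutex_lock(&fault_mutex_table[hash]);
	/* ... do the actual hugetlb fault handling here ... */
	mutex_unlock(&fault_mutex_table[hash]);

	return ret;
}

The point of the design is that the serialization that prevents spurious allocation failures is kept, but only for faults that actually race on the same page of the same mapping.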
Re: [PATCH] mm, page_alloc: optimize batch count in free_pcppages_bulk()
Hello, Andrew. 2013/8/7 Andrew Morton a...@linux-foundation.org: On Tue, 6 Aug 2013 17:40:40 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If we use a division operation, we can compute a batch count closer to the ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iterations. In addition, batching to free more pages may help cache usage. hm, maybe. The .text got 120 bytes larger so the code now will eject two of someone else's cachelines, which can't be good. I need more convincing, please ;) (bss got larger too - I don't have a clue why this happens). In my testing, it makes .text just 64 bytes larger. I don't think I can avoid such a small size increase. The current round-robin freeing algorithm accesses 'struct page' in a fairly random order, because it changes the migrate type and list on every iteration, and pages on different lists may be far from each other. If we free in larger batches, we are more likely to access adjacent 'struct page's than before, so I think this is cache-friendly. But this is just a theoretical argument, so I'm not sure whether it is useful or not :) Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
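The batch-count idea being debated is roughly the following (an illustrative sketch only, not the actual free_pcppages_bulk() patch; next_batch() is a made-up helper): instead of freeing a small fixed batch from each migratetype list per round-robin pass, divide the remaining count by the number of lists so the whole job finishes within MIGRATE_PCPTYPES passes.

/* Number of per-cpu page list types in that era's mm code. */
#define MIGRATE_PCPTYPES	3

/*
 * Pick how many pages to free from the current list this pass.
 * Rounding up guarantees that after MIGRATE_PCPTYPES passes the
 * whole 'to_free' count has been handed back to the buddy allocator.
 */
static int next_batch(int to_free, int lists_left)
{
	return (to_free + lists_left - 1) / lists_left;
}

A caller would decrement to_free by the number of pages actually freed from the current list and move to the next migratetype until to_free reaches zero, which is where the larger, more cache-friendly batches come from.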
[PATCH v3 8/9] mm, hugetlb: remove decrement_hugepage_resv_vma()
Now the check conditions of decrement_hugepage_resv_vma() and vma_has_reserves() are the same, so we can clean up this function using vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only one call site, so we can remove the function and embed it into dequeue_huge_page_vma() directly. This patch implements that. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ca15854..4b1b043 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -434,25 +434,6 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
 	return (get_vma_private_data(vma) & flag) != 0;
 }
 
-/* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct hstate *h,
-			struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & VM_NORESERVE)
-		return;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		/* Shared mappings always use reserves */
-		h->resv_huge_pages--;
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		/*
-		 * Only the process that called mmap() has reserves for
-		 * private mappings.
-		 */
-		h->resv_huge_pages--;
-	}
-}
-
 /* Reset counters to 0 and clear all HPAGE_RESV_* flags */
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
@@ -466,10 +447,18 @@ static int vma_has_reserves(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_NORESERVE)
 		return 0;
+
+	/* Shared mappings always use reserves */
 	if (vma->vm_flags & VM_MAYSHARE)
 		return 1;
+
+	/*
+	 * Only the process that called mmap() has reserves for
+	 * private mappings.
+	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return 1;
+
 	return 0;
 }
 
@@ -564,8 +553,8 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
-				if (!avoid_reserve)
-					decrement_hugepage_resv_vma(h, vma);
+				if (!avoid_reserve && vma_has_reserves(vma))
+					h->resv_huge_pages--;
 				break;
 			}
 		}
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/9] mm, hugetlb: move up the code which check availability of free huge page
At this point we are holding hugetlb_lock, so the hstate values can't change. If we don't have any usable free huge page at this point, there is no need to proceed with the processing, so move this check up. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e2bfbf7..fc4988c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -539,10 +539,6 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	struct zoneref *z;
 	unsigned int cpuset_mems_cookie;
 
-retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
-	zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask, &mpol, &nodemask);
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
@@ -556,6 +552,11 @@ retry_cpuset:
 	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+	zonelist = huge_zonelist(vma, address,
+					htlb_alloc_mask, &mpol, &nodemask);
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
@@ -574,7 +575,6 @@ retry_cpuset:
 	return page;
 
 err:
-	mpol_cond_put(mpol);
 	return NULL;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 9/9] mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache
If a vma with VM_NORESERVE allocate a new page for page cache, we should check whether this area is reserved or not. If this address is already reserved by other process(in case of chg == 0), we should decrement reserve count, because this allocated page will go into page cache and currently, there is no way to know that this page comes from reserved pool or not when releasing inode. This may introduce over-counting problem to reserved count. With following example code, you can easily reproduce this situation. Assume 2MB, nr_hugepages = 100 size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } flag = MAP_SHARED | MAP_NORESERVE; q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (q == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } q[0] = 'c'; After finish the program, run 'cat /proc/meminfo'. You can see below result. HugePages_Free: 100 HugePages_Rsvd:1 To fix this, we should check our mapping type and tracked region. If our mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this imply that current allocated page will go into page cache which is already reserved region when mapping is created. In this case, we should decrease reserve count. As implementing above, this patch solve the problem. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4b1b043..b3b8252 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -443,10 +443,23 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma) } /* Returns true if the VMA has associated reserve pages */ -static int vma_has_reserves(struct vm_area_struct *vma) +static int vma_has_reserves(struct vm_area_struct *vma, long chg) { - if (vma-vm_flags VM_NORESERVE) - return 0; + if (vma-vm_flags VM_NORESERVE) { + /* +* This address is already reserved by other process(chg == 0), +* so, we should decreament reserved count. Without +* decreamenting, reserve count is remained after releasing +* inode, because this allocated page will go into page cache +* and is regarded as coming from reserved pool in releasing +* step. Currently, we don't have any other solution to deal +* with this situation properly, so add work-around here. +*/ + if (vma-vm_flags VM_MAYSHARE chg == 0) + return 1; + else + return 0; + } /* Shared mappings always use reserves */ if (vma-vm_flags VM_MAYSHARE) @@ -520,7 +533,8 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid) static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, - unsigned long address, int avoid_reserve) + unsigned long address, int avoid_reserve, + long chg) { struct page *page = NULL; struct mempolicy *mpol; @@ -535,7 +549,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, * have no page reserves. This check ensures that reservations are * not stolen. 
The child may still get SIGKILLed */ - if (!vma_has_reserves(vma) + if (!vma_has_reserves(vma, chg) h-free_huge_pages - h-resv_huge_pages == 0) goto err; @@ -553,8 +567,12 @@ retry_cpuset: if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) { page = dequeue_huge_page_node(h, zone_to_nid(zone)); if (page) { - if (!avoid_reserve vma_has_reserves(vma)) - h-resv_huge_pages--; + if (avoid_reserve) + break; + if (!vma_has_reserves(vma, chg)) + break; + + h-resv_huge_pages--; break; } } @@ -1138,7 +1156,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, return ERR_PTR(-ENOSPC); } spin_lock(hugetlb_lock); - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg); if (!page) { spin_unlock(hugetlb_lock); page = alloc_buddy_huge_page(h
[PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization
Currently, we use a page with mapped count 1 in page cache for cow optimization. If we find this condition, we don't allocate a new page and copy contents. Instead, we map this page directly. This may introduce a problem that writting to private mapping overwrite hugetlb file directly. You can find this situation with following code. size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } p[0] = 's'; fprintf(stdout, BEFORE STEAL PRIVATE WRITE: %c\n, p[0]); munmap(p, size); flag = MAP_PRIVATE; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } p[0] = 'c'; munmap(p, size); flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } fprintf(stdout, AFTER STEAL PRIVATE WRITE: %c\n, p[0]); munmap(p, size); We can see that AFTER STEAL PRIVATE WRITE: c, not AFTER STEAL PRIVATE WRITE: s. If we turn off this optimization to a page in page cache, the problem is disappeared. So, I change the trigger condition of optimization. If this page is not AnonPage, we don't do optimization. This makes this optimization turning off for a page cache. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2e52afea..1f6b3a6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2511,7 +2511,6 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, { struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; - int avoidcopy; int outside_reserve = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2521,10 +2520,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, retry_avoidcopy: /* If no-one else is actually using this page, avoid the copy * and just make the page writable */ - avoidcopy = (page_mapcount(old_page) == 1); - if (avoidcopy) { - if (PageAnon(old_page)) - page_move_anon_rmap(old_page, vma, address); + if (page_mapcount(old_page) == 1 PageAnon(old_page)) { + page_move_anon_rmap(old_page, vma, address); set_huge_ptep_writable(vma, address, ptep); return 0; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 7/9] mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip to check reserve counting and eventually we cannot be ensured to allocate a huge page in fault time. With following example code, you can easily find this situation. Assume 2MB, nr_hugepages = 100 fd = hugetlbfs_unlinked_fd(); if (fd 0) return 1; size = 200 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } size = 2 * MB; flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } p[0] = '0'; sleep(10); During executing sleep(10), run 'cat /proc/meminfo' on another process. HugePages_Free: 99 HugePages_Rsvd: 100 Number of free should be higher or equal than number of reserve, but this aren't. This represent that non reserved shared mapping steal a reserved page. Non reserved shared mapping should not eat into reserve space. If we consider VM_NORESERVE in vma_has_reserve() and return 0 which mean that we don't have reserved pages, then we check that we have enough free pages in dequeue_huge_page_vma(). This prevent to steal a reserved page. With this change, above test generate a SIGBUG which is correct, because all free pages are reserved and non reserved shared mapping can't get a free page. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1f6b3a6..ca15854 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -464,6 +464,8 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma) /* Returns true if the VMA has associated reserve pages */ static int vma_has_reserves(struct vm_area_struct *vma) { + if (vma-vm_flags VM_NORESERVE) + return 0; if (vma-vm_flags VM_MAYSHARE) return 1; if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/9] mm, hugetlb: clean-up and possible bug fix
The first 5 patches are almost trivial clean-up patches. The others fix three bugs. These problems are perhaps minor, because this code has been used for a long time and there are no bug reports for them. These patches are based on v3.10.0 and pass the libhugetlbfs test suite.

Changes from v2:
- No code change in any patch from v2.
- Omit patch 2 ('clean-up alloc_huge_page()') which tried to remove one goto label.
- Add more comments to the changelog, per Michal's suggestion.
- Add reviewed-by or acked-by tags.

Changes from v1:
- Split patch 1 into two patches to clarify its purpose.
- Remove useless indentation changes in 'clean-up alloc_huge_page()'.
- Fix a bug in the new iteration code.
- Add reviewed-by or acked-by tags.

Joonsoo Kim (9):
  mm, hugetlb: move up the code which check availability of free huge page
  mm, hugetlb: trivial commenting fix
  mm, hugetlb: clean-up alloc_huge_page()
  mm, hugetlb: fix and clean-up node iteration code to alloc or free
  mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()
  mm, hugetlb: do not use a page in page cache for cow optimization
  mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
  mm, hugetlb: remove decrement_hugepage_resv_vma()
  mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

 mm/hugetlb.c | 243 ++
 1 file changed, 110 insertions(+), 133 deletions(-)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 4/9] mm, hugetlb: fix and clean-up node iteration code to alloc or free
Current node iteration code have a minor problem which do one more node rotation if we can't succeed to allocate. For example, if we start to allocate at node 0, we stop to iterate at node 0. Then we start to allocate at node 1 for next allocation. I introduce new macros for_each_node_mask_to_[alloc|free] and fix and clean-up node iteration code to alloc or free. This makes code more understandable. Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 31d78c5..87d7637 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -755,33 +755,6 @@ static int hstate_next_node_to_alloc(struct hstate *h, return nid; } -static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) -{ - struct page *page; - int start_nid; - int next_nid; - int ret = 0; - - start_nid = hstate_next_node_to_alloc(h, nodes_allowed); - next_nid = start_nid; - - do { - page = alloc_fresh_huge_page_node(h, next_nid); - if (page) { - ret = 1; - break; - } - next_nid = hstate_next_node_to_alloc(h, nodes_allowed); - } while (next_nid != start_nid); - - if (ret) - count_vm_event(HTLB_BUDDY_PGALLOC); - else - count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); - - return ret; -} - /* * helper for free_pool_huge_page() - return the previously saved * node [this node] from which to free a huge page. Advance the @@ -800,6 +773,40 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) return nid; } +#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \ + for (nr_nodes = nodes_weight(*mask);\ + nr_nodes 0 \ + ((node = hstate_next_node_to_alloc(hs, mask)) || 1);\ + nr_nodes--) + +#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \ + for (nr_nodes = nodes_weight(*mask);\ + nr_nodes 0 \ + ((node = hstate_next_node_to_free(hs, mask)) || 1); \ + nr_nodes--) + +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) +{ + struct page *page; + int nr_nodes, node; + int ret = 0; + + for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { + page = alloc_fresh_huge_page_node(h, node); + if (page) { + ret = 1; + break; + } + } + + if (ret) + count_vm_event(HTLB_BUDDY_PGALLOC); + else + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); + + return ret; +} + /* * Free huge page from pool from next node to free. * Attempt to keep persistent huge pages more or less @@ -809,36 +816,31 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, bool acct_surplus) { - int start_nid; - int next_nid; + int nr_nodes, node; int ret = 0; - start_nid = hstate_next_node_to_free(h, nodes_allowed); - next_nid = start_nid; - - do { + for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) { /* * If we're returning unused surplus pages, only examine * nodes with surplus pages. 
*/ - if ((!acct_surplus || h-surplus_huge_pages_node[next_nid]) - !list_empty(h-hugepage_freelists[next_nid])) { + if ((!acct_surplus || h-surplus_huge_pages_node[node]) + !list_empty(h-hugepage_freelists[node])) { struct page *page = - list_entry(h-hugepage_freelists[next_nid].next, + list_entry(h-hugepage_freelists[node].next, struct page, lru); list_del(page-lru); h-free_huge_pages--; - h-free_huge_pages_node[next_nid]--; + h-free_huge_pages_node[node]--; if (acct_surplus) { h-surplus_huge_pages--; - h-surplus_huge_pages_node[next_nid]--; + h-surplus_huge_pages_node[node]--; } update_and_free_page(h, page); ret = 1; break; } - next_nid = hstate_next_node_to_free(h, nodes_allowed); - } while
[PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock
There is a race condition if we map a same file on different processes. Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex. When we do mmap, we don't grab a hugetlb_instantiation_mutex, but, grab a mmap_sem. This doesn't prevent other process to modify region structure, so it can be modified by two processes concurrently. To solve this, I introduce a lock to resv_map and make region manipulation function grab a lock before they do actual work. This makes region tracking safe. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 2677c07..e29e28f 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -26,6 +26,7 @@ struct hugepage_subpool { struct resv_map { struct kref refs; + spinlock_t lock; struct list_head regions; }; extern struct resv_map *resv_map_alloc(void); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 24c0111..bf2ee11 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) * Region tracking -- allows tracking of reservations and instantiated pages *across the pages in a mapping. * - * The region data structures are protected by a combination of the mmap_sem - * and the hugetlb_instantiation_mutex. To access or modify a region the caller - * must either hold the mmap_sem for write, or the mmap_sem for read and - * the hugetlb_instantiation_mutex: - * - * down_write(mm-mmap_sem); - * or - * down_read(mm-mmap_sem); - * mutex_lock(hugetlb_instantiation_mutex); + * The region data structures are embedded into a resv_map and + * protected by a resv_map's lock */ struct file_region { struct list_head link; @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t) struct list_head *head = resv-regions; struct file_region *rg, *nrg, *trg; + spin_lock(resv-lock); /* Locate the region we are either in or before. */ list_for_each_entry(rg, head, link) if (f = rg-to) @@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, long t) } nrg-from = f; nrg-to = t; + spin_unlock(resv-lock); return 0; } @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t) struct file_region *rg, *nrg; long chg = 0; + spin_lock(resv-lock); /* Locate the region we are before or in. */ list_for_each_entry(rg, head, link) if (f = rg-to) @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t) * size such that we can guarantee to record the reservation. */ if (rg-link == head || t rg-from) { nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); - if (!nrg) - return -ENOMEM; + if (!nrg) { + chg = -ENOMEM; + goto out; + } + nrg-from = f; nrg-to = f; INIT_LIST_HEAD(nrg-link); list_add(nrg-link, rg-link.prev); - return t - f; + chg = t - f; + goto out; } /* Round our left edge to the current segment if it encloses us. */ @@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, long t) if (rg-link == head) break; if (rg-from t) - return chg; + goto out; /* We overlap with this area, if it extends further than * us then we must extend ourselves. Account for its @@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, long t) } chg -= rg-to - rg-from; } + +out: + spin_unlock(resv-lock); return chg; } @@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, long end) struct file_region *rg, *trg; long chg = 0; + spin_lock(resv-lock); /* Locate the region we are either in or before. 
*/ list_for_each_entry(rg, head, link) if (end = rg-to) break; if (rg-link == head) - return 0; + goto out; /* If we are in the middle of a region then adjust it. */ if (end rg-from) { @@ -265,6 +269,9 @@ static long region_truncate(struct resv_map *resv, long end) list_del(rg-link); kfree(rg); } + +out: + spin_unlock(resv-lock); return chg; } @@ -274,6 +281,7 @@ static long region_count(struct resv_map *resv, long f, long t) struct file_region *rg; long chg = 0; + spin_lock(resv-lock); /* Locate each segment we overlap
[PATCH 02/18] mm, hugetlb: change variable name reservations to resv
'reservations' is so long name as a variable and we use 'resv_map' to represent 'struct resv_map' in other place. To reduce confusion and unreadability, change it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d971233..12b6581 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h, } else { long err; pgoff_t idx = vma_hugecache_offset(h, vma, addr); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); - err = region_chg(reservations-regions, idx, idx + 1); + err = region_chg(resv-regions, idx, idx + 1); if (err 0) return err; return 0; @@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h, } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); /* Mark this page used in the map. */ - region_add(reservations-regions, idx, idx + 1); + region_add(resv-regions, idx, idx + 1); } } @@ -2168,7 +2168,7 @@ out: static void hugetlb_vm_op_open(struct vm_area_struct *vma) { - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); /* * This new VMA should share its siblings reservation map if present. @@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma) * after this open call completes. It is therefore safe to take a * new reference here without additional locking. */ - if (reservations) - kref_get(reservations-refs); + if (resv) + kref_get(resv-refs); } static void resv_map_put(struct vm_area_struct *vma) { - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); - if (!reservations) + if (!resv) return; - kref_put(reservations-refs, resv_map_release); + kref_put(resv-refs, resv_map_release); } static void hugetlb_vm_op_close(struct vm_area_struct *vma) { struct hstate *h = hstate_vma(vma); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); struct hugepage_subpool *spool = subpool_vma(vma); unsigned long reserve; unsigned long start; unsigned long end; - if (reservations) { + if (resv) { start = vma_hugecache_offset(h, vma, vma-vm_start); end = vma_hugecache_offset(h, vma, vma-vm_end); reserve = (end - start) - - region_count(reservations-regions, start, end); + region_count(resv-regions, start, end); resv_map_put(vma); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 3/9] mm, hugetlb: clean-up alloc_huge_page()
This patch unifies successful allocation paths to make the code more readable. There are no functional changes. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 51564a8..31d78c5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1149,12 +1149,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, } spin_lock(hugetlb_lock); page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); - if (page) { - /* update page cgroup details */ - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), -h_cg, page); - spin_unlock(hugetlb_lock); - } else { + if (!page) { spin_unlock(hugetlb_lock); page = alloc_buddy_huge_page(h, NUMA_NO_NODE); if (!page) { @@ -1165,11 +1160,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, return ERR_PTR(-ENOSPC); } spin_lock(hugetlb_lock); - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), -h_cg, page); list_move(page-lru, h-hugepage_activelist); - spin_unlock(hugetlb_lock); + /* Fall through */ } + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); + spin_unlock(hugetlb_lock); set_page_private(page, (unsigned long)spool); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 5/9] mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()
If the list is empty, list_for_each_entry_safe() doesn't do anything, so this check is redundant. Remove it. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87d7637..2e52afea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1020,11 +1020,8 @@ free:
 	spin_unlock(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
-	if (!list_empty(&surplus_list)) {
-		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-			put_page(page);
-		}
-	}
+	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
+		put_page(page);
 	spin_lock(&hugetlb_lock);
 
 	return ret;
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
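As a side note on why the check was redundant: list_for_each_entry_safe() expands to a for loop whose condition fails immediately on an empty list, so the body simply never runs. A minimal sketch (illustrative; drop_all() is a made-up name):

#include <linux/list.h>
#include <linux/mm.h>

static void drop_all(struct list_head *head)
{
	struct page *page, *tmp;

	/* On an empty list the loop condition is false on the first
	 * check, so no separate list_empty() test is needed. */
	list_for_each_entry_safe(page, tmp, head, lru)
		put_page(page);
}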
[PATCH 00/18] mm, hugetlb: remove a hugetlb_instantiation_mutex
Without the hugetlb_instantiation_mutex, if parallel faults occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault at the same address. This causes a reserved pool shortage just for a little while, and a faulting thread that is ensured to have enough reserved hugepages can get a SIGBUS signal. To solve this problem we already have a nice solution, namely the hugetlb_instantiation_mutex, which blocks other threads from diving into the fault handler. It solves the problem clearly, but it introduces performance degradation, because it serializes all fault handling. Now, I try to remove the hugetlb_instantiation_mutex to get rid of the performance problem reported by Davidlohr Bueso [1]. It is implemented in the following 3 steps.

Step 1. Protect region tracking via a per-region spinlock. Currently, region tracking is protected by the hugetlb_instantiation_mutex, so before removing the mutex, we should replace it with another solution.

Step 2. Decide whether we use the reserved page pool or not in a uniform way. We need graceful failure handling if there is no lock like the hugetlb_instantiation_mutex. To decide whether we need to handle a failure or not, we need to know the current status properly.

Step 3. Graceful failure handling if we fail with a reserved page or fail to allocate with use_reserve. Failure handling consists of two cases. One: if we fail while holding a reserved page, return it to the reserved pool properly; the current code doesn't recover the reserve count properly, so we need to fix it. The other: if we fail to allocate a new huge page with the use_reserve indicator set, return 0 to the fault handler instead of SIGBUS, which makes this thread retry fault handling.

With the above handling, we can handle a fault in any situation without the hugetlb_instantiation_mutex.

Patch 1: fix a minor problem
Patches 2-5: implement Step 1.
Patches 6-11: implement Step 2.
Patches 12-18: implement Step 3.

These patches are based on my previous patchset [2], which is based on v3.10. With these applied, I passed the libhugetlbfs test suite, which has allocation-instantiation race test cases. If there is something I should consider, please let me know! Thanks.

[1] http://lwn.net/Articles/558863/ [PATCH] mm/hugetlb: per-vma instantiation mutexes
[2] https://lkml.org/lkml/2013/7/22/96 [PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix

Joonsoo Kim (18):
  mm, hugetlb: protect reserved pages when softofflining requests the pages
  mm, hugetlb: change variable name reservations to resv
  mm, hugetlb: unify region structure handling
  mm, hugetlb: region manipulation functions take resv_map rather list_head
  mm, hugetlb: protect region tracking via newly introduced resv_map lock
  mm, hugetlb: remove vma_need_reservation()
  mm, hugetlb: pass has_reserve to dequeue_huge_page_vma()
  mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve
  mm, hugetlb: unify has_reserve and avoid_reserve to use_reserve
  mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()
  mm, hugetlb: move down outside_reserve check
  mm, hugetlb: remove a check for return value of alloc_huge_page()
  mm, hugetlb: grab a page_table_lock after page_cache_release
  mm, hugetlb: clean-up error handling in hugetlb_cow()
  mm, hugetlb: move up anon_vma_prepare()
  mm, hugetlb: return a reserved page to a reserved pool if failed
  mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
  mm, hugetlb: remove a hugetlb_instantiation_mutex

 fs/hugetlbfs/inode.c    |  12 +-
 include/linux/hugetlb.h |  10 ++
 mm/hugetlb.c            | 361 +--
 3 files changed, 217 insertions(+), 166 deletions(-)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/18] mm, hugetlb: region manipulation functions take resv_map rather list_head
To change a protection method for region tracking to find grained one, we pass the resv_map, instead of list_head, to region manipulation functions. This doesn't introduce any functional change, and it is just for preparing a next step. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 35f6b12..24c0111 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -150,8 +150,9 @@ struct file_region { long to; }; -static long region_add(struct list_head *head, long f, long t) +static long region_add(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg, *nrg, *trg; /* Locate the region we are either in or before. */ @@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t) return 0; } -static long region_chg(struct list_head *head, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg, *nrg; long chg = 0; @@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t) return chg; } -static long region_truncate(struct list_head *head, long end) +static long region_truncate(struct resv_map *resv, long end) { + struct list_head *head = resv-regions; struct file_region *rg, *trg; long chg = 0; @@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end) return chg; } -static long region_count(struct list_head *head, long f, long t) +static long region_count(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg; long chg = 0; @@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref) struct resv_map *resv_map = container_of(ref, struct resv_map, refs); /* Clear out any active regions before we release the map. */ - region_truncate(resv_map-regions, 0); + region_truncate(resv_map, 0); kfree(resv_map); } @@ -1083,7 +1087,7 @@ static long vma_needs_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = inode-i_mapping-private_data; - return region_chg(resv-regions, idx, idx + 1); + return region_chg(resv, idx, idx + 1); } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { return 1; @@ -1093,7 +1097,7 @@ static long vma_needs_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = vma_resv_map(vma); - err = region_chg(resv-regions, idx, idx + 1); + err = region_chg(resv, idx, idx + 1); if (err 0) return err; return 0; @@ -1109,14 +1113,14 @@ static void vma_commit_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = inode-i_mapping-private_data; - region_add(resv-regions, idx, idx + 1); + region_add(resv, idx, idx + 1); } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = vma_resv_map(vma); /* Mark this page used in the map. 
*/ - region_add(resv-regions, idx, idx + 1); + region_add(resv, idx, idx + 1); } } @@ -2203,7 +2207,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) end = vma_hugecache_offset(h, vma, vma-vm_end); reserve = (end - start) - - region_count(resv-regions, start, end); + region_count(resv, start, end); resv_map_put(vma); @@ -3078,7 +3082,7 @@ int hugetlb_reserve_pages(struct inode *inode, if (!vma || vma-vm_flags VM_MAYSHARE) { resv_map = inode-i_mapping-private_data; - chg = region_chg(resv_map-regions, from, to); + chg = region_chg(resv_map, from, to); } else { resv_map = resv_map_alloc(); @@ -3124,7 +3128,7 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma-vm_flags VM_MAYSHARE) - region_add(resv_map-regions, from, to); + region_add(resv_map, from, to); return 0; out_err: if (vma) @@ -3140,7 +3144,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) struct hugepage_subpool *spool = subpool_inode(inode); if (resv_map) - chg = region_truncate(resv_map-regions, offset); + chg = region_truncate(resv_map, offset); spin_lock(inode-i_lock
[PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed
If we fail with a reserved page, just calling put_page() is not sufficient, because put_page() invokes free_huge_page() as its last step, and free_huge_page() doesn't know whether the page came from the reserved pool or not, so it doesn't do anything related to the reserve count. This leaves the reserve count lower than it should be, because the reserve count was already decremented in dequeue_huge_page_vma(). This patch fixes this situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb8a45f..6a9ec69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
+static void put_huge_page(struct page *page, int use_reserve)
+{
+	struct hstate *h = page_hstate(page);
+	struct hugepage_subpool *spool =
+		(struct hugepage_subpool *)page_private(page);
+
+	if (!use_reserve) {
+		put_page(page);
+		return;
+	}
+
+	if (!put_page_testzero(page))
+		return;
+
+	set_page_private(page, 0);
+	page->mapping = NULL;
+	BUG_ON(page_count(page));
+	BUG_ON(page_mapcount(page));
+
+	spin_lock(&hugetlb_lock);
+	hugetlb_cgroup_uncharge_page(hstate_index(h),
+				pages_per_huge_page(h), page);
+	enqueue_huge_page(h, page);
+	h->resv_huge_pages++;
+	spin_unlock(&hugetlb_lock);
+	hugepage_subpool_put_pages(spool, 1);
+}
+
 static void free_huge_page(struct page *page)
 {
 	/*
@@ -2625,7 +2653,7 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-	page_cache_release(new_page);
+	put_huge_page(new_page, use_reserve);
 out_old_page:
 	page_cache_release(old_page);
 out_lock:
@@ -2725,7 +2753,7 @@ retry:
 
 		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
 		if (err) {
-			put_page(page);
+			put_huge_page(page, use_reserve);
 			if (err == -EEXIST)
 				goto retry;
 			goto out;
@@ -2798,7 +2826,7 @@ backout:
 	spin_unlock(&mm->page_table_lock);
 backout_unlocked:
 	unlock_page(page);
-	put_page(page);
+	put_huge_page(page, use_reserve);
 	goto out;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 18/18] mm, hugetlb: remove a hugetlb_instantiation_mutex
Now we have prepared the infrastructure needed to remove this awkward mutex which serializes all faulting tasks, so remove it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 909075b..4fab047 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2533,9 +2533,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
- * cannot race with other handlers or page migration.
- * Keep the pte_same checks anyway to make transition from the mutex easier.
+ * Called with pte_page locked so we cannot race with page migration.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
@@ -2844,7 +2842,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2864,17 +2861,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
-		goto out_mutex;
-	}
+	if (huge_pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep, flags);
 
 	ret = 0;
 
@@ -2887,10 +2876,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		if (vma_has_reserves(h, vma, address) < 0) {
-			ret = VM_FAULT_OOM;
-			goto out_mutex;
-		}
+		if (vma_has_reserves(h, vma, address) < 0)
+			return VM_FAULT_OOM;
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -2939,9 +2926,6 @@ out_page_table_lock:
 	unlock_page(page);
 	put_page(page);
 
-out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
 	return ret;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()
If we fail after having allocated a hugepage, it is hard to recover properly. One such example is the reserve count; we don't have any method to recover it. Although I will introduce a function to recover the reserve count in a following patch, it is better to avoid allocating a hugepage whenever possible. So move up anon_vma_prepare(), which can fail in an OOM situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 683fd38..bb8a45f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2536,6 +2536,15 @@ retry_avoidcopy:
 	/* Drop page_table_lock as buddy allocator may be called */
 	spin_unlock(&mm->page_table_lock);
 
+	/*
+	 * When the original hugepage is shared one, it does not have
+	 * anon_vma prepared.
+	 */
+	if (unlikely(anon_vma_prepare(vma))) {
+		ret = VM_FAULT_OOM;
+		goto out_old_page;
+	}
+
 	use_reserve = vma_has_reserves(h, vma, address);
 	if (use_reserve == -ENOMEM) {
 		ret = VM_FAULT_OOM;
@@ -2590,15 +2599,6 @@ retry_avoidcopy:
 		goto out_lock;
 	}
 
-	/*
-	 * When the original hugepage is shared one, it does not have
-	 * anon_vma prepared.
-	 */
-	if (unlikely(anon_vma_prepare(vma))) {
-		ret = VM_FAULT_OOM;
-		goto out_new_page;
-	}
-
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
@@ -2625,7 +2625,6 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-out_new_page:
 	page_cache_release(new_page);
 out_old_page:
 	page_cache_release(old_page);
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
If parallel faults occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault at the same address. This causes a reserved pool shortage just for a little while, and a faulting thread that is ensured to have enough reserved hugepages can get a SIGBUS signal. To solve this problem we already have a nice solution, namely the hugetlb_instantiation_mutex, which blocks other threads from diving into the fault handler. It solves the problem clearly, but it introduces performance degradation, because it serializes all fault handling. Now, I try to remove the hugetlb_instantiation_mutex to get rid of that performance degradation. A prerequisite is that other threads should not get a SIGBUS if they are ensured to have enough reserved pages. For this purpose, if we fail to allocate a new hugepage with use_reserve, we return just 0 instead of VM_FAULT_SIGBUS. use_reserve represents that this user is a legitimate one who is ensured to have enough reserved pages. This prevents these threads from getting a SIGBUS signal and makes them retry fault handling. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a9ec69..909075b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2623,7 +2623,10 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
-		ret = VM_FAULT_SIGBUS;
+		if (use_reserve)
+			ret = 0;
+		else
+			ret = VM_FAULT_SIGBUS;
 		goto out_lock;
 	}
 
@@ -2741,7 +2744,10 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = VM_FAULT_SIGBUS;
+			if (use_reserve)
+				ret = 0;
+			else
+				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 14/18] mm, hugetlb: clean-up error handling in hugetlb_cow()
Current code include 'Caller expects lock to be held' in every error path. We can clean-up it as we do error handling in one place. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 255bd9e..683fd38 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2516,6 +2516,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; int use_reserve, outside_reserve = 0; + int ret = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2537,11 +2538,8 @@ retry_avoidcopy: use_reserve = vma_has_reserves(h, vma, address); if (use_reserve == -ENOMEM) { - page_cache_release(old_page); - - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto out_old_page; } /* @@ -2588,9 +2586,8 @@ retry_avoidcopy: WARN_ON_ONCE(1); } - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_SIGBUS; + ret = VM_FAULT_SIGBUS; + goto out_lock; } /* @@ -2598,11 +2595,8 @@ retry_avoidcopy: * anon_vma prepared. */ if (unlikely(anon_vma_prepare(vma))) { - page_cache_release(new_page); - page_cache_release(old_page); - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto out_new_page; } copy_user_huge_page(new_page, old_page, address, vma, @@ -2630,12 +2624,15 @@ retry_avoidcopy: } spin_unlock(mm-page_table_lock); mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + +out_new_page: page_cache_release(new_page); +out_old_page: page_cache_release(old_page); - +out_lock: /* Caller expects lock to be held */ spin_lock(mm-page_table_lock); - return 0; + return ret; } /* Return the pagecache page at a given address within a VMA */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages
alloc_huge_page_node() uses dequeue_huge_page_node() without any validation check, so it can steal a reserved page unconditionally. To fix it, check the number of free huge pages in alloc_huge_page_node(). Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6782b41..d971233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
  */
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_node(h, nid);
+	if (h->free_huge_pages - h->resv_huge_pages > 0)
+		page = dequeue_huge_page_node(h, nid);
 	spin_unlock(&hugetlb_lock);
 
 	if (!page)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 12/18] mm, hugetlb: remove a check for return value of alloc_huge_page()
Now, alloc_huge_page() only returns -ENOSPC on failure, so we don't need to worry about other return values. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 94173e0..35ccdad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2562,7 +2562,6 @@ retry_avoidcopy:
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
 	if (IS_ERR(new_page)) {
-		long err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2591,10 +2590,7 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(&mm->page_table_lock);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		else
-			return VM_FAULT_SIGBUS;
+		return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2720,11 +2716,7 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			if (ret == -ENOMEM)
-				ret = VM_FAULT_OOM;
-			else
-				ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 13/18] mm, hugetlb: grab a page_table_lock after page_cache_release
We don't need to hold the page_table_lock when we release a page, so defer grabbing the page_table_lock until after the release. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 35ccdad..255bd9e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2630,10 +2630,11 @@ retry_avoidcopy:
 	}
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
+
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	return 0;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 11/18] mm, hugetlb: move down outside_reserve check
Just move the outside_reserve check down. This makes the code more readable. There is no functional change. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f31ca5..94173e0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2530,20 +2530,6 @@ retry_avoidcopy:
 		return 0;
 	}
 
-	/*
-	 * If the process that created a MAP_PRIVATE mapping is about to
-	 * perform a COW due to a shared page count, attempt to satisfy
-	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
-	 * consumed or not. If reserves were used, a partial faulted mapping
-	 * at the time of fork() could consume its reserves on COW instead
-	 * of the full address range.
-	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) &&
-			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-			old_page != pagecache_page)
-		outside_reserve = 1;
-
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
@@ -2557,6 +2543,20 @@ retry_avoidcopy:
 		spin_lock(&mm->page_table_lock);
 		return VM_FAULT_OOM;
 	}
+
+	/*
+	 * If the process that created a MAP_PRIVATE mapping is about to
+	 * perform a COW due to a shared page count, attempt to satisfy
+	 * the allocation without using the existing reserves. The pagecache
+	 * page is used to determine if the reserve at this address was
+	 * consumed or not. If reserves were used, a partial faulted mapping
+	 * at the time of fork() could consume its reserves on COW instead
+	 * of the full address range.
+	 */
+	if (!(vma->vm_flags & VM_MAYSHARE) &&
+			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+			old_page != pagecache_page)
+		outside_reserve = 1;
 	use_reserve = use_reserve && !outside_reserve;
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()
To implement a graceful failure handling, we need to know whether allocation request is for reserved pool or not, on higher level. In this patch, we just move up vma_has_reseve() to caller function in order to know it. There is no functional change. Following patches implement a grace failure handling and remove a hugetlb_instantiation_mutex. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a66226e..5f31ca5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h, } static struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr, int avoid_reserve) + unsigned long addr, int use_reserve) { struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct page *page; - int ret, idx, use_reserve; + int ret, idx; struct hugetlb_cgroup *h_cg; idx = hstate_index(h); @@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, * need pages and subpool limit allocated allocated if no reserve * mapping overlaps. */ - use_reserve = vma_has_reserves(h, vma, addr); - if (use_reserve 0) - return ERR_PTR(-ENOMEM); - - use_reserve = use_reserve !avoid_reserve; if (!use_reserve (hugepage_subpool_get_pages(spool, 1) 0)) return ERR_PTR(-ENOSPC); @@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, { struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; - int outside_reserve = 0; + int use_reserve, outside_reserve = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2553,7 +2548,18 @@ retry_avoidcopy: /* Drop page_table_lock as buddy allocator may be called */ spin_unlock(mm-page_table_lock); - new_page = alloc_huge_page(vma, address, outside_reserve); + + use_reserve = vma_has_reserves(h, vma, address); + if (use_reserve == -ENOMEM) { + page_cache_release(old_page); + + /* Caller expects lock to be held */ + spin_lock(mm-page_table_lock); + return VM_FAULT_OOM; + } + use_reserve = use_reserve !outside_reserve; + + new_page = alloc_huge_page(vma, address, use_reserve); if (IS_ERR(new_page)) { long err = PTR_ERR(new_page); @@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page; struct address_space *mapping; pte_t new_pte; + int use_reserve = 0; /* * Currently, we are forced to kill the process in the event the @@ -2704,7 +2711,14 @@ retry: size = i_size_read(mapping-host) huge_page_shift(h); if (idx = size) goto out; - page = alloc_huge_page(vma, address, 0); + + use_reserve = vma_has_reserves(h, vma, address); + if (use_reserve == -ENOMEM) { + ret = VM_FAULT_OOM; + goto out; + } + + page = alloc_huge_page(vma, address, use_reserve); if (IS_ERR(page)) { ret = PTR_ERR(page); if (ret == -ENOMEM) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/