Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
> And in almost all invoking cases, order is 0, so it returns immediately.

You can't be sure of that.

Okay. Let's not invoke it when order is 0.

Let's not ruin git blame.

Hmm... when I do git blame, I can't find anything related to this.

I mean that if we merge a pointless patch, it will show up first in git blame instead of the meaningful patch, which makes it bothersome when we look for the patch to blame.

Oh... I see. Thanks for the comments.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/7 David Rientjes rient...@google.com:
> On Sat, 7 Jul 2012, Joonsoo Kim wrote:
>> __alloc_pages_direct_compact has many arguments, so invoking it is very costly. And in almost all invoking cases, order is 0, so it returns immediately.
>
> If zero cost is very costly, then this might make sense. __alloc_pages_direct_compact() is inlined by gcc.

In my kernel image, __alloc_pages_direct_compact() is not inlined by gcc, so I sent this patch. But currently I think it is not useful, so let's drop it. Thanks for the comments.
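For readers following the thread, the "zero cost" David refers to is the order-0 guard at the top of __alloc_pages_direct_compact() in kernels of this era. The sketch below is purely illustrative: the argument list is heavily abbreviated and is not the real signature.

```c
/*
 * Sketch only: the real function takes many more parameters
 * (zonelist, nodemask, migratetype, ...).  The point is the order-0
 * early return, which costs nothing once the call is inlined into
 * its caller.
 */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
			     unsigned long *did_some_progress)
{
	if (!order)
		return NULL;	/* compaction cannot help order-0 requests */

	/* ... run memory compaction, then retry the allocation ... */
	return NULL;
}
```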
Re: WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow)
2012/7/8 Fengguang Wu fengguang...@intel.com:
> Hi Vegard,
>
> This warning code is triggered for the attached config:
>
> __lockdep_trace_alloc():
>	/*
>	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
>	 */
>	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
>		return;
>
> Where the irq is possibly disabled at the beginning of __slab_alloc():
>
>	local_irq_save(flags);

Currently, in the slub code, kmemcheck_alloc_shadow() is always invoked with irqs disabled. I think that something like the following is needed.

diff --git a/mm/slub.c b/mm/slub.c
index 8c691fa..5d41cad 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1324,8 +1324,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 			!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
 		int pages = 1 << oo_order(oo);
 
+		if (flags & __GFP_WAIT)
+			local_irq_enable();
+
 		kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
 
+		if (flags & __GFP_WAIT)
+			local_irq_disable();
+
 		/*
 		 * Objects from caches that have a constructor don't get
 		 * cleared when they're allocated, so we need to do it here.
-- 
1.7.0.4
Re: [PATCH 3/3] slub: release a lock if freeing object with a lock is failed in __slab_free()
2012/7/7 Christoph Lameter c...@linux.com:
> On Fri, 6 Jul 2012, JoonSoo Kim wrote:
>> At CPU2, we don't need the lock anymore, because this slab is already in the partial list.
>
> For that scenario we could also simply do a trylock there and redo the loop if we fail. But still, what guarantees that another process will not modify the page struct between fetching the data and a successful trylock?

I'm not familiar with English, so please take my ability to understand into consideration. I have a hard time understanding what you want to accomplish here. We don't need a guarantee that another process will not modify the page struct between fetching the data and a successful trylock. No, we do not need that, since the cmpxchg will then fail.

> Maybe it would be useful to split this patch into two? One where you introduce the dropping of the lock and the other where you get rid of certain code paths?

Dropping the lock is needed for getting rid of certain code paths, so I can't split this patch into two. Sorry for confusing everyone; I think I didn't explain my purpose well. I will prepare a new version in which I explain the purpose of the patch better. Thanks for the kind review.
Re: WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow)
2012/7/9 David Rientjes rient...@google.com:
> On Mon, 9 Jul 2012, JoonSoo Kim wrote:
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 8c691fa..5d41cad 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1324,8 +1324,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>>  			!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
>>  		int pages = 1 << oo_order(oo);
>>  
>> +		if (flags & __GFP_WAIT)
>> +			local_irq_enable();
>> +
>>  		kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
>>  
>> +		if (flags & __GFP_WAIT)
>> +			local_irq_disable();
>> +
>>  		/*
>>  		 * Objects from caches that have a constructor don't get
>>  		 * cleared when they're allocated, so we need to do it here.
>
> This patch is suboptimal when the branch is taken since you just disabled irqs and now are immediately reenabling them and then disabling them again. (And your patch is also whitespace damaged, has no changelog, and isn't signed off, so it can't be applied.)

My intent was just to provide a reference, because there had been no reply to this thread when I saw it.

> The correct fix is what I proposed at http://marc.info/?l=linux-kernel&m=133754837703630 and was awaiting testing. If Rus, Steven, or Fengguang could test this then we could add it as a stable backport as well.

Your patch looks good to me.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/9 David Rientjes rient...@google.com:
> On Sun, 8 Jul 2012, JoonSoo Kim wrote:
>> __alloc_pages_direct_compact has many arguments, so invoking it is very costly. And in almost all invoking cases, order is 0, so it returns immediately.
>
> If zero cost is very costly, then this might make sense. __alloc_pages_direct_compact() is inlined by gcc.

In my kernel image, __alloc_pages_direct_compact() is not inlined by gcc.

> Adding Andrew and Mel to the thread since this would require that we revert 11e33f6a55ed ("page allocator: break up the allocator entry point into fast and slow paths"), which would obviously not be a clean revert since there have been several changes to these functions over the past three years.

Only __alloc_pages_direct_compact() is not inlined. All the others (__alloc_pages_high_priority, __alloc_pages_direct_reclaim, ...) are inlined correctly, so a revert is not needed. I think __alloc_pages_direct_compact() can't be inlined by gcc because it is so big and is invoked two times in __alloc_pages_nodemask().

> I'm stunned (and skeptical) that __alloc_pages_direct_compact() is not inlined by your gcc, especially since the kernel must be compiled with optimization (either -O1 or -O2, which causes these functions to be inlined). What version of gcc are you using and on what architecture? Please do "make mm/page_alloc.s" and send it to me privately, and I'll file this and fix it up on gcc-bugs.

I will send the result of "make mm/page_alloc.s" to you privately. My environment is x86_64, GNU C version 4.6.3.

> I'll definitely be following up on this.

Thanks for the comments.
Re: [PATCH] mm: don't invoke __alloc_pages_direct_compact when order 0
2012/7/10 Mel Gorman mgor...@suse.de:
> You say that invoking the function is very costly. I agree that a function call with that many parameters is hefty, but it is also in the slow path of the allocator. For order-0 allocations we are about to enter direct reclaim, where I would expect the cost far exceeds the cost of a function call.

Yes, I agree.

> If the cost is indeed high and you have seen this in profiles, then I suggest you create a forced inline function alloc_pages_direct_compact that does this:
>
>	if (order != 0)
>		__alloc_pages_direct_compact(...)
>
> and then call alloc_pages_direct_compact instead of __alloc_pages_direct_compact. After that, recheck the profiles (although I expect the difference to be marginal) and the size of vmlinux (if it gets bigger, it's probably not worth it). That would be functionally similar to your patch but it will preserve git blame, churn less code and be harder to make mistakes with in the unlikely event a third call to alloc_pages_direct_compact is ever added.

Your suggestion looks good. But the size of page_alloc.o becomes larger than before. I tested 3 approaches: vanilla, always_inline, and wrapping (alloc_pages_direct_compact, which is your suggestion). In my environment (v3.5-rc5, gcc 4.6.3, x86_64), page_alloc.o shows the numbers below.

                         total    .text    .text.unlikely
 page_alloc_vanilla.o:   93432    0x510a   0x243
 page_alloc_inline.o:    93336    0x52ca   0xa4
 page_alloc_wrapping.o:  93528    0x515a   0x238

Andrew said that inlining adds only 26 bytes to the .text of page_alloc.o, but on my system it needs more than that. Currently, I think this patch doesn't have an obvious benefit, so I want to drop it. Any objections?

Thanks for the good comments.
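A minimal sketch of the wrapper Mel describes follows. The name mirrors his suggestion, but the reduced argument list is purely illustrative (the real __alloc_pages_direct_compact() takes far more parameters); this shows the idea, not the code that was merged.

```c
/*
 * Forced-inline guard: order-0 callers never pay for the out-of-line
 * call, while the existing function body (and its git history) is
 * left untouched.
 */
static __always_inline struct page *
alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
			   unsigned long *did_some_progress)
{
	if (!order)
		return NULL;

	return __alloc_pages_direct_compact(gfp_mask, order,
					    did_some_progress);
}
```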
Re: [Patch 4/7] softirq: Use hotplug thread infrastructure
2012/7/16 Thomas Gleixner t...@linutronix.de:
> -		static const struct sched_param param = {
> -			.sched_priority = MAX_RT_PRIO-1
> -		};
> -
> -		p = per_cpu(ksoftirqd, hotcpu);
> -		per_cpu(ksoftirqd, hotcpu) = NULL;
> -		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> -		kthread_stop(p);
> +		int hotcpu = (unsigned long)hcpu;
> +		takeover_tasklets(hotcpu);
>  		break;

The new "int hotcpu = (unsigned long)hcpu;" looks somewhat strange here. takeover_tasklets((unsigned long)hcpu) would be sufficient for me.
Re: [Patch 0/7] Per cpu thread hotplug infrastructure - V3
2012/7/16 Thomas Gleixner t...@linutronix.de:
> The following series implements the infrastructure for parking and unparking kernel threads to avoid the full teardown and fork on cpu hotplug operations, along with management infrastructure for hotplug and users.
>
> Changes vs. V2:
>   Use callbacks for all functionality. Thanks to Rusty for pointing that out. It makes the use sites nice and simple and keeps all the code which would be duplicated otherwise on the core.
>
> Thanks, tglx

I'm just a kernel newbie, so I have a dumb question. Why doesn't this patchset include a patch for the migration threads in kernel/stop_machine.c? Is there any reason for that?
[RFC PATCH 0/8] remove vm_struct list management
This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).

And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.

I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.

And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.

This series is based on v3.7-rc7 and on top of a submitted patchset for ARM, 'introduce static_vm for ARM-specific static mapped area' (https://lkml.org/lkml/2012/11/27/356), but it runs properly on x86 without the ARM patchset.

Joonsoo Kim (8):
  mm, vmalloc: change iterating a vmlist to find_vm_area()
  mm, vmalloc: move get_vmalloc_info() to vmalloc.c
  mm, vmalloc: protect va->vm by vmap_area_lock
  mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()
  mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()
  mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()
  mm, vmalloc: makes vmlist only for kexec
  mm, vmalloc: remove list management operation after initializing vmalloc

 arch/tile/mm/pgtable.c      |   7 +-
 arch/unicore32/mm/ioremap.c |  17 +--
 arch/x86/mm/ioremap.c       |   7 +-
 fs/proc/Makefile            |   2 +-
 fs/proc/internal.h          |  18 ---
 fs/proc/meminfo.c           |   1 +
 fs/proc/mmu.c               |  60 --
 include/linux/vmalloc.h     |  19 +++-
 mm/vmalloc.c                | 258 +--
 9 files changed, 204 insertions(+), 185 deletions(-)
 delete mode 100644 fs/proc/mmu.c

-- 
1.7.9.5
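To make the complexity argument concrete, the snippet below is roughly the linear walk that linking a new area into the address-sorted vmlist costs today; it is simplified from insert_vmalloc_vmlist() in mm/vmalloc.c of this era (see patch 8/8, where it is removed). The vmap_area rbtree has already found a vacant range in O(logN) before this O(N) insertion runs.

```c
/* Simplified sketch of the O(N) insertion into the address-sorted,
 * singly linked vmlist that this series removes. */
struct vm_struct *tmp, **p;

write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
	if (tmp->addr >= vm->addr)
		break;
}
vm->next = *p;
*p = vm;
write_unlock(&vmlist_lock);
```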
[RFC PATCH 2/8] mm, vmalloc: move get_vmalloc_info() to vmalloc.c
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this code must be here and it's implementation needs vmlist_lock and it iterate a vmlist which may be internal data structure for vmalloc. It is preferable that vmlist_lock and vmlist is only used in vmalloc.c for maintainability. So move the code to vmalloc.c Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/fs/proc/Makefile b/fs/proc/Makefile index 99349ef..88092c1 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -5,7 +5,7 @@ obj-y += proc.o proc-y := nommu.o task_nommu.o -proc-$(CONFIG_MMU) := mmu.o task_mmu.o +proc-$(CONFIG_MMU) := task_mmu.o proc-y += inode.o root.o base.o generic.o array.o \ proc_tty.o fd.o diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 43973b0..5a1eda2 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -28,24 +28,6 @@ extern int proc_net_init(void); static inline int proc_net_init(void) { return 0; } #endif -struct vmalloc_info { - unsigned long used; - unsigned long largest_chunk; -}; - -#ifdef CONFIG_MMU -#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) -extern void get_vmalloc_info(struct vmalloc_info *vmi); -#else - -#define VMALLOC_TOTAL 0UL -#define get_vmalloc_info(vmi) \ -do { \ - (vmi)-used = 0;\ - (vmi)-largest_chunk = 0; \ -} while(0) -#endif - extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task); extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 80e4645..c594dfb 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -11,6 +11,7 @@ #include linux/swap.h #include linux/vmstat.h #include linux/atomic.h +#include linux/vmalloc.h #include asm/page.h #include asm/pgtable.h #include internal.h diff --git a/fs/proc/mmu.c b/fs/proc/mmu.c deleted file mode 100644 index 8ae221d..000 --- a/fs/proc/mmu.c +++ /dev/null @@ -1,60 +0,0 @@ -/* mmu.c: mmu memory info files - * - * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved. - * Written by David Howells (dhowe...@redhat.com) - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ -#include linux/spinlock.h -#include linux/vmalloc.h -#include linux/highmem.h -#include asm/pgtable.h -#include internal.h - -void get_vmalloc_info(struct vmalloc_info *vmi) -{ - struct vm_struct *vma; - unsigned long free_area_size; - unsigned long prev_end; - - vmi-used = 0; - - if (!vmlist) { - vmi-largest_chunk = VMALLOC_TOTAL; - } - else { - vmi-largest_chunk = 0; - - prev_end = VMALLOC_START; - - read_lock(vmlist_lock); - - for (vma = vmlist; vma; vma = vma-next) { - unsigned long addr = (unsigned long) vma-addr; - - /* -* Some archs keep another range for modules in vmlist -*/ - if (addr VMALLOC_START) - continue; - if (addr = VMALLOC_END) - break; - - vmi-used += vma-size; - - free_area_size = addr - prev_end; - if (vmi-largest_chunk free_area_size) - vmi-largest_chunk = free_area_size; - - prev_end = vma-size + addr; - } - - if (VMALLOC_END - prev_end vmi-largest_chunk) - vmi-largest_chunk = VMALLOC_END - prev_end; - - read_unlock(vmlist_lock); - } -} diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 6071e91..698b1e5 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -158,4 +158,22 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) # endif #endif +struct vmalloc_info { + unsigned long used; + unsigned long largest_chunk; +}; + +#ifdef CONFIG_MMU +#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START) +extern void get_vmalloc_info(struct vmalloc_info *vmi); +#else + +#define VMALLOC_TOTAL 0UL +#define get_vmalloc_info(vmi) \ +do { \ + (vmi)-used = 0;\ + (vmi)-largest_chunk = 0; \ +} while (0) +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 78e0830..16147bc 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2642,5 +2642,49 @@ static int __init proc_vmalloc_init(void
[RFC PATCH 5/8] mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()
This patch is preparing step for removing vmlist entirely. For above purpose, we change iterating a vmap_list codes to iterating a vmap_area_list. It is somewhat trivial change, but just one thing should be noticed. vmlist is lack of information about some areas in vmalloc address space. For example, vm_map_ram() allocate area in vmalloc address space, but it doesn't make a link with vmlist. To provide full information about vmalloc address space is better idea, so we don't use va-vm and use vmap_area directly. This makes get_vmalloc_info() more precise. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d21167f..f7f4a35 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2668,46 +2668,47 @@ module_init(proc_vmalloc_init); void get_vmalloc_info(struct vmalloc_info *vmi) { - struct vm_struct *vma; + struct vmap_area *va; unsigned long free_area_size; unsigned long prev_end; vmi-used = 0; + vmi-largest_chunk = 0; - if (!vmlist) { - vmi-largest_chunk = VMALLOC_TOTAL; - } else { - vmi-largest_chunk = 0; + prev_end = VMALLOC_START; - prev_end = VMALLOC_START; + spin_lock(vmap_area_lock); - read_lock(vmlist_lock); + if (list_empty(vmap_area_list)) { + vmi-largest_chunk = VMALLOC_TOTAL; + goto out; + } - for (vma = vmlist; vma; vma = vma-next) { - unsigned long addr = (unsigned long) vma-addr; + list_for_each_entry(va, vmap_area_list, list) { + unsigned long addr = va-va_start; - /* -* Some archs keep another range for modules in vmlist -*/ - if (addr VMALLOC_START) - continue; - if (addr = VMALLOC_END) - break; + /* +* Some archs keep another range for modules in vmalloc space +*/ + if (addr VMALLOC_START) + continue; + if (addr = VMALLOC_END) + break; - vmi-used += vma-size; + vmi-used += (va-va_end - va-va_start); - free_area_size = addr - prev_end; - if (vmi-largest_chunk free_area_size) - vmi-largest_chunk = free_area_size; + free_area_size = addr - prev_end; + if (vmi-largest_chunk free_area_size) + vmi-largest_chunk = free_area_size; - prev_end = vma-size + addr; - } + prev_end = va-va_end; + } - if (VMALLOC_END - prev_end vmi-largest_chunk) - vmi-largest_chunk = VMALLOC_END - prev_end; + if (VMALLOC_END - prev_end vmi-largest_chunk) + vmi-largest_chunk = VMALLOC_END - prev_end; - read_unlock(vmlist_lock); - } +out: + spin_unlock(vmap_area_lock); } #endif -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 7/8] mm, vmalloc: makes vmlist only for kexec
Although our intention remove vmlist entirely, but there is one exception. kexec use vmlist symbol, and we can't remove it, because it is related to userspace program. When kexec dumps system information, it write vmlist address and vm_struct's address offset. In userspace program, these information is used for getting first address in vmalloc space. Then it dumps memory content in vmalloc space which is higher than this address. For supporting this optimization, we should maintain a vmlist. But this doesn't means that we should maintain full vmlist. Just one vm_struct for vmlist is sufficient. So use vmlist_early for full chain of vm_struct and assign a dummy_vm to vmlist for supporting kexec. Cc: Eric Biederman ebied...@xmission.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f134950..8a1b959 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -272,6 +272,27 @@ static unsigned long cached_align; static unsigned long vmap_area_pcpu_hole; +/*** Old vmalloc interfaces ***/ +DEFINE_RWLOCK(vmlist_lock); +/* vmlist is only for kexec */ +struct vm_struct *vmlist; +static struct vm_struct dummy_vm; + +/* This is only for kexec. + * It wants to know first vmalloc address for optimization */ +static void setup_vmlist(void) +{ + struct vmap_area *va; + + if (list_empty(vmap_area_list)) { + vmlist = NULL; + } else { + va = list_entry((vmap_area_list)-next, typeof(*va), list); + dummy_vm.addr = (void *)va-va_start; + vmlist = dummy_vm; + } +} + static struct vmap_area *__find_vmap_area(unsigned long addr) { struct rb_node *n = vmap_area_root.rb_node; @@ -313,7 +334,7 @@ static void __insert_vmap_area(struct vmap_area *va) rb_link_node(va-rb_node, parent, p); rb_insert_color(va-rb_node, vmap_area_root); - /* address-sort this list so it is usable like the vmlist */ + /* address-sort this list so it is usable like the vmlist_early */ tmp = rb_prev(va-rb_node); if (tmp) { struct vmap_area *prev; @@ -321,6 +342,8 @@ static void __insert_vmap_area(struct vmap_area *va) list_add_rcu(va-list, prev-list); } else list_add_rcu(va-list, vmap_area_list); + + setup_vmlist(); } static void purge_vmap_area_lazy(void); @@ -485,6 +508,8 @@ static void __free_vmap_area(struct vmap_area *va) vmap_area_pcpu_hole = max(vmap_area_pcpu_hole, va-va_end); kfree_rcu(va, rcu_head); + + setup_vmlist(); } /* @@ -1125,11 +1150,13 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro } EXPORT_SYMBOL(vm_map_ram); +static struct vm_struct *vmlist_early; + /** * vm_area_add_early - add vmap area early during boot * @vm: vm_struct to add * - * This function is used to add fixed kernel vm area to vmlist before + * This function is used to add fixed kernel vm area to vmlist_early before * vmalloc_init() is called. @vm-addr, @vm-size, and @vm-flags * should contain proper values and the other fields should be zero. * @@ -1140,7 +1167,7 @@ void __init vm_area_add_early(struct vm_struct *vm) struct vm_struct *tmp, **p; BUG_ON(vmap_initialized); - for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { + for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) { BUG_ON(tmp-addr vm-addr + vm-size); break; @@ -1190,8 +1217,8 @@ void __init vmalloc_init(void) INIT_LIST_HEAD(vbq-free); } - /* Import existing vmlist entries. */ - for (tmp = vmlist; tmp; tmp = tmp-next) { + /* Import existing vmlist_early entries. 
*/ + for (tmp = vmlist_early; tmp; tmp = tmp-next) { va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT); va-flags = VM_VM_AREA; va-va_start = (unsigned long)tmp-addr; @@ -1283,10 +1310,6 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) } EXPORT_SYMBOL_GPL(map_vm_area); -/*** Old vmalloc interfaces ***/ -DEFINE_RWLOCK(vmlist_lock); -struct vm_struct *vmlist; - static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { @@ -1313,7 +1336,7 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) vm-flags = ~VM_UNLIST; write_lock(vmlist_lock); - for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { + for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) break; } @@ -1369,7 +1392,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, /* * When this function is called from __vmalloc_node_range, -* we do not add vm_struct to vmlist here to avoid
[RFC PATCH 8/8] mm, vmalloc: remove list management operation after initializing vmalloc
Now, there is no need to maintain vmlist_early after initializing vmalloc. So remove related code and data structure. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 698b1e5..10d19c9 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -130,7 +130,6 @@ extern long vwrite(char *buf, char *addr, unsigned long count); /* * Internals. Dont't use.. */ -extern rwlock_t vmlist_lock; extern struct vm_struct *vmlist; extern __init void vm_area_add_early(struct vm_struct *vm); extern __init void vm_area_register_early(struct vm_struct *vm, size_t align); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 8a1b959..957a098 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -272,8 +272,6 @@ static unsigned long cached_align; static unsigned long vmap_area_pcpu_hole; -/*** Old vmalloc interfaces ***/ -DEFINE_RWLOCK(vmlist_lock); /* vmlist is only for kexec */ struct vm_struct *vmlist; static struct vm_struct dummy_vm; @@ -334,7 +332,7 @@ static void __insert_vmap_area(struct vmap_area *va) rb_link_node(va-rb_node, parent, p); rb_insert_color(va-rb_node, vmap_area_root); - /* address-sort this list so it is usable like the vmlist_early */ + /* address-sort this list */ tmp = rb_prev(va-rb_node); if (tmp) { struct vmap_area *prev; @@ -1150,7 +1148,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro } EXPORT_SYMBOL(vm_map_ram); -static struct vm_struct *vmlist_early; +static struct vm_struct *vmlist_early __initdata; /** * vm_area_add_early - add vmap area early during boot @@ -1323,7 +1321,7 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, spin_unlock(vmap_area_lock); } -static void insert_vmalloc_vmlist(struct vm_struct *vm) +static void remove_vm_unlist(struct vm_struct *vm) { struct vm_struct *tmp, **p; @@ -1334,22 +1332,13 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) */ smp_wmb(); vm-flags = ~VM_UNLIST; - - write_lock(vmlist_lock); - for (p = vmlist_early; (tmp = *p) != NULL; p = tmp-next) { - if (tmp-addr = vm-addr) - break; - } - vm-next = *p; - *p = vm; - write_unlock(vmlist_lock); } static void insert_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { setup_vmalloc_vm(vm, va, flags, caller); - insert_vmalloc_vmlist(vm); + remove_vm_unlist(vm); } static struct vm_struct *__get_vm_area_node(unsigned long size, @@ -1392,10 +1381,9 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, /* * When this function is called from __vmalloc_node_range, -* we do not add vm_struct to vmlist_early here to avoid -* accessing uninitialized members of vm_struct such as -* pages and nr_pages fields. They will be set later. -* To distinguish it from others, we use a VM_UNLIST flag. +* we add VM_UNLIST flag to avoid accessing uninitialized +* members of vm_struct such as pages and nr_pages fields. +* They will be set later. */ if (flags VM_UNLIST) setup_vmalloc_vm(area, va, flags, caller); @@ -1483,21 +1471,6 @@ struct vm_struct *remove_vm_area(const void *addr) va-flags = ~VM_VM_AREA; spin_unlock(vmap_area_lock); - if (!(vm-flags VM_UNLIST)) { - struct vm_struct *tmp, **p; - /* -* remove from list and disallow access to -* this vm_struct before unmap. (address range -* confliction is maintained by vmap.) 
-*/ - write_lock(vmlist_lock); - for (p = vmlist_early; (tmp = *p) != vm; - p = tmp-next) - ; - *p = tmp-next; - write_unlock(vmlist_lock); - } - vmap_debug_free_range(va-va_start, va-va_end); free_unmap_vmap_area(va); vm-size -= PAGE_SIZE; @@ -1717,10 +1690,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, return NULL; /* -* In this function, newly allocated vm_struct is not added -* to vmlist_early at __get_vm_area_node(). so, it is added here. +* In this function, newly allocated vm_struct has VM_UNLIST flag. +* It means that vm_struct is not fully initialized. +* Now, it is fully initialized, so remove this flag here. */ - insert_vmalloc_vmlist(area); + remove_vm_unlist(area); /* * A ref_count = 3
[RFC PATCH 6/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()
This patch is preparing step for removing vmlist entirely. For above purpose, we change iterating a vmap_list codes to iterating a vmap_area_list. It is somewhat trivial change, but just one thing should be noticed. Using vmap_area_list in vmallocinfo() introduce ordering problem in SMP system. In s_show(), we retrieve some values from vm_struct. vm_struct's values is not fully setup when va-vm is assigned. Full setup is notified by removing VM_UNLIST flag without holding a lock. When we see that VM_UNLIST is removed, it is not ensured that vm_struct has proper values in view of other CPUs. So we need smp_[rw]mb for ensuring that proper values is assigned when we see that VM_UNLIST is removed. Therefore, this patch not only change a iteration list, but also add a appropriate smp_[rw]mb to right places. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f7f4a35..f134950 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1304,7 +1304,14 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm) { struct vm_struct *tmp, **p; + /* +* Before removing VM_UNLIST, +* we should make sure that vm has proper values. +* Pair with smp_rmb() in show_numa_info(). +*/ + smp_wmb(); vm-flags = ~VM_UNLIST; + write_lock(vmlist_lock); for (p = vmlist; (tmp = *p) != NULL; p = tmp-next) { if (tmp-addr = vm-addr) @@ -2539,19 +2546,19 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) - __acquires(vmlist_lock) + __acquires(vmap_area_lock) { loff_t n = *pos; - struct vm_struct *v; + struct vmap_area *va; - read_lock(vmlist_lock); - v = vmlist; - while (n 0 v) { + spin_lock(vmap_area_lock); + va = list_entry((vmap_area_list)-next, typeof(*va), list); + while (n 0 va-list != vmap_area_list) { n--; - v = v-next; + va = list_entry(va-list.next, typeof(*va), list); } - if (!n) - return v; + if (!n va-list != vmap_area_list) + return va; return NULL; @@ -2559,16 +2566,20 @@ static void *s_start(struct seq_file *m, loff_t *pos) static void *s_next(struct seq_file *m, void *p, loff_t *pos) { - struct vm_struct *v = p; + struct vmap_area *va = p, *next; ++*pos; - return v-next; + next = list_entry(va-list.next, typeof(*va), list); + if (next-list != vmap_area_list) + return next; + + return NULL; } static void s_stop(struct seq_file *m, void *p) - __releases(vmlist_lock) + __releases(vmap_area_lock) { - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); } static void show_numa_info(struct seq_file *m, struct vm_struct *v) @@ -2579,6 +2590,11 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) if (!counters) return; + /* Pair with smp_wmb() in insert_vmalloc_vmlist() */ + smp_rmb(); + if (v-flags VM_UNLIST) + return; + memset(counters, 0, nr_node_ids * sizeof(unsigned int)); for (nr = 0; nr v-nr_pages; nr++) @@ -2592,36 +2608,50 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v) static int s_show(struct seq_file *m, void *p) { - struct vm_struct *v = p; + struct vmap_area *va = p; + struct vm_struct *vm; + + if (!(va-flags VM_VM_AREA)) { + seq_printf(m, 0x%pK-0x%pK %7ld, + (void *)va-va_start, (void *)va-va_end, + va-va_end - va-va_start); + if (va-flags (VM_LAZY_FREE | VM_LAZY_FREEING)) + seq_printf(m, (freeing)); + + seq_putc(m, '\n'); + return 0; + } + + vm = va-vm; seq_printf(m, 0x%pK-0x%pK %7ld, - v-addr, v-addr + v-size, v-size); + vm-addr, vm-addr + vm-size, vm-size); - if (v-caller) - seq_printf(m, %pS, v-caller); + if (vm-caller) + seq_printf(m, 
%pS, vm-caller); - if (v-nr_pages) - seq_printf(m, pages=%d, v-nr_pages); + if (vm-nr_pages) + seq_printf(m, pages=%d, vm-nr_pages); - if (v-phys_addr) - seq_printf(m, phys=%llx, (unsigned long long)v-phys_addr); + if (vm-phys_addr) + seq_printf(m, phys=%llx, (unsigned long long)vm-phys_addr); - if (v-flags VM_IOREMAP) + if (vm-flags VM_IOREMAP) seq_printf(m, ioremap); - if (v-flags VM_ALLOC) + if (vm-flags VM_ALLOC) seq_printf(m, vmalloc); - if (v-flags VM_MAP) + if (vm-flags VM_MAP
[RFC PATCH 4/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()
Now, when we hold a vmap_area_lock, va-vm can't be discarded. So we can safely access to va-vm when iterating a vmap_area_list with holding a vmap_area_lock. With this property, change iterating vmlist codes in vread/vwrite() to iterating vmap_area_list. There is a little difference relate to lock, because vmlist_lock is mutex, but, vmap_area_lock is spin_lock. It may introduce a spinning overhead during vread/vwrite() is executing. But, these are debug-oriented functions, so this overhead is not real problem for common case. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index a0b85a6..d21167f 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2009,7 +2009,8 @@ static int aligned_vwrite(char *buf, char *addr, unsigned long count) long vread(char *buf, char *addr, unsigned long count) { - struct vm_struct *tmp; + struct vmap_area *va; + struct vm_struct *vm; char *vaddr, *buf_start = buf; unsigned long buflen = count; unsigned long n; @@ -2018,10 +2019,17 @@ long vread(char *buf, char *addr, unsigned long count) if ((unsigned long) addr + count count) count = -(unsigned long) addr; - read_lock(vmlist_lock); - for (tmp = vmlist; count tmp; tmp = tmp-next) { - vaddr = (char *) tmp-addr; - if (addr = vaddr + tmp-size - PAGE_SIZE) + spin_lock(vmap_area_lock); + list_for_each_entry(va, vmap_area_list, list) { + if (!count) + break; + + if (!(va-flags VM_VM_AREA)) + continue; + + vm = va-vm; + vaddr = (char *) vm-addr; + if (addr = vaddr + vm-size - PAGE_SIZE) continue; while (addr vaddr) { if (count == 0) @@ -2031,10 +2039,10 @@ long vread(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + tmp-size - PAGE_SIZE - addr; + n = vaddr + vm-size - PAGE_SIZE - addr; if (n count) n = count; - if (!(tmp-flags VM_IOREMAP)) + if (!(vm-flags VM_IOREMAP)) aligned_vread(buf, addr, n); else /* IOREMAP area is treated as memory hole */ memset(buf, 0, n); @@ -2043,7 +2051,7 @@ long vread(char *buf, char *addr, unsigned long count) count -= n; } finished: - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); if (buf == buf_start) return 0; @@ -2082,7 +2090,8 @@ finished: long vwrite(char *buf, char *addr, unsigned long count) { - struct vm_struct *tmp; + struct vmap_area *va; + struct vm_struct *vm; char *vaddr; unsigned long n, buflen; int copied = 0; @@ -2092,10 +2101,17 @@ long vwrite(char *buf, char *addr, unsigned long count) count = -(unsigned long) addr; buflen = count; - read_lock(vmlist_lock); - for (tmp = vmlist; count tmp; tmp = tmp-next) { - vaddr = (char *) tmp-addr; - if (addr = vaddr + tmp-size - PAGE_SIZE) + spin_lock(vmap_area_lock); + list_for_each_entry(va, vmap_area_list, list) { + if (!count) + break; + + if (!(va-flags VM_VM_AREA)) + continue; + + vm = va-vm; + vaddr = (char *) vm-addr; + if (addr = vaddr + vm-size - PAGE_SIZE) continue; while (addr vaddr) { if (count == 0) @@ -2104,10 +2120,10 @@ long vwrite(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + tmp-size - PAGE_SIZE - addr; + n = vaddr + vm-size - PAGE_SIZE - addr; if (n count) n = count; - if (!(tmp-flags VM_IOREMAP)) { + if (!(vm-flags VM_IOREMAP)) { aligned_vwrite(buf, addr, n); copied++; } @@ -2116,7 +2132,7 @@ long vwrite(char *buf, char *addr, unsigned long count) count -= n; } finished: - read_unlock(vmlist_lock); + spin_unlock(vmap_area_lock); if (!copied) return 0; return buflen; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More 
majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/8] mm, vmalloc: change iterating a vmlist to find_vm_area()
The purpose of iterating a vmlist is finding vm area with specific virtual address. find_vm_area() is provided for this purpose and more efficient, because it uses a rbtree. So change it. Cc: Chris Metcalf cmetc...@tilera.com Cc: Guan Xuetao g...@mprc.pku.edu.cn Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c index de0de0c..862782d 100644 --- a/arch/tile/mm/pgtable.c +++ b/arch/tile/mm/pgtable.c @@ -592,12 +592,7 @@ void iounmap(volatile void __iomem *addr_in) in parallel. Reuse of the virtual address is prevented by leaving it in the global lists until we're done with it. cpa takes care of the direct mappings. */ - read_lock(vmlist_lock); - for (p = vmlist; p; p = p-next) { - if (p-addr == addr) - break; - } - read_unlock(vmlist_lock); + p = find_vm_area((void *)addr); if (!p) { pr_err(iounmap: bad address %p\n, addr); diff --git a/arch/unicore32/mm/ioremap.c b/arch/unicore32/mm/ioremap.c index b7a6055..13068ee 100644 --- a/arch/unicore32/mm/ioremap.c +++ b/arch/unicore32/mm/ioremap.c @@ -235,7 +235,7 @@ EXPORT_SYMBOL(__uc32_ioremap_cached); void __uc32_iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct **p, *tmp; + struct vm_struct *vm; /* * If this is a section based mapping we need to handle it @@ -244,17 +244,10 @@ void __uc32_iounmap(volatile void __iomem *io_addr) * all the mappings before the area can be reclaimed * by someone else. */ - write_lock(vmlist_lock); - for (p = vmlist ; (tmp = *p) ; p = tmp-next) { - if ((tmp-flags VM_IOREMAP) (tmp-addr == addr)) { - if (tmp-flags VM_UNICORE_SECTION_MAPPING) { - unmap_area_sections((unsigned long)tmp-addr, - tmp-size); - } - break; - } - } - write_unlock(vmlist_lock); + vm = find_vm_area(addr); + if (vm (vm-flags VM_IOREMAP) + (vm-flags VM_UNICORE_SECTION_MAPPING)) + unmap_area_sections((unsigned long)vm-addr, vm-size); vunmap(addr); } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 78fe3f1..9a1e658 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -282,12 +282,7 @@ void iounmap(volatile void __iomem *addr) in parallel. Reuse of the virtual address is prevented by leaving it in the global lists until we're done with it. cpa takes care of the direct mappings. */ - read_lock(vmlist_lock); - for (p = vmlist; p; p = p-next) { - if (p-addr == (void __force *)addr) - break; - } - read_unlock(vmlist_lock); + p = find_vm_area((void __force *)addr); if (!p) { printk(KERN_ERR iounmap: bad address %p\n, addr); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 3/8] mm, vmalloc: protect va-vm by vmap_area_lock
Inserting and removing an entry to vmlist is linear time complexity, so it is inefficient. Following patches will try to remove vmlist entirely. This patch is preparing step for it. For removing vmlist, iterating vmlist codes should be changed to iterating a vmap_area_list. Before implementing that, we should make sure that when we iterate a vmap_area_list, accessing to va-vm doesn't cause a race condition. This patch ensure that when iterating a vmap_area_list, there is no race condition for accessing to vm_struct. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 16147bc..a0b85a6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1290,12 +1290,14 @@ struct vm_struct *vmlist; static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { + spin_lock(vmap_area_lock); vm-flags = flags; vm-addr = (void *)va-va_start; vm-size = va-va_end - va-va_start; vm-caller = caller; va-vm = vm; va-flags |= VM_VM_AREA; + spin_unlock(vmap_area_lock); } static void insert_vmalloc_vmlist(struct vm_struct *vm) @@ -1446,6 +1448,11 @@ struct vm_struct *remove_vm_area(const void *addr) if (va va-flags VM_VM_AREA) { struct vm_struct *vm = va-vm; + spin_lock(vmap_area_lock); + va-vm = NULL; + va-flags = ~VM_VM_AREA; + spin_unlock(vmap_area_lock); + if (!(vm-flags VM_UNLIST)) { struct vm_struct *tmp, **p; /* -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
2012/11/28 Joonsoo Kim js1...@gmail.com:
> In the current implementation, we use an ARM-specific flag, VM_ARM_STATIC_MAPPING, for distinguishing the ARM-specific static mapped area. The purpose of the static mapped area is to re-use it when the entire physical address range of an ioremap request can be covered by this area.
>
> This implementation causes needless overhead in some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas to decide whether each one matches or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap we currently check 300 areas.
>
> If we construct an extra list for static mapped areas, we can eliminate the above-mentioned overhead. With an extra list, if there is one static mapped area, we just check that one area and proceed to the next operation quickly.
>
> In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is a better idea.
>
> Another reason for doing this work is to remove the architecture dependency on the vmalloc layer. I think that vmlist and vmlist_lock are internal data structures of the vmalloc layer. Some code for debugging and statistics inevitably uses vmlist and vmlist_lock, but it is preferable that they are used as little as possible outside of vmalloc.c.
>
> Changelog
> v1->v2: [2/3]: patch description is improved.
>         Rebased on v3.7-rc7
>
> Joonsoo Kim (3):
>   ARM: vmregion: remove vmregion code entirely
>   ARM: static_vm: introduce an infrastructure for static mapped area
>   ARM: mm: use static_vm for managing static mapped areas
>
>  arch/arm/include/asm/mach/static_vm.h |  51
>  arch/arm/mm/Makefile                  |   2 +-
>  arch/arm/mm/ioremap.c                 |  69 ---
>  arch/arm/mm/mm.h                      |  10 --
>  arch/arm/mm/mmu.c                     |  55 +
>  arch/arm/mm/static_vm.c               |  97
>  arch/arm/mm/vmregion.c                | 205 -
>  arch/arm/mm/vmregion.h                |  31 -
>  8 files changed, 208 insertions(+), 312 deletions(-)
>  create mode 100644 arch/arm/include/asm/mach/static_vm.h
>  create mode 100644 arch/arm/mm/static_vm.c
>  delete mode 100644 arch/arm/mm/vmregion.c
>  delete mode 100644 arch/arm/mm/vmregion.h
>
> -- 
> 1.7.9.5

Hello, Russell.

Could you review this patchset, please? I sent another patchset to the mm community on top of this one. That patchset is related to this one, so I would like to get a review of this patchset first :)

Thanks.
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Andrew.

2012/12/7 Andrew Morton a...@linux-foundation.org:
> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>> This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).
>>
>> And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.
>>
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.
>>
>> And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.
>>
>> This series is based on v3.7-rc7 and on top of a submitted patchset for ARM, 'introduce static_vm for ARM-specific static mapped area' (https://lkml.org/lkml/2012/11/27/356), but it runs properly on x86 without the ARM patchset.
>
> This all looks rather nice, but not mergeable into anything at this stage in the release cycle.
>
> What are the implications of "on top of submitted patchset for ARM"? Does it depend on the ARM patches in any way, or is it independently mergeable and testable?

Yes, it depends on the ARM patches. There is code that manipulates the vmlist in ARM, so without the ARM patches applied, this patchset causes a compile error for ARM. But the build for x86 works fine with this patchset :)

In the ARM patches, the method used for removing vmlist-related code is the same as in 1/8 of this patchset. But they include some ARM-specific optimization, so I sent them separately. If they can't be accepted, I can rework the ARM patches to be like 1/8 of this patchset.
Re: [RFC PATCH 0/8] remove vm_struct list management
2012/12/7 Andrew Morton a...@linux-foundation.org:
> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure. And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'.
>
> Yes, this is irritating. Perhaps Vivek or one of the other kexec people could take a look at this please - it would obviously be much better if we can avoid merging [patch 7/8] at all.

I'm not sure, but I'm almost sure that [patch 7/8] has no problem. In kexec.c, we write the address of vmlist and the offset of vm_struct's address field. This implies that the user of this information doesn't have any other information about vm_struct and cannot use other fields of vm_struct; they can use *only* the address field. So keeping just one vm_struct in vmlist, representing the first area of the vmalloc layer, should be safe. But the kexec people would be very helpful in validating this patch.

Thanks.
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Bob.

2012/12/7 Bob Liu lliu...@gmail.com:
> Hi Joonsoo,
>
> On Fri, Dec 7, 2012 at 12:09 AM, Joonsoo Kim js1...@gmail.com wrote:
>> This patchset removes vm_struct list management after initializing vmalloc. Adding and removing an entry to vmlist is linear time complexity, so it is inefficient. If we maintain this list, the overall time complexity of adding and removing an area to vmalloc space is O(N), although we use an rbtree for finding a vacant place and its time complexity is just O(logN).
>>
>> And vmlist and vmlist_lock are used in many places outside of vmalloc.c. It is preferable that we hide this raw data structure and provide well-defined functions for supporting it, because that prevents mistakes when manipulating these structures and makes it easier to maintain the vmalloc layer.
>>
>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure.
>>
>> And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'. Please let me know what I am missing.
>
> Nice work! I also thought about this several weeks ago, but I think the efficiency may be a problem. As you know, two locks (vmap_area_lock and vmlist_lock) are used currently so that some work may be done in parallel (not proved). If vmlist is removed, I'm afraid vmap_area_lock will become a bottleneck, which will reduce the efficiency.

Thanks for the comment!

Yes, there are some places where work may be done in parallel. For example, accesses to '/proc/meminfo', '/proc/vmallocinfo' and '/proc/kcore' may be done in parallel. But access to these is not the main functionality of the vmalloc layer. Optimizing the main functions like vmalloc and vfree is preferable to optimizing the above, and this patchset optimizes the main functions by removing the vmlist iteration.

Thanks.
Re: [RFC PATCH 1/8] mm, vmalloc: change iterating a vmlist to find_vm_area()
Hello, Pekka.

2012/12/7 Pekka Enberg penb...@kernel.org:
> On Thu, Dec 6, 2012 at 6:09 PM, Joonsoo Kim js1...@gmail.com wrote:
>> The purpose of iterating a vmlist is finding the vm area with a specific virtual address. find_vm_area() is provided for this purpose and is more efficient, because it uses an rbtree. So change it.
>
> You no longer take the 'vmlist_lock'. This is safe, because...?

As Bob mentioned, find_vm_area() holds 'vmap_area_lock' while searching for an area. While we hold 'vmap_area_lock', the area can't be removed, so this change is safe.

Thanks.
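For reference, a simplified sketch of what find_vm_area() looks like in mm/vmalloc.c of this era is shown below (not a verbatim copy); the rbtree walk inside find_vmap_area() runs entirely under vmap_area_lock, which is why the callers no longer need vmlist_lock.

```c
/* Simplified sketch of the lookup that replaces the vmlist walk. */
struct vm_struct *find_vm_area(const void *addr)
{
	struct vmap_area *va;

	/* find_vmap_area() takes vmap_area_lock around the rbtree lookup */
	va = find_vmap_area((unsigned long)addr);
	if (va && va->flags & VM_VM_AREA)
		return va->vm;

	return NULL;
}
```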
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Vivek.

2012/12/7 Vivek Goyal vgo...@redhat.com:
> On Fri, Dec 07, 2012 at 10:16:55PM +0900, JoonSoo Kim wrote:
>> 2012/12/7 Andrew Morton a...@linux-foundation.org:
>>> On Fri, 7 Dec 2012 01:09:27 +0900 Joonsoo Kim js1...@gmail.com wrote:
>>>> I'm not sure that 7/8 ("makes vmlist only for kexec") is fine, because it is related to a userspace program. As far as I know, makedumpfile uses kexec's output information and it only needs the first address of the vmalloc layer. My implementation reflects this fact, but I'm not sure. And for now, I haven't fully tested this patchset. Basic operation works well, but I haven't tested kexec. So I send this patchset as an 'RFC'.
>>>
>>> Yes, this is irritating. Perhaps Vivek or one of the other kexec people could take a look at this please - it would obviously be much better if we can avoid merging [patch 7/8] at all.
>>
>> I'm not sure, but I'm almost sure that [patch 7/8] has no problem. In kexec.c, we write the address of vmlist and the offset of vm_struct's address field. This implies that the user of this information doesn't have any other information about vm_struct and cannot use other fields of vm_struct; they can use *only* the address field. So keeping just one vm_struct in vmlist, representing the first area of the vmalloc layer, should be safe.
>
> I browsed through the makedumpfile source quickly. So yes, it does look like we look at the first vmlist element's ->addr field to figure out where the vmalloc area starts.
>
> Can we get the same information from this rb-tree of vmap_area? Is the ->va_start field communicating the same information that vmlist was communicating? What's the difference between vmap_area_root and vmlist?

Thanks for the comment.

Yes, vmap_area's va_start field represents the same information as vm_struct's addr. vmap_area_root is a data structure for fast searching of an area. vmap_area_list is an address-sorted list, so we can use it like the vmlist.

There is a little difference between vmap_area_list and vmlist. vmlist lacks information about some areas in the vmalloc address space. For example, vm_map_ram() allocates an area in the vmalloc address space, but it doesn't make a link with vmlist. To provide full information about the vmalloc address space, using vmap_area_list is more adequate.

> So, without knowing the details of both data structures, I think that if vmlist is going away, then user space tools should be able to traverse the vmap_area_root rb tree. I am assuming it is sorted using the ->addr field and we should be able to get the vmalloc area start from there. It will just be a matter of exporting the right fields to user space (instead of vmlist).

There is an address-sorted list of vmap_area, vmap_area_list. So we can use it for traversing vmalloc areas if that is necessary. But, as I mentioned before, kexec writes *just* the address of vmlist and the offset of vm_struct's address field. This implies that userspace doesn't traverse vmlist: vm_struct's next field, which would be needed for traversing, is not exported, and without it there is no way to traverse. So, IMHO, assigning a dummy vm_struct to vmlist, as implemented in [7/8], is a safe way to maintain compatibility with the userspace tool. :)

Thanks.
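The kexec-side export being discussed is, to the best of my reading, the pair of lines below from crash_save_vmcoreinfo_init() in kernel/kexec.c of this era (shown as a sketch, not verified against a specific tree). Only the symbol address of vmlist and the offset of vm_struct.addr are exported, which is why a single dummy vm_struct whose addr points at the first vmalloc area keeps makedumpfile working.

```c
/* Sketch of the existing vmcoreinfo export that makedumpfile consumes. */
VMCOREINFO_SYMBOL(vmlist);		/* address of the vmlist head */
VMCOREINFO_OFFSET(vm_struct, addr);	/* offset of vm_struct.addr */
```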
[PATCH v3 1/2] scripts/tags.sh: Support subarch for ARM
Current tags.sh doesn't handle subarch for ARM. There are too many subarch on ARM, it is hard that we locate some functions which are defined in every subarch with tags util family. Therefore support subarch for removing this unconvenience. We can use ARM subarch functionality like below. make cscope O=. SRCARCH=arm SUBARCH=xxx Signed-off-by: Joonsoo Kim js1...@gmail.com --- v2: change bash specific '[[]]' to 'case in' statement. v3: quote the patterns. diff --git a/scripts/tags.sh b/scripts/tags.sh index 79fdafb..8fb18d1 100755 --- a/scripts/tags.sh +++ b/scripts/tags.sh @@ -48,13 +48,14 @@ find_arch_sources() for i in $archincludedir; do prune=$prune -wholename $i -prune -o done - find ${tree}arch/$1 $ignore $prune -name $2 -print; + find ${tree}arch/$1 $ignore $subarchprune $prune -name $2 -print; } # find sources in arch/$1/include find_arch_include_sources() { - include=$(find ${tree}arch/$1/ -name include -type d); + include=$(find ${tree}arch/$1/ $subarchprune \ + -name include -type d -print); if [ -n $include ]; then archincludedir=$archincludedir $include find $include $ignore -name $2 -print; @@ -234,6 +235,21 @@ if [ ${ARCH} = um ]; then else archinclude=${SUBARCH} fi +elif [ ${SRCARCH} = arm -a ${SUBARCH} != ]; then + subarchdir=$(find ${tree}arch/$SRCARCH/ -name mach-* -type d -o \ + -name plat-* -type d); + for i in $subarchdir; do + case $i in + *mach-${SUBARCH}) + ;; + *plat-${SUBARCH}) + ;; + *) + subarchprune=$subarchprune \ + -wholename $i -prune -o + ;; + esac + done fi remove_structs= -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 2/2] scripts/tags.sh: Support compiled source
We usually have interst in compiled files only, because they are strongly related to individual's work. Current tags.sh can't select compiled files, so support it. We can use this functionality like below. make cscope O=. SRCARCH= COMPILED_SOURCE=compiled It must be executed after building the kernel. Signed-off-by: Joonsoo Kim js1...@gmail.com --- v2: change bash specific '[[]]' to 'case in' statement. use COMPILED_SOURCE env var, instead of abusing SUBARCH v3: change [ $COMPILED_SOURCE=compiled ] to [ -n $COMPILED_SOURCE ] diff --git a/scripts/tags.sh b/scripts/tags.sh index 8fb18d1..08f06c0 100755 --- a/scripts/tags.sh +++ b/scripts/tags.sh @@ -96,6 +96,32 @@ all_sources() find_other_sources '*.[chS]' } +all_compiled_sources() +{ + for i in $(all_sources); do + case $i in + *.[cS]) + j=${i/\.[cS]/\.o} + if [ -e $j ]; then + echo $i + fi + ;; + *) + echo $i + ;; + esac + done +} + +all_target_sources() +{ + if [ -n $COMPILED_SOURCE ]; then + all_compiled_sources + else + all_sources + fi +} + all_kconfigs() { for arch in $ALLSOURCE_ARCHS; do @@ -111,18 +137,18 @@ all_defconfigs() docscope() { - (echo \-k; echo \-q; all_sources) cscope.files + (echo \-k; echo \-q; all_target_sources) cscope.files cscope -b -f cscope.out } dogtags() { - all_sources | gtags -i -f - + all_target_sources | gtags -i -f - } exuberant() { - all_sources | xargs $1 -a \ + all_target_sources | xargs $1 -a\ -I __initdata,__exitdata,__acquires,__releases \ -I __read_mostly,cacheline_aligned \ -I cacheline_aligned_in_smp \ @@ -174,7 +200,7 @@ exuberant() emacs() { - all_sources | xargs $1 -a \ + all_target_sources | xargs $1 -a\ --regex='/^(ENTRY|_GLOBAL)(\([^)]*\)).*/\2/'\ --regex='/^SYSCALL_DEFINE[0-9]?(\([^,)]*\).*/sys_\1/' \ --regex='/^TRACE_EVENT(\([^,)]*\).*/trace_\1/' \ @@ -221,11 +247,10 @@ xtags() elif $1 --version 21 | grep -iq emacs; then emacs $1 else - all_sources | xargs $1 -a + all_target_sources | xargs $1 -a fi } - # Support um (which uses SUBARCH) if [ ${ARCH} = um ]; then if [ $SUBARCH = i386 ]; then -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
Hello, Andrew. 2012/11/20 Minchan Kim minc...@kernel.org: Hi Joonsoo, Sorry for the delay. On Thu, Nov 15, 2012 at 02:09:04AM +0900, JoonSoo Kim wrote: Hi, Minchan. 2012/11/14 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote: 2012/11/13 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote: 2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. 
Assume last_pkmap = 20 and indexes 1-9 and 12-19 are kmapped, while 10 and 11 are kunmapped. do kmap_flush_unused() => flush index 10, 11 => last_pkmap = 10; do kunmap() with index 17; do kmap_flush_unused() => flush index 17 => last_pkmap = 17? In this case, an unconditional last_pkmap_nr update skips one kunmapped index. So, a conditional update is needed. Thanks for pointing out, Joonsoo. You're right. I misunderstood your flush_all_zero_pkmaps change. With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index. What's the benefit of returning the first flushed free slot index rather than the free slot index? If flush_all_zero_pkmaps
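For readability, since the quoting above drops some operators, the conditional update being discussed amounts to the following sketch (lock_kmap()/unlock_kmap() are the existing helpers in mm/highmem.c and PKMAP_INVALID_INDEX is the sentinel added by the patch):

void kmap_flush_unused(void)
{
	unsigned int index;

	lock_kmap();
	index = flush_all_zero_pkmaps();
	/* Only move last_pkmap_nr backwards: a flushed index above
	 * last_pkmap_nr says nothing about slots below it, as the
	 * 20-slot example above shows. */
	if (index != PKMAP_INVALID_INDEX && index < last_pkmap_nr)
		last_pkmap_nr = index;
	unlock_kmap();
}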
[PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Changelog v1-v2: [2/3]: patch description is improved. Rebased on v3.7-rc7 Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/3] ARM: vmregion: remove vmregion code entirely
Now, there is no user for vmregion. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 8a9c4cb..4e333fa 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o vmregion.o + mmap.o pgd.o mmu.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c deleted file mode 100644 index a631016..000 --- a/arch/arm/mm/vmregion.c +++ /dev/null @@ -1,205 +0,0 @@ -#include linux/fs.h -#include linux/spinlock.h -#include linux/list.h -#include linux/proc_fs.h -#include linux/seq_file.h -#include linux/slab.h - -#include vmregion.h - -/* - * VM region handling support. - * - * This should become something generic, handling VM region allocations for - * vmalloc and similar (ioremap, module space, etc). - * - * I envisage vmalloc()'s supporting vm_struct becoming: - * - * struct vm_struct { - *struct vmregion region; - *unsigned longflags; - *struct page **pages; - *unsigned int nr_pages; - *unsigned longphys_addr; - * }; - * - * get_vm_area() would then call vmregion_alloc with an appropriate - * struct vmregion head (eg): - * - * struct vmregion vmalloc_head = { - * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list), - * .vm_start = VMALLOC_START, - * .vm_end = VMALLOC_END, - * }; - * - * However, vmalloc_head.vm_start is variable (typically, it is dependent on - * the amount of RAM found at boot time.) I would imagine that get_vm_area() - * would have to initialise this each time prior to calling vmregion_alloc(). - */ - -struct arm_vmregion * -arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align, - size_t size, gfp_t gfp, const void *caller) -{ - unsigned long start = head-vm_start, addr = head-vm_end; - unsigned long flags; - struct arm_vmregion *c, *new; - - if (head-vm_end - head-vm_start size) { - printk(KERN_WARNING %s: allocation too big (requested %#x)\n, - __func__, size); - goto out; - } - - new = kmalloc(sizeof(struct arm_vmregion), gfp); - if (!new) - goto out; - - new-caller = caller; - - spin_lock_irqsave(head-vm_lock, flags); - - addr = rounddown(addr - size, align); - list_for_each_entry_reverse(c, head-vm_list, vm_list) { - if (addr = c-vm_end) - goto found; - addr = rounddown(c-vm_start - size, align); - if (addr start) - goto nospc; - } - - found: - /* -* Insert this entry after the one we found. 
-*/ - list_add(new-vm_list, c-vm_list); - new-vm_start = addr; - new-vm_end = addr + size; - new-vm_active = 1; - - spin_unlock_irqrestore(head-vm_lock, flags); - return new; - - nospc: - spin_unlock_irqrestore(head-vm_lock, flags); - kfree(new); - out: - return NULL; -} - -static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - - list_for_each_entry(c, head-vm_list, vm_list) { - if (c-vm_active c-vm_start == addr) - goto out; - } - c = NULL; - out: - return c; -} - -struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - if (c) - c-vm_active = 0; - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c) -{ - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - list_del(c-vm_list); - spin_unlock_irqrestore(head-vm_lock, flags); - - kfree(c); -} - -#ifdef CONFIG_PROC_FS -static int arm_vmregion_show(struct seq_file *m, void *p) -{ - struct arm_vmregion *c = list_entry(p, struct arm_vmregion, vm_list); - - seq_printf(m, 0x%08lx-0x%08lx %7lu, c-vm_start, c-vm_end, - c-vm_end - c-vm_start); - if (c-caller
[PATCH v2 2/3] ARM: static_vm: introduce an infrastructure for static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Now, I introduce an ARM-specific infrastructure for static mapped area. In the following patch, we will use this and resolve above mentioned problem. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h new file mode 100644 index 000..1bb6604 --- /dev/null +++ b/arch/arm/include/asm/mach/static_vm.h @@ -0,0 +1,45 @@ +/* + * arch/arm/include/asm/mach/static_vm.h + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_MACH_STATIC_VM_H +#define _ASM_MACH_STATIC_VM_H + +#include linux/types.h +#include linux/vmalloc.h + +struct static_vm { + struct static_vm*next; + void*vaddr; + unsigned long size; + unsigned long flags; + phys_addr_t paddr; + const void *caller; +}; + +extern struct static_vm *static_vmlist; +extern spinlock_t static_vmlist_lock; + +extern struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags); +extern struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags); +extern void init_static_vm(struct static_vm *static_vm, + struct vm_struct *vm, unsigned long flags); +extern void insert_static_vm(struct static_vm *vm); + +#endif /* _ASM_MACH_STATIC_VM_H */ diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 4e333fa..57b329a 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o + mmap.o pgd.o mmu.o static_vm.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/static_vm.c b/arch/arm/mm/static_vm.c new file mode 100644 index 000..d7677cf --- /dev/null +++ b/arch/arm/mm/static_vm.c @@ -0,0 +1,97 @@ +/* + * arch/arm/mm/static_vm.c + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/spinlock.h + +#include asm/mach/static_vm.h + +struct static_vm *static_vmlist
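The body of static_vm.c is cut off above. As a rough sketch only (reconstructed from the header declarations, not the author's exact code), the vaddr lookup helper would walk the singly linked static_vmlist under static_vmlist_lock:

struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags)
{
	struct static_vm *area;
	unsigned long addr = (unsigned long)vaddr;
	unsigned long start;

	spin_lock(&static_vmlist_lock);
	for (area = static_vmlist; area; area = area->next) {
		if ((area->flags & flags) != flags)
			continue;

		start = (unsigned long)area->vaddr;
		/* static mappings do not overlap, so the first hit wins */
		if (addr >= start && addr < start + area->size) {
			spin_unlock(&static_vmlist_lock);
			return area;
		}
	}
	spin_unlock(&static_vmlist_lock);

	return NULL;
}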
[PATCH v2 3/3] ARM: mm: use static_vm for managing static mapped areas
A static mapped area is ARM-specific, so it is better not to use generic vmalloc data structure, that is, vmlist and vmlist_lock for managing static mapped area. And it causes some needless overhead and reducing this overhead is better idea. Now, we have newly introduced static_vm infrastructure. With it, we don't need to iterate all mapped areas. Instead, we just iterate static mapped areas. It helps to reduce an overhead of finding matched area. And architecture dependency on vmalloc layer is removed, so it will help to maintainability for vmalloc layer. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h index 1bb6604..0d9c685 100644 --- a/arch/arm/include/asm/mach/static_vm.h +++ b/arch/arm/include/asm/mach/static_vm.h @@ -32,6 +32,12 @@ struct static_vm { const void *caller; }; +#define STATIC_VM_MEM 0x0001 +#define STATIC_VM_EMPTY0x0002 + +/* mtype should be less than 28 */ +#define STATIC_VM_MTYPE(mt)(1UL ((mt) + 4)) + extern struct static_vm *static_vmlist; extern spinlock_t static_vmlist_lock; diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c index 5dcc2fd..b7f3c27 100644 --- a/arch/arm/mm/ioremap.c +++ b/arch/arm/mm/ioremap.c @@ -36,6 +36,7 @@ #include asm/system_info.h #include asm/mach/map.h +#include asm/mach/static_vm.h #include asm/mach/pci.h #include mm.h @@ -197,7 +198,8 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, const struct mem_type *type; int err; unsigned long addr; - struct vm_struct * area; + struct vm_struct *area; + phys_addr_t paddr = __pfn_to_phys(pfn); #ifndef CONFIG_ARM_LPAE /* @@ -219,24 +221,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, /* * Try to reuse one of the static mapping whenever possible. 
*/ - read_lock(vmlist_lock); - for (area = vmlist; area; area = area-next) { - if (!size || (sizeof(phys_addr_t) == 4 pfn = 0x10)) - break; - if (!(area-flags VM_ARM_STATIC_MAPPING)) - continue; - if ((area-flags VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype)) - continue; - if (__phys_to_pfn(area-phys_addr) pfn || - __pfn_to_phys(pfn) + size-1 area-phys_addr + area-size-1) - continue; - /* we can drop the lock here as we know *area is static */ - read_unlock(vmlist_lock); - addr = (unsigned long)area-addr; - addr += __pfn_to_phys(pfn) - area-phys_addr; - return (void __iomem *) (offset + addr); + if (size !((sizeof(phys_addr_t) == 4 pfn = 0x10))) { + struct static_vm *static_vm; + + static_vm = find_static_vm_paddr(__pfn_to_phys(pfn), size, + STATIC_VM_MEM | STATIC_VM_MTYPE(mtype)); + if (static_vm) { + addr = (unsigned long)static_vm-vaddr; + addr += paddr - static_vm-paddr; + return (void __iomem *) (offset + addr); + } } - read_unlock(vmlist_lock); /* * Don't allow RAM to be mapped - this causes problems with ARMv6+ @@ -248,7 +243,7 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, if (!area) return NULL; addr = (unsigned long)area-addr; - area-phys_addr = __pfn_to_phys(pfn); + area-phys_addr = paddr; #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) if (DOMAIN_IO == 0 @@ -346,34 +341,20 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, bool cached) void __iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct *vm; - - read_lock(vmlist_lock); - for (vm = vmlist; vm; vm = vm-next) { - if (vm-addr addr) - break; - if (!(vm-flags VM_IOREMAP)) - continue; - /* If this is a static mapping we must leave it alone */ - if ((vm-flags VM_ARM_STATIC_MAPPING) - (vm-addr = addr) (vm-addr + vm-size addr)) { - read_unlock(vmlist_lock); - return; - } + struct static_vm *static_vm; + + static_vm = find_static_vm_vaddr(addr, STATIC_VM_MEM); + if (static_vm) + return; + #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) - /* -* If this is a section based mapping we need to handle it -* specially as the VM subsystem does not know how to handle -* such a beast. -*/ - if ((vm-addr == addr
[PATCH] slub: assign refcount for kmalloc_caches
commit cce89f4f6911286500cf7be0363f46c9b0a12ce0('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes errornous situation. Paul Hargrove report that when he create a 8-byte kmem_cache and destory it, he encounter below message. 'Objects remaining in kmalloc-8 on kmem_cache_close()' 8-byte kmem_cache merge with 8-byte kmalloc cache and refcount is increased by one. So, resulting refcount is 1. When destory it, it hit refcount = 0, then kmem_cache_close() is executed and error message is printed. This patch assign initial refcount 1 to kmalloc_caches, so fix this errornous situtation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s-refcount = 1; list_add(s-list, slab_caches); return s; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
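A minimal reproduction of the reported scenario, e.g. from a test module (illustrative only; the cache name is arbitrary): the 8-byte cache is merged with kmalloc-8, whose refcount was left at 0, so destroying the alias reaches refcount == 0 and closes kmalloc-8.

static int __init repro_init(void)
{
	struct kmem_cache *s;

	s = kmem_cache_create("repro-8", 8, 0, 0, NULL);
	if (!s)
		return -ENOMEM;
	/* with the bug, this prints "Objects remaining in kmalloc-8 ..." */
	kmem_cache_destroy(s);
	return 0;
}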
Re: [PATCH] slub: assign refcount for kmalloc_caches
2012/12/26 Joonsoo Kim js1...@gmail.com: commit cce89f4f6911286500cf7be0363f46c9b0a12ce0('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes errornous situation. Paul Hargrove report that when he create a 8-byte kmem_cache and destory it, he encounter below message. 'Objects remaining in kmalloc-8 on kmem_cache_close()' 8-byte kmem_cache merge with 8-byte kmalloc cache and refcount is increased by one. So, resulting refcount is 1. When destory it, it hit refcount = 0, then kmem_cache_close() is executed and error message is printed. This patch assign initial refcount 1 to kmalloc_caches, so fix this errornous situtation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s-refcount = 1; list_add(s-list, slab_caches); return s; -- 1.7.9.5 I missed some explanation. In v3.8-rc1, this problem is already solved. See create_kmalloc_cache() in mm/slab_common.c. So this patch is just for v3.7 stable. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] mm, bootmem: panic in bootmem alloc functions even if slab is available
Hello, Sasha. 2012/12/28 Sasha Levin sasha.le...@oracle.com: On 12/27/2012 06:04 PM, David Rientjes wrote: On Thu, 27 Dec 2012, Sasha Levin wrote: That's exactly what happens with the patch. Note that in the current upstream version there are several slab checks scattered all over. In this case for example, I'm removing it from __alloc_bootmem_node(), but the first code line of__alloc_bootmem_node_nopanic() is: if (WARN_ON_ONCE(slab_is_available())) return kzalloc(size, GFP_NOWAIT); You're only talking about mm/bootmem.c and not mm/nobootmem.c, and notice that __alloc_bootmem_node() does not call __alloc_bootmem_node_nopanic(), it calls ___alloc_bootmem_node_nopanic(). Holy cow, this is an underscore hell. Thanks, Sasha I have a different idea. How about removing fallback allocation in bootmem.c completely? I don't know why it is there exactly. But, warning for 'slab_is_available()' is there for a long time. So, most people who misuse fallback allocation change their code adequately. I think that removing fallback at this time is valid. Isn't it? Fallback allocation may cause possible bug. If someone free a memory from fallback allocation, it can't be handled properly. So, IMHO, at this time, we should remove fallback allocation in bootmem.c entirely. Please let me know what I misunderstand. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
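To illustrate the mismatch described above (a sketch only; the call sequence is hypothetical, the function names are the real bootmem APIs): when the nopanic variants silently fall back to the slab allocator, a caller that later releases the memory through the bootmem interface frees memory the slab allocator actually owns.

	void *p = __alloc_bootmem_node_nopanic(pgdat, size, align, goal);

	/* ... later, during teardown ... */
	free_bootmem(__pa(p), size);	/* wrong if p actually came from kzalloc() */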
[PATCH] x86, reboot: skip reboot_fixups in early boot phase
During the early boot phase, the PCI bus subsystem is not yet initialized. If a panic occurs in the early boot phase and panic_timeout is set, the code flow goes into emergency_restart() and hits mach_reboot_fixups(), then encounters another panic. On the second panic, we can't hold the panic_lock, so the code flow goes into panic_smp_self_stop(), which prevents the system from restarting. To avoid the second panic, skip reboot_fixups in the early boot phase. This makes panic_timeout work in the early boot phase. Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/x86/kernel/reboot_fixups_32.c b/arch/x86/kernel/reboot_fixups_32.c index c8e41e9..b9b8ec9 100644 --- a/arch/x86/kernel/reboot_fixups_32.c +++ b/arch/x86/kernel/reboot_fixups_32.c @@ -89,6 +89,10 @@ void mach_reboot_fixups(void) if (in_interrupt()) return; + /* During early boot phase, PCI is not yet initialized */ + if (system_state == SYSTEM_BOOTING) + return; + for (i=0; i < ARRAY_SIZE(fixups_table); i++) { cur = &(fixups_table[i]); dev = pci_get_device(cur->vendor, cur->device, NULL); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 0/8] remove vm_struct list management
Hello, Atsushi. 2012/12/12 Atsushi Kumagai kumagai-atsu...@mxc.nes.nec.co.jp: Hello, On Tue, 11 Dec 2012 17:17:05 -0500 (EST) Dave Anderson ander...@redhat.com wrote: - Original Message - On Mon, Dec 10, 2012 at 11:40:47PM +0900, JoonSoo Kim wrote: [..] So without knowing details of both the data structures, I think if vmlist is going away, then user space tools should be able to traverse vmap_area_root rb tree. I am assuming it is sorted using -addr field and we should be able to get vmalloc area start from there. It will just be a matter of exporting right fields to user space (instead of vmlist). There is address sorted list of vmap_area, vmap_area_list. So we can use it for traversing vmalloc areas if it is necessary. But, as I mentioned before, kexec write *just* address of vmlist and offset of vm_struct's address field. It imply that they don't traverse vmlist, because they didn't write vm_struct's next field which is needed for traversing. Without vm_struct's next field, they have no method for traversing. So, IMHO, assigning dummy vm_struct to vmlist which is implemented by [7/8] is a safe way to maintain a compatibility of userspace tool. :) Actually the design of makedumpfile and crash tool is that they know about kernel data structures and they adopt to changes. So for major changes they keep track of kernel version numbers and if access the data structures accordingly. Currently we access first element of vmlist to determine start of vmalloc address. True we don't have to traverse the list. But as you mentioned we should be able to get same information by traversing to left most element of vmap_area_list rb tree. So I think instead of trying to retain vmlist first element just for backward compatibility, I will rather prefer get rid of that code completely from kernel and let user space tool traverse rbtree. Just export minimum needed info for traversal in user space. There's no need to traverse the rbtree. There is a vmap_area_list linked list of vmap_area structures that is also sorted by virtual address. All that makedumpfile would have to do is to access the first vmap_area in the vmap_area_list -- as opposed to the way that it does now, which is by accessing the first vm_struct in the to-be-obsoleted vmlist list. So it seems silly to keep the dummy vmlist around. I think so, I will modify makedumpfile to get the start address of vmalloc with vmap_area_list if the related symbols are provided as VMCOREINFO like vmlist. BTW, have we to consider other tools ? If it is clear, I think we can get rid of the dummy vmlist. Good! In next spin, I will remove dummy vmlist and export vmap_area_list symbol for makedumpfile. I don't know any other tools. If anyone knows it, please let me know. Thanks! Atsushi, Dave and Vivek. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
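As a sketch of the follow-up mentioned here (the exact symbols and offsets are assumptions about the next spin, not a merged patch), the VMCOREINFO note in kernel/kexec.c could carry enough for makedumpfile to find the first vmap_area:

	VMCOREINFO_SYMBOL(vmap_area_list);
	VMCOREINFO_STRUCT_SIZE(vmap_area);
	VMCOREINFO_OFFSET(vmap_area, va_start);
	VMCOREINFO_OFFSET(vmap_area, list);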
Re: [PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area
2012/12/7 JoonSoo Kim js1...@gmail.com: 2012/11/28 Joonsoo Kim js1...@gmail.com: In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas for deciding whether it is matched or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap, we check 300 areas in now. If we construct a extra list for static mapped area, we can eliminate above mentioned overhead. With a extra list, if there is one static mapped area, we just check only one area and proceed next operation quickly. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used as least as possible in outside of vmalloc.c Changelog v1-v2: [2/3]: patch description is improved. Rebased on v3.7-rc7 Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 Hello, Russell. Could you review this patchset, please? I send another patchset to mm community on top of this. That one is related to this patchset, so I want to get a review about this patchset :) Thanks. Hello. Is there anyone to review this patchset? Please let me know what I should do in order to take a review :) Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] mm: introduce numa_zero_pfn
Currently, we use just *one* zero page regardless of user process' node. When user process read zero page, at first, cpu should load this to cpu cache. If node of cpu is not same as node of zero page, loading takes long time. If we make zero pages for each nodes and use them adequetly, we can reduce this overhead. This patch implement basic infrastructure for numa_zero_pfn. It is default disabled, because it doesn't provide page coloring and some architecture use page coloring for zero page. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/Kconfig b/mm/Kconfig index a3f8ddd..de0ab65 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -412,3 +412,8 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. + +config NUMA_ZERO_PFN + bool Enable NUMA-aware zero page handling + depends on NUMA + default n diff --git a/mm/memory.c b/mm/memory.c index 221fc9f..e7d3969 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -112,12 +112,43 @@ __setup(norandmaps, disable_randmaps); unsigned long zero_pfn __read_mostly; unsigned long highest_memmap_pfn __read_mostly; +#ifdef CONFIG_NUMA_ZERO_PFN +unsigned long node_to_zero_pfn[MAX_NUMNODES] __read_mostly; + +/* Should be called after zero_pfn initialization */ +static void __init init_numa_zero_pfn(void) +{ + unsigned int node; + + if (nr_node_ids == 1) + return; + + for_each_node_state(node, N_POSSIBLE) { + node_to_zero_pfn[node] = zero_pfn; + } + + for_each_node_state(node, N_HIGH_MEMORY) { + struct page *page; + page = alloc_pages_exact_node(node, + GFP_HIGHUSER | __GFP_ZERO, 0); + if (!page) + continue; + + node_to_zero_pfn[node] = page_to_pfn(page); + } +} +#else +static inline void __init init_numa_zero_pfn(void) {} +#endif + /* * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init() */ static int __init init_zero_pfn(void) { zero_pfn = page_to_pfn(ZERO_PAGE(0)); + init_numa_zero_pfn(); + return 0; } core_initcall(init_zero_pfn); @@ -717,6 +748,24 @@ static inline bool is_cow_mapping(vm_flags_t flags) return (flags (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; } +#ifdef CONFIG_NUMA_ZERO_PFN +static inline int is_numa_zero_pfn(unsigned long pfn) +{ + return zero_pfn == pfn || node_to_zero_pfn[pfn_to_nid(pfn)] == pfn; +} + +static inline unsigned long my_numa_zero_pfn(unsigned long addr) +{ + if (nr_node_ids == 1) + return zero_pfn; + + return node_to_zero_pfn[numa_node_id()]; +} + +#define is_zero_pfn is_numa_zero_pfn +#define my_zero_pfn my_numa_zero_pfn +#endif /* CONFIG_NUMA_ZERO_PFN */ + #ifndef is_zero_pfn static inline int is_zero_pfn(unsigned long pfn) { -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
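For context on why overriding my_zero_pfn() is sufficient: the anonymous read-fault path already goes through it, so with CONFIG_NUMA_ZERO_PFN the node-local zero page is installed transparently. The snippet below is the existing consumer in do_anonymous_page() (mm/memory.c), quoted from memory for illustration rather than taken from this patch:

	if (!(flags & FAULT_FLAG_WRITE)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
						vma->vm_page_prot));
		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (!pte_none(*page_table))
			goto unlock;
		goto setpte;
	}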
Re: [PATCH] mm: introduce numa_zero_pfn
2012/12/13 Andi Kleen a...@firstfloor.org: I would expect a processor to fetch the zero page cachelines from the l3 cache from other sockets avoiding memory transactions altogether. The zero page is likely in use somewhere so typically no memory accesses should occur in a system. It depends on how effectively the workload uses the caches. If something is a cache pig of the L3 cache, then even shareable cache lines may need to be refetched regularly. But if your workload spends a significant part of its time reading from zero page read only data there is something wrong with the workload. I would do some data profiling first to really prove that is the case. Okay. I didn't know about the L3 cache before. Now, I think that I need some data profiling! Thanks for the comment. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 0/3] introduce static_vm for ARM-specific static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. We unnecessarily iterate vmlist for finding matched area even if there is no static mapped area. And if there are some static mapped areas, iterating whole vmlist is not preferable. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as least as possible. In the near future, I will try to remove other architecture dependency on vmalloc layer. This is just RFC patch and I did compile-test only. If you have any good suggestion, please let me know. These are based on v3.7-rc5. Thanks. Joonsoo Kim (3): ARM: vmregion: remove vmregion code entirely ARM: static_vm: introduce an infrastructure for static mapped area ARM: mm: use static_vm for managing static mapped areas arch/arm/include/asm/mach/static_vm.h | 51 arch/arm/mm/Makefile |2 +- arch/arm/mm/ioremap.c | 69 --- arch/arm/mm/mm.h | 10 -- arch/arm/mm/mmu.c | 55 + arch/arm/mm/static_vm.c | 97 arch/arm/mm/vmregion.c| 205 - arch/arm/mm/vmregion.h| 31 - 8 files changed, 208 insertions(+), 312 deletions(-) create mode 100644 arch/arm/include/asm/mach/static_vm.h create mode 100644 arch/arm/mm/static_vm.c delete mode 100644 arch/arm/mm/vmregion.c delete mode 100644 arch/arm/mm/vmregion.h -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC PATCH 1/3] ARM: vmregion: remove vmregion code entirely
Now, there is no user for vmregion. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 8a9c4cb..4e333fa 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o vmregion.o + mmap.o pgd.o mmu.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c deleted file mode 100644 index a631016..000 --- a/arch/arm/mm/vmregion.c +++ /dev/null @@ -1,205 +0,0 @@ -#include linux/fs.h -#include linux/spinlock.h -#include linux/list.h -#include linux/proc_fs.h -#include linux/seq_file.h -#include linux/slab.h - -#include vmregion.h - -/* - * VM region handling support. - * - * This should become something generic, handling VM region allocations for - * vmalloc and similar (ioremap, module space, etc). - * - * I envisage vmalloc()'s supporting vm_struct becoming: - * - * struct vm_struct { - *struct vmregion region; - *unsigned longflags; - *struct page **pages; - *unsigned int nr_pages; - *unsigned longphys_addr; - * }; - * - * get_vm_area() would then call vmregion_alloc with an appropriate - * struct vmregion head (eg): - * - * struct vmregion vmalloc_head = { - * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list), - * .vm_start = VMALLOC_START, - * .vm_end = VMALLOC_END, - * }; - * - * However, vmalloc_head.vm_start is variable (typically, it is dependent on - * the amount of RAM found at boot time.) I would imagine that get_vm_area() - * would have to initialise this each time prior to calling vmregion_alloc(). - */ - -struct arm_vmregion * -arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align, - size_t size, gfp_t gfp, const void *caller) -{ - unsigned long start = head-vm_start, addr = head-vm_end; - unsigned long flags; - struct arm_vmregion *c, *new; - - if (head-vm_end - head-vm_start size) { - printk(KERN_WARNING %s: allocation too big (requested %#x)\n, - __func__, size); - goto out; - } - - new = kmalloc(sizeof(struct arm_vmregion), gfp); - if (!new) - goto out; - - new-caller = caller; - - spin_lock_irqsave(head-vm_lock, flags); - - addr = rounddown(addr - size, align); - list_for_each_entry_reverse(c, head-vm_list, vm_list) { - if (addr = c-vm_end) - goto found; - addr = rounddown(c-vm_start - size, align); - if (addr start) - goto nospc; - } - - found: - /* -* Insert this entry after the one we found. 
-*/ - list_add(new-vm_list, c-vm_list); - new-vm_start = addr; - new-vm_end = addr + size; - new-vm_active = 1; - - spin_unlock_irqrestore(head-vm_lock, flags); - return new; - - nospc: - spin_unlock_irqrestore(head-vm_lock, flags); - kfree(new); - out: - return NULL; -} - -static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - - list_for_each_entry(c, head-vm_list, vm_list) { - if (c-vm_active c-vm_start == addr) - goto out; - } - c = NULL; - out: - return c; -} - -struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, unsigned long addr) -{ - struct arm_vmregion *c; - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - c = __arm_vmregion_find(head, addr); - if (c) - c-vm_active = 0; - spin_unlock_irqrestore(head-vm_lock, flags); - return c; -} - -void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c) -{ - unsigned long flags; - - spin_lock_irqsave(head-vm_lock, flags); - list_del(c-vm_list); - spin_unlock_irqrestore(head-vm_lock, flags); - - kfree(c); -} - -#ifdef CONFIG_PROC_FS -static int arm_vmregion_show(struct seq_file *m, void *p) -{ - struct arm_vmregion *c = list_entry(p, struct arm_vmregion, vm_list); - - seq_printf(m, 0x%08lx-0x%08lx %7lu, c-vm_start, c-vm_end, - c-vm_end - c-vm_start); - if (c-caller
[RFC PATCH 2/3] ARM: static_vm: introduce an infrastructure for static mapped area
In current implementation, we used ARM-specific flag, that is, VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area. The purpose of static mapped area is to re-use static mapped area when entire physical address range of the ioremap request can be covered by this area. This implementation causes needless overhead for some cases. We unnecessarily iterate vmlist for finding matched area even if there is no static mapped area. And if there are some static mapped areas, iterating whole vmlist is not preferable. In fact, it is not a critical problem, because ioremap is not frequently used. But reducing overhead is better idea. Another reason for doing this work is for removing architecture dependency on vmalloc layer. I think that vmlist and vmlist_lock is internal data structure for vmalloc layer. Some codes for debugging and stat inevitably use vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as least as possible. Now, I introduce an ARM-specific infrastructure for static mapped area. In the following patch, we will use this and resolve above mentioned problem. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h new file mode 100644 index 000..1bb6604 --- /dev/null +++ b/arch/arm/include/asm/mach/static_vm.h @@ -0,0 +1,45 @@ +/* + * arch/arm/include/asm/mach/static_vm.h + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _ASM_MACH_STATIC_VM_H +#define _ASM_MACH_STATIC_VM_H + +#include linux/types.h +#include linux/vmalloc.h + +struct static_vm { + struct static_vm*next; + void*vaddr; + unsigned long size; + unsigned long flags; + phys_addr_t paddr; + const void *caller; +}; + +extern struct static_vm *static_vmlist; +extern spinlock_t static_vmlist_lock; + +extern struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags); +extern struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long flags); +extern void init_static_vm(struct static_vm *static_vm, + struct vm_struct *vm, unsigned long flags); +extern void insert_static_vm(struct static_vm *vm); + +#endif /* _ASM_MACH_STATIC_VM_H */ diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile index 4e333fa..57b329a 100644 --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -6,7 +6,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ iomap.o obj-$(CONFIG_MMU) += fault-armv.o flush.o idmap.o ioremap.o \ - mmap.o pgd.o mmu.o + mmap.o pgd.o mmu.o static_vm.o ifneq ($(CONFIG_MMU),y) obj-y += nommu.o diff --git a/arch/arm/mm/static_vm.c b/arch/arm/mm/static_vm.c new file mode 100644 index 000..d7677cf --- /dev/null +++ b/arch/arm/mm/static_vm.c @@ -0,0 +1,97 @@ +/* + * arch/arm/mm/static_vm.c + * + * Copyright (C) 2012 LG Electronics, Joonsoo Kim js1...@gmail.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/spinlock.h + +#include asm/mach/static_vm.h + +struct static_vm *static_vmlist; +DEFINE_SPINLOCK(static_vmlist_lock); + +struct static_vm *find_static_vm_paddr(phys_addr_t paddr, + size_t size, unsigned long flags) +{ + struct static_vm *area; + + spin_lock(static_vmlist_lock); + for (area = static_vmlist; area; area = area-next) { + if ((area-flags flags) != flags
[RFC PATCH 3/3] ARM: mm: use static_vm for managing static mapped areas
A static mapped area is ARM-specific, so it is better not to use generic vmalloc data structure, that is, vmlist and vmlist_lock for managing static mapped area. And it causes some needless overhead and reducing this overhead is better idea. Now, we have newly introduced static_vm infrastructure. With it, we don't need to iterate all mapped areas. Instead, we just iterate static mapped areas. It helps to reduce an overhead of finding matched area. And architecture dependency on vmalloc layer is removed, so it will help to maintainability for vmalloc layer. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/arm/include/asm/mach/static_vm.h b/arch/arm/include/asm/mach/static_vm.h index 1bb6604..0d9c685 100644 --- a/arch/arm/include/asm/mach/static_vm.h +++ b/arch/arm/include/asm/mach/static_vm.h @@ -32,6 +32,12 @@ struct static_vm { const void *caller; }; +#define STATIC_VM_MEM 0x0001 +#define STATIC_VM_EMPTY0x0002 + +/* mtype should be less than 28 */ +#define STATIC_VM_MTYPE(mt)(1UL ((mt) + 4)) + extern struct static_vm *static_vmlist; extern spinlock_t static_vmlist_lock; diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c index 5dcc2fd..b7f3c27 100644 --- a/arch/arm/mm/ioremap.c +++ b/arch/arm/mm/ioremap.c @@ -36,6 +36,7 @@ #include asm/system_info.h #include asm/mach/map.h +#include asm/mach/static_vm.h #include asm/mach/pci.h #include mm.h @@ -197,7 +198,8 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, const struct mem_type *type; int err; unsigned long addr; - struct vm_struct * area; + struct vm_struct *area; + phys_addr_t paddr = __pfn_to_phys(pfn); #ifndef CONFIG_ARM_LPAE /* @@ -219,24 +221,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, /* * Try to reuse one of the static mapping whenever possible. 
*/ - read_lock(vmlist_lock); - for (area = vmlist; area; area = area-next) { - if (!size || (sizeof(phys_addr_t) == 4 pfn = 0x10)) - break; - if (!(area-flags VM_ARM_STATIC_MAPPING)) - continue; - if ((area-flags VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype)) - continue; - if (__phys_to_pfn(area-phys_addr) pfn || - __pfn_to_phys(pfn) + size-1 area-phys_addr + area-size-1) - continue; - /* we can drop the lock here as we know *area is static */ - read_unlock(vmlist_lock); - addr = (unsigned long)area-addr; - addr += __pfn_to_phys(pfn) - area-phys_addr; - return (void __iomem *) (offset + addr); + if (size !((sizeof(phys_addr_t) == 4 pfn = 0x10))) { + struct static_vm *static_vm; + + static_vm = find_static_vm_paddr(__pfn_to_phys(pfn), size, + STATIC_VM_MEM | STATIC_VM_MTYPE(mtype)); + if (static_vm) { + addr = (unsigned long)static_vm-vaddr; + addr += paddr - static_vm-paddr; + return (void __iomem *) (offset + addr); + } } - read_unlock(vmlist_lock); /* * Don't allow RAM to be mapped - this causes problems with ARMv6+ @@ -248,7 +243,7 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn, if (!area) return NULL; addr = (unsigned long)area-addr; - area-phys_addr = __pfn_to_phys(pfn); + area-phys_addr = paddr; #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) if (DOMAIN_IO == 0 @@ -346,34 +341,20 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, bool cached) void __iounmap(volatile void __iomem *io_addr) { void *addr = (void *)(PAGE_MASK (unsigned long)io_addr); - struct vm_struct *vm; - - read_lock(vmlist_lock); - for (vm = vmlist; vm; vm = vm-next) { - if (vm-addr addr) - break; - if (!(vm-flags VM_IOREMAP)) - continue; - /* If this is a static mapping we must leave it alone */ - if ((vm-flags VM_ARM_STATIC_MAPPING) - (vm-addr = addr) (vm-addr + vm-size addr)) { - read_unlock(vmlist_lock); - return; - } + struct static_vm *static_vm; + + static_vm = find_static_vm_vaddr(addr, STATIC_VM_MEM); + if (static_vm) + return; + #if !defined(CONFIG_SMP) !defined(CONFIG_ARM_LPAE) - /* -* If this is a section based mapping we need to handle it -* specially as the VM subsystem does not know how to handle -* such a beast. -*/ - if ((vm-addr == addr
Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
Hi, Minchan. 2012/11/14 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote: 2012/11/13 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote: 2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped. 
do kmap_flush_unused() => flush index 10, 11 => last_pkmap = 10; do kunmap() with index 17; do kmap_flush_unused() => flush index 17 => last_pkmap = 17? In this case, an unconditional last_pkmap_nr update skips one kunmapped index. So, a conditional update is needed. Thanks for pointing out, Joonsoo. You're right. I misunderstood your flush_all_zero_pkmaps change. With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index. What's the benefit of returning the first flushed free slot index rather than the free slot index? If flush_all_zero_pkmaps() returned the free slot index rather than the first flushed free slot, we would need another comparison like 'if pkmap_count[i] == 0' and another local variable for determining whether a flush occurred or not. I want to minimize this overhead and churning of the code, although they are negligible. I think
[RFC PATCH] mm: WARN_ON_ONCE if f_op->mmap() changes vma's start address
While reviewing the source code, I found a comment which mentions that after f_op->mmap(), vma's start address can be changed. I didn't verify that it is really possible, because there are so many f_op->mmap() implementations. But if there is some mmap() which changes vma's start address, it is a possible error situation, because we have already prepared prev vma, rb_link and rb_parent, and these are related to the original address. So add WARN_ON_ONCE to find out whether this situation really happens. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/mmap.c b/mm/mmap.c index 2d94235..36567b7 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1333,7 +1333,11 @@ munmap_back: * * Answer: Yes, several device drivers can do it in their * f_op->mmap method. -DaveM + * Bug: If addr is changed, prev, rb_link, rb_parent should + * be updated for vma_link() */ + WARN_ON_ONCE(addr != vma->vm_start); + addr = vma->vm_start; pgoff = vma->vm_pgoff; vm_flags = vma->vm_flags; -- 1.7.9.5
Re: [RFC PATCH 0/3] introduce static_vm for ARM-specific static mapped area
Hello, Russell. Thanks for the review. 2012/11/15 Russell King - ARM Linux li...@arm.linux.org.uk: On Thu, Nov 15, 2012 at 01:55:51AM +0900, Joonsoo Kim wrote: In the current implementation, we use an ARM-specific flag, VM_ARM_STATIC_MAPPING, to distinguish the ARM-specific static mapped areas. The purpose of a static mapped area is to re-use it when the entire physical address range of an ioremap request can be covered by that area. This implementation causes needless overhead in some cases. In what cases? For example, assume that there is only one static mapped area and vmlist has 300 areas. Every time we call ioremap, we check 300 areas to decide whether one matches or not. Moreover, even if there is no static mapped area and vmlist has 300 areas, every time we call ioremap we still check all 300 areas. We unnecessarily iterate vmlist to find a matching area even if there is no static mapped area. And if there are some static mapped areas, iterating the whole vmlist is not preferable. Why not? Please put some explanation into your message rather than just statements making unexplained assertions. If we construct an extra list for static mapped areas, we can eliminate the above-mentioned overhead. With an extra list, if there is one static mapped area, we check only that one area and proceed to the next operation quickly. Another reason for doing this work is to remove the architecture dependency on the vmalloc layer. I think that vmlist and vmlist_lock are internal data structures of the vmalloc layer. Some code for debugging and statistics inevitably uses vmlist and vmlist_lock. But it is preferable that they are used outside of vmalloc.c as little as possible. The vmalloc layer is also made available for ioremap use, and it is intended that architectures hook into this for ioremap support. Yes. But I think that it is preferable to use a well-defined vmalloc API rather than directly manipulating a low-level data structure. IMHO, if there is no suitable vmalloc API, making a new one is better than directly manipulating the low-level data structure. It makes the vmalloc code more maintainable. I'm not an expert, so please let me know what I missed. Thanks.
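To illustrate the idea of keeping the static mappings on a dedicated, short list instead of scanning the whole vmlist on every ioremap, here is a rough user-space sketch; the structure layout and function names are purely illustrative and are not the actual patch:

#include <stdio.h>

/* a dedicated list of boot-time static mappings, kept separate from
 * the (potentially hundreds of) dynamically created vm areas */
struct static_vm {
	unsigned long phys_start;
	unsigned long phys_end;
	void *virt;
	struct static_vm *next;
};

static struct static_vm *static_vmlist;	/* usually only a few entries */

/* try to re-use a static mapping that covers [paddr, paddr + size) */
static void *find_static_mapping(unsigned long paddr, unsigned long size)
{
	struct static_vm *svm;

	for (svm = static_vmlist; svm; svm = svm->next) {
		if (paddr >= svm->phys_start && paddr + size <= svm->phys_end)
			return (char *)svm->virt + (paddr - svm->phys_start);
	}
	return NULL;	/* caller falls back to creating a new mapping */
}

int main(void)
{
	static struct static_vm uart = {
		.phys_start = 0x10000000, .phys_end = 0x10100000,
		.virt = (void *)0xf8000000UL,
	};

	static_vmlist = &uart;
	printf("covered request:   %p\n", find_static_mapping(0x10001000, 0x1000));
	printf("uncovered request: %p\n", find_static_mapping(0x20000000, 0x1000));
	return 0;
}

With a list like this, ioremap only walks the handful of static entries before deciding to create a new mapping, instead of iterating every vm area.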
Re: [PATCH] slub: assign refcount for kmalloc_caches
On Thu, Jan 10, 2013 at 08:47:39PM -0800, Paul Hargrove wrote: I just had a look at patch-3.7.2-rc1, and this change doesn't appear to have made it in yet. Am I missing something? -Paul I will check it. Ccing Greg. Hello, Pekka and Greg. In v3.8-rcX this is already fixed by other changes, but it is not a simple change. So I made a new patch and sent it. How does this kind of patch (only for stable v3.7) go into the stable tree? Through Pekka's slab tree, or should I send it to Greg directly? I don't know exactly how to submit this kind of patch to the stable tree. Could anyone help me? Thanks. On Tue, Dec 25, 2012 at 7:30 AM, JoonSoo Kim js1...@gmail.com wrote: 2012/12/26 Joonsoo Kim js1...@gmail.com: commit cce89f4f6911286500cf7be0363f46c9b0a12ce0 ('Move kmem_cache refcounting to common code') moves some refcount manipulation code to common code. Unfortunately, it also removed the refcount assignment for kmalloc_caches. So, kmalloc_caches's refcount is initially 0. This makes an erroneous situation. Paul Hargrove reported that when he creates an 8-byte kmem_cache and destroys it, he encounters the message below. 'Objects remaining in kmalloc-8 on kmem_cache_close()' The 8-byte kmem_cache is merged with the 8-byte kmalloc cache and the refcount is increased by one. So, the resulting refcount is 1. When destroying it, the refcount hits 0, then kmem_cache_close() is executed and the error message is printed. This patch assigns an initial refcount of 1 to kmalloc_caches, fixing this erroneous situation. Cc: sta...@vger.kernel.org # v3.7 Cc: Christoph Lameter c...@linux.com Reported-by: Paul Hargrove phhargr...@lbl.gov Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/slub.c b/mm/slub.c index a0d6984..321afab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3279,6 +3279,7 @@ static struct kmem_cache *__init create_kmalloc_cache(const char *name, if (kmem_cache_open(s, flags)) goto panic; + s->refcount = 1; list_add(&s->list, &slab_caches); return s; -- 1.7.9.5 I missed some explanation. In v3.8-rc1, this problem is already solved. See create_kmalloc_cache() in mm/slab_common.c. So this patch is just for v3.7 stable. -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Computer and Data Sciences Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
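As a toy model of the refcount sequence described above (pure illustration, not the slub code):

#include <stdio.h>

int main(void)
{
	int refcount_before_fix = 0;	/* create_kmalloc_cache() left it unset */
	int refcount_after_fix = 1;	/* s->refcount = 1 with the patch */

	/* a user's 8-byte kmem_cache is merged into kmalloc-8 */
	refcount_before_fix++;
	refcount_after_fix++;

	/* that merged cache is destroyed again */
	refcount_before_fix--;
	refcount_after_fix--;

	printf("before fix: refcount %d, kmem_cache_close() tears down kmalloc-8\n",
	       refcount_before_fix);
	printf("after fix:  refcount %d, kmalloc-8 stays alive\n",
	       refcount_after_fix);
	return 0;
}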
[PATCH 1/3] slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects. After acquire_slab() is executed at first, page-inuse is same as page-objects, then, available is always 0. So, we always go next iteration. Correct it to stop a iteration when we find sufficient objects at first iteration. After that, we don't need return value of put_cpu_partial(). So remove it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com --- These are based on v3.8-rc3 and there is no dependency between each other. If rebase is needed, please notify me. diff --git a/mm/slub.c b/mm/slub.c index ba2ca53..abef30e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1493,7 +1493,7 @@ static inline void remove_partial(struct kmem_cache_node *n, */ static inline void *acquire_slab(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page, - int mode) + int mode, int *objects) { void *freelist; unsigned long counters; @@ -1506,6 +1506,7 @@ static inline void *acquire_slab(struct kmem_cache *s, */ freelist = page-freelist; counters = page-counters; + *objects = page-objects - page-inuse; new.counters = counters; if (mode) { new.inuse = page-objects; @@ -1528,7 +1529,7 @@ static inline void *acquire_slab(struct kmem_cache *s, return freelist; } -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags); /* @@ -1539,6 +1540,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, { struct page *page, *page2; void *object = NULL; + int available = 0; + int objects; /* * Racy check. If we mistakenly see no partial slabs then we @@ -1552,22 +1555,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, spin_lock(n-list_lock); list_for_each_entry_safe(page, page2, n-partial, lru) { void *t; - int available; if (!pfmemalloc_match(page, flags)) continue; - t = acquire_slab(s, n, page, object == NULL); + t = acquire_slab(s, n, page, object == NULL, objects); if (!t) break; + available += objects; if (!object) { c-page = page; stat(s, ALLOC_FROM_PARTIAL); object = t; - available = page-objects - page-inuse; } else { - available = put_cpu_partial(s, page, 0); + put_cpu_partial(s, page, 0); stat(s, CPU_PARTIAL_NODE); } if (kmem_cache_debug(s) || available s-cpu_partial / 2) @@ -1946,7 +1948,7 @@ static void unfreeze_partials(struct kmem_cache *s, * If we did not find a slot then simply move all the partials to the * per node partial list. */ -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) { struct page *oldpage; int pages; @@ -1984,7 +1986,6 @@ static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) page-next = oldpage; } while (this_cpu_cmpxchg(s-cpu_slab-partial, oldpage, page) != oldpage); - return pobjects; } static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
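The ordering problem is easier to see in isolation. A toy illustration (not the slub code itself) of why reading page->inuse after acquire_slab() has frozen the slab always yields zero availability:

#include <stdio.h>

struct toy_page { int objects; int inuse; };

/* old order: freeze the slab first, then compute availability */
static int acquire_then_count(struct toy_page *p)
{
	p->inuse = p->objects;		/* what acquire_slab() does for the first slab */
	return p->objects - p->inuse;	/* always 0 */
}

/* fixed order: record availability from the snapshot, then freeze */
static int count_then_acquire(struct toy_page *p)
{
	int available = p->objects - p->inuse;

	p->inuse = p->objects;
	return available;
}

int main(void)
{
	struct toy_page a = { .objects = 16, .inuse = 5 };
	struct toy_page b = { .objects = 16, .inuse = 5 };

	printf("old order:   %d objects available\n", acquire_then_count(&a));
	printf("fixed order: %d objects available\n", count_then_acquire(&b));
	return 0;
}

Because the old code always saw 0 available objects, the 'available > s->cpu_partial / 2' cut-off never fired on the first pass, which is exactly the behaviour the patch corrects.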
[PATCH 3/3] slub: add 'likely' macro to inc_slabs_node()
After the boot phase, 'n' always exists. So add the 'likely' macro to help the compiler. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index 830348b..6f82070 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1005,7 +1005,7 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects) * dilemma by deferring the increment of the count during * bootstrap (see early_kmem_cache_node_alloc). */ - if (n) { + if (likely(n)) { atomic_long_inc(&n->nr_slabs); atomic_long_add(objects, &n->total_objects); } -- 1.7.9.5
[PATCH 2/3] slub: correct bootstrap() for kmem_cache, kmem_cache_node
Current implementation of bootstrap() is not sufficient for kmem_cache and kmem_cache_node. First, for kmem_cache. bootstrap() call kmem_cache_zalloc() at first. When kmem_cache_zalloc() is called, kmem_cache's slab is moved to cpu slab for satisfying kmem_cache allocation request. In current implementation, we only consider n-partial slabs, so, we miss this cpu slab for kmem_cache. Second, for kmem_cache_node. When slab_state = PARTIAL, create_boot_cache() is called. And then, kmem_cache_node's slab is moved to cpu slab for satisfying kmem_cache_node allocation request. So, we also miss this slab. These didn't make any error previously, because we normally don't free objects which comes from kmem_cache's first slab and kmem_cache_node's. Problem will be solved if we consider a cpu slab in bootstrap(). This patch implement it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index abef30e..830348b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3613,11 +3613,22 @@ static int slab_memory_callback(struct notifier_block *self, static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache) { + int cpu; int node; struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT); memcpy(s, static_cache, kmem_cache-object_size); + for_each_possible_cpu(cpu) { + struct kmem_cache_cpu *c; + struct page *p; + + c = per_cpu_ptr(s-cpu_slab, cpu); + p = c-page; + if (p) + p-slab_cache = s; + } + for_each_node_state(node, N_NORMAL_MEMORY) { struct kmem_cache_node *n = get_node(s, node); struct page *p; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] tools, vm: add .gitignore to ignore built binaries
There is no .gitignore in tools/vm, so 'git status' always shows the built binaries. To ignore them, add a .gitignore. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/tools/vm/.gitignore b/tools/vm/.gitignore new file mode 100644 index 000..44f095f --- /dev/null +++ b/tools/vm/.gitignore @@ -0,0 +1,2 @@ +slabinfo +page-types -- 1.7.9.5
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On Fri, Aug 02, 2013 at 12:20:40PM +0200, Peter Zijlstra wrote: On Fri, Aug 02, 2013 at 06:05:51PM +0900, Joonsoo Kim wrote: What is with you people; have you never learned to trim emails? Seriously, I'm going to write a script which tests for too many quoted lines, too many nested quotes and other such nonsense and drops your emails into /dev/null. Hello, Peter. Sorry about that. I will be more careful about this from now on. Thanks.
Re: [PATCH v2 2/3] sched: factor out code to should_we_balance()
On Mon, Aug 05, 2013 at 09:52:28AM +0530, Preeti U Murthy wrote: On 08/02/2013 04:02 PM, Peter Zijlstra wrote: On Fri, Aug 02, 2013 at 02:56:14PM +0530, Preeti U Murthy wrote: You need to iterate over all the groups of the sched domain env-sd and not just the first group of env-sd like you are doing above. This is to I don't think so. IIRC, env-sd-groups always means local group, so we don't need to find our group by iterating over all the groups. Take a look at update_sd_lb_stats(). That should clarify this. There is an exclusive local_group check there. sd-groups points to the first group in the list of groups under this sd. Take a look at: 88b8dac0a Ah ok! Thanks for this pointer. Apologies for having overlooked the fact that the sd-groups always points to the group to which the balance_cpu belongs. And subsequent dst_cpus for retries of load balancing also belong to the same group as the balance_cpu. This patch thus looks fine to me. Okay! Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 3/3] sched: clean-up struct sd_lb_stat
+ if (busiest-group_imb) { + busiest-sum_weighted_load = + min(busiest-sum_weighted_load, sds-sd_avg_load); Right here we get confused as to why the total load is being compared against load per task (although you are changing it to load per task above). Yes, you are right. I will add load_per_task to struct sg_lb_stats. @@ -4771,12 +4763,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * Be careful of negative numbers as they'll appear as very large values * with unsigned longs. */ - max_pull = min(sds-max_load - sds-avg_load, load_above_capacity); + max_pull = min(busiest-avg_load - sds-sd_avg_load, + load_above_capacity); This is ok, but readability wise max_load is much better. max_load signifies the maximum load per task on a group in this sd, and avg_load signifies the total load per task across the sd. You are checking if there is imbalance in the total load in the sd and try to even it out across the sd. Here busiest does not convey this immediately. You may be confused. max_load doesn't means the maximum load per task. It means that the busiest group's load per cpu power. And here max doesn't mean maximum load in this sd, instead, load of busiest sg. IMO, busiest-avg_load convey proper meaning. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
Any mapping that doesn't use the reserved pool, not just MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping, then fork()s, and then the mapping is instantiated in the child, that will not draw from the reserved pool. Should we guarantee that they can allocate the last hugepage? They map a region with MAP_NORESERVE, so they shouldn't assume that their requests always succeed. If the pages are available, people get cranky if it fails for no apparent reason, MAP_NORESERVE or not. They get especially cranky if it sometimes fails and sometimes doesn't due to a race condition. Hello, Hmm... Okay. I will try to implement another way to protect against the race condition. Maybe it is best to use a mutex table :) Anyway, please give me some time, guys. Thank you very much for pointing that out.
Re: [PATCH 2/4] mm, migrate: allocate new page lazily in unmap_and_move()
get_new_page() sets up result to communicate error codes from the following checks. While the existing ones (page freed and thp split failed) don't change rc, somebody else might add a condition whose error code should be propagated back into *result but miss it. Please leave get_new_page() where it is. The win from this change is not big enough to risk these problems. Hello, Johannes. Okay. You are right. I will omit this patch next time. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/4] mm, page_alloc: add likely macro to help compiler optimization
Hello, Michal. On Fri, Aug 02, 2013 at 11:36:07PM +0200, Michal Hocko wrote: On Fri 02-08-13 16:47:10, Johannes Weiner wrote: On Fri, Aug 02, 2013 at 06:27:22PM +0200, Michal Hocko wrote: On Fri 02-08-13 11:07:56, Joonsoo Kim wrote: We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow path. For making fast path more faster, add likely macro to help compiler optimization. The code is different in mmotm tree (see mm: page_alloc: rearrange watermark checking in get_page_from_freelist) Yes, please rebase this on top. Besides that, make sure you provide numbers which prove your claims about performance optimizations. Isn't that a bit overkill? We know it's a likely path (we would deadlock constantly if a sizable portion of allocations were to ignore the watermarks). Does he have to justify that likely in general makes sense? That was more a generic comment. If there is a claim that something would be faster it would be nice to back that claim by some numbers (e.g. smaller hot path). In this particular case, unlikely(alloc_flags ALLOC_NO_WATERMARKS) doesn't make any change to the generated code with gcc 4.8.1 resp. 4.3.4 I have here. Maybe other versions of gcc would benefit from the hint but changelog didn't tell us. I wouldn't add the anotation if it doesn't make any difference for the resulting code. Hmm, Is there no change with gcc 4.8.1 and 4.3.4? I found a change with gcc 4.6.3 and v3.10 kernel. textdata bss dec hex filename 35683 1461 644 37788939c page_alloc_base.o 35715 1461 644 3782093bc page_alloc_patch.o Slightly larger (32 bytes) than before. And assembly code looks different as I expected. * Original code 17126 .LVL1518: 17127 .loc 2 1904 0 is_stmt 1 17128 testb $4, -116(%rbp) #, %sfp 17129 je .L866 #, (snip) 17974 .L866: 17975 .LBE6053: 17976 .LBE6052: 17977 .LBE6051: 17978 .LBE6073: 17979 .LBE6093: 17980 .LBB6094: 17981 .loc 2 1908 0 17982 movl-116(%rbp), %r14d # %sfp, D.42080 17983 .loc 2 1909 0 17984 movl-116(%rbp), %r8d# %sfp, 17985 movq%rbx, %rdi # prephitmp.1723, 17986 movl-212(%rbp), %ecx# %sfp, 17987 movl-80(%rbp), %esi # %sfp, 17988 .loc 2 1908 0 17989 andl$3, %r14d #, D.42080 17990 movslq %r14d, %rax # D.42080, D.42080 17991 movq(%rbx,%rax,8), %r13 # prephitmp.1723_268-watermark, mark 17992 .LVL1591: 17993 .loc 2 1909 0 17994 movq%r13, %rdx # mark, 17995 callzone_watermark_ok # On 17129 line, we check ALLOC_NO_WATERMARKS and if not matched, then jump to L866. L866 is on 17981 line. * Patched code 17122 .L807: 17123 .LVL1513: 17124 .loc 2 1904 0 is_stmt 1 17125 testb $4, -88(%rbp) #, %sfp 17126 jne .L811 #, 17127 .LBB6092: 17128 .loc 2 1908 0 17129 movl-88(%rbp), %r13d# %sfp, D.42082 17130 .loc 2 1909 0 17131 movl-88(%rbp), %r8d # %sfp, 17132 movq%rbx, %rdi # prephitmp.1723, 17133 movl-160(%rbp), %ecx# %sfp, 17134 movl-80(%rbp), %esi # %sfp, 17135 .loc 2 1908 0 17136 andl$3, %r13d #, D.42082 17137 movslq %r13d, %rax # D.42082, D.42082 17138 movq(%rbx,%rax,8), %r12 # prephitmp.1723_270-watermark, mark 17139 .LVL1514: 17140 .loc 2 1909 0 17141 movq%r12, %rdx # mark, 17142 callzone_watermark_ok # On 17124 line, we check ALLOC_NO_WATERMARKS (0x4) and if not matched, execute following code without jumping. This is effect of 'likely' macro. Isn't it reasonable? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
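For reference, likely() and unlikely() are thin wrappers around gcc's __builtin_expect(); the definitions below match the usual ones in include/linux/compiler.h. A minimal user-space file (the flag value and helper names are made up) that can be compiled with different gcc versions and inspected with objdump, the same way as the assembly above, to see whether the hint changes the branch layout:

/* likely.c: gcc -O2 -c likely.c && objdump -d likely.o */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

extern int common_path(int flags);	/* stands in for the watermark check */
extern int rare_path(int flags);	/* stands in for ALLOC_NO_WATERMARKS handling */

int without_hint(int flags)
{
	if (flags & 0x4)
		return rare_path(flags);
	return common_path(flags);
}

int with_hint(int flags)
{
	if (unlikely(flags & 0x4))
		return rare_path(flags);
	return common_path(flags);	/* the hint asks gcc to keep this on the fall-through path */
}

Whether the two functions end up with different code depends on the compiler version, which is exactly the point being debated here.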
Re: [PATCH 1/4] mm, page_alloc: add likely macro to help compiler optimization
On Mon, Aug 05, 2013 at 05:10:08PM +0900, Joonsoo Kim wrote: Hello, Michal. On Fri, Aug 02, 2013 at 11:36:07PM +0200, Michal Hocko wrote: On Fri 02-08-13 16:47:10, Johannes Weiner wrote: On Fri, Aug 02, 2013 at 06:27:22PM +0200, Michal Hocko wrote: On Fri 02-08-13 11:07:56, Joonsoo Kim wrote: We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow path. For making fast path more faster, add likely macro to help compiler optimization. The code is different in mmotm tree (see mm: page_alloc: rearrange watermark checking in get_page_from_freelist) Yes, please rebase this on top. Besides that, make sure you provide numbers which prove your claims about performance optimizations. Isn't that a bit overkill? We know it's a likely path (we would deadlock constantly if a sizable portion of allocations were to ignore the watermarks). Does he have to justify that likely in general makes sense? That was more a generic comment. If there is a claim that something would be faster it would be nice to back that claim by some numbers (e.g. smaller hot path). In this particular case, unlikely(alloc_flags ALLOC_NO_WATERMARKS) doesn't make any change to the generated code with gcc 4.8.1 resp. 4.3.4 I have here. Maybe other versions of gcc would benefit from the hint but changelog didn't tell us. I wouldn't add the anotation if it doesn't make any difference for the resulting code. Hmm, Is there no change with gcc 4.8.1 and 4.3.4? I found a change with gcc 4.6.3 and v3.10 kernel. Ah... I did a test on mmotm (Johannes's git) and found that this patch doesn't make any effect. I guess, a change from Johannes ('rearrange watermark checking in get_page_from_freelist') already makes better code for !ALLOC_NO_WATERMARKS case. IMHO, although there is no effect, it is better to add likely macro, because arrangement can be changed from time to time without any consideration of assembly code generation. How about your opinion, Johannes and Michal? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/3] optimization, clean-up about fair.c
Patch 1 removes one division operation by replacing it with a multiplication. Patches 2 and 3 are clean-ups related to load_balance(); they reduce code size and may also improve readability. Feel free to comment on this patchset. It is tested on v3.10. On v3.11-rc3, it compiles properly. * Changes from v2 [2/3]: set initial value 1 for should_balance in rebalance_domains() as Vincent commented [3/3]: do not overwrite sum_weighted_load to represent load_per_task. Instead, use a new load_per_task field as Preeti commented * Changes from v1 Remove 2 patches, because I'm not sure they are right. Joonsoo Kim (3): sched: remove one division operation in find_busiest_queue() sched: factor out code to should_we_balance() sched: clean-up struct sd_lb_stat kernel/sched/fair.c | 326 +-- 1 file changed, 162 insertions(+), 164 deletions(-) -- 1.7.9.5
[PATCH v3 3/3] sched: clean-up struct sd_lb_stat
There is no reason to maintain separate variables for this_group and busiest_group in sd_lb_stat, except saving some space. But this structure is always allocated in stack, so this saving isn't really benificial. This patch unify these variables, so IMO, readability may be improved. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6732d2..f8a9660 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4232,36 +4232,6 @@ static unsigned long task_h_load(struct task_struct *p) /** Helpers for find_busiest_group / /* - * sd_lb_stats - Structure to store the statistics of a sched_domain - * during load balancing. - */ -struct sd_lb_stats { - struct sched_group *busiest; /* Busiest group in this sd */ - struct sched_group *this; /* Local group in this sd */ - unsigned long total_load; /* Total load of all groups in sd */ - unsigned long total_pwr; /* Total power of all groups in sd */ - unsigned long avg_load;/* Average load across all groups in sd */ - - /** Statistics of this group */ - unsigned long this_load; - unsigned long this_load_per_task; - unsigned long this_nr_running; - unsigned long this_has_capacity; - unsigned int this_idle_cpus; - - /* Statistics of the busiest group */ - unsigned int busiest_idle_cpus; - unsigned long max_load; - unsigned long busiest_load_per_task; - unsigned long busiest_nr_running; - unsigned long busiest_group_capacity; - unsigned long busiest_has_capacity; - unsigned int busiest_group_weight; - - int group_imb; /* Is there imbalance in this sd */ -}; - -/* * sg_lb_stats - stats of a sched_group required for load_balancing */ struct sg_lb_stats { @@ -4269,6 +4239,7 @@ struct sg_lb_stats { unsigned long group_load; /* Total load over the CPUs of the group */ unsigned long sum_nr_running; /* Nr tasks running in the group */ unsigned long sum_weighted_load; /* Weighted load of group's tasks */ + unsigned long load_per_task; unsigned long group_capacity; unsigned long idle_cpus; unsigned long group_weight; @@ -4276,6 +4247,24 @@ struct sg_lb_stats { int group_has_capacity; /* Is there extra capacity in the group? */ }; +/* + * sd_lb_stats - Structure to store the statistics of a sched_domain + * during load balancing. + */ +struct sd_lb_stats { + struct sched_group *busiest; /* Busiest group in this sd */ + struct sched_group *this; /* Local group in this sd */ + unsigned long total_load; /* Total load of all groups in sd */ + unsigned long total_pwr; /* Total power of all groups in sd */ + unsigned long sd_avg_load; /* Average load across all groups in sd */ + + /* Statistics of this group */ + struct sg_lb_stats this_stat; + + /* Statistics of the busiest group */ + struct sg_lb_stats busiest_stat; +}; + /** * get_sd_load_idx - Obtain the load index for a given sched domain. * @sd: The sched_domain whose load_idx is to be obtained. 
@@ -4556,7 +4545,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, struct sched_group *sg, struct sg_lb_stats *sgs) { - if (sgs-avg_load = sds-max_load) + if (sgs-avg_load = sds-busiest_stat.avg_load) return false; if (sgs-sum_nr_running sgs-group_capacity) @@ -4593,7 +4582,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, { struct sched_domain *child = env-sd-child; struct sched_group *sg = env-sd-groups; - struct sg_lb_stats sgs; + struct sg_lb_stats tmp_sgs; int load_idx, prefer_sibling = 0; if (child child-flags SD_PREFER_SIBLING) @@ -4603,13 +4592,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, do { int local_group; + struct sg_lb_stats *sgs; local_group = cpumask_test_cpu(env-dst_cpu, sched_group_cpus(sg)); - memset(sgs, 0, sizeof(sgs)); - update_sg_lb_stats(env, sg, load_idx, local_group, sgs); - - sds-total_load += sgs.group_load; - sds-total_pwr += sg-sgp-power; + if (local_group) + sgs = sds-this_stat; + else { + sgs = tmp_sgs; + memset(sgs, 0, sizeof(*sgs)); + } + update_sg_lb_stats(env, sg, load_idx, local_group, sgs); /* * In case the child domain prefers tasks go to siblings @@ -4621,26 +4613,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, * heaviest group when it is already under-utilized (possible * with a large weight task
[PATCH v2 mmotm 3/3] swap: clean-up #ifdef in page_mapping()
PageSwapCache() is always false when !CONFIG_SWAP, so compiler properly discard related code. Therefore, we don't need #ifdef explicitly. Acked-by: Johannes Weiner han...@cmpxchg.org Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/include/linux/swap.h b/include/linux/swap.h index 24db914..c03c139 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -447,6 +447,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout) #else /* CONFIG_SWAP */ +#define swap_address_space(entry) (NULL) #define get_nr_swap_pages()0L #define total_swap_pages 0L #define total_swapcache_pages()0UL diff --git a/mm/util.c b/mm/util.c index 7441c41..eaf63fc2 100644 --- a/mm/util.c +++ b/mm/util.c @@ -388,15 +388,12 @@ struct address_space *page_mapping(struct page *page) struct address_space *mapping = page-mapping; VM_BUG_ON(PageSlab(page)); -#ifdef CONFIG_SWAP if (unlikely(PageSwapCache(page))) { swp_entry_t entry; entry.val = page_private(page); mapping = swap_address_space(entry); - } else -#endif - if ((unsigned long)mapping PAGE_MAPPING_ANON) + } else if ((unsigned long)mapping PAGE_MAPPING_ANON) mapping = NULL; return mapping; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 mmotm 1/3] mm, page_alloc: add unlikely macro to help compiler optimization
We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the slow path. To help compiler optimization, add the unlikely macro to the ALLOC_NO_WATERMARKS check. This patch doesn't have any effect now, because gcc already optimizes this properly. But we cannot assume that gcc always gets it right, and nobody re-evaluates whether gcc still optimizes properly after their changes; for example, it is not optimized properly on v3.10. So adding a compiler hint here is reasonable. Acked-by: Johannes Weiner han...@cmpxchg.org Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f5c549c..04bec49 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1901,7 +1901,7 @@ zonelist_scan: !cpuset_zone_allowed_softwall(zone, gfp_mask)) continue; BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); - if (alloc_flags & ALLOC_NO_WATERMARKS) + if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) goto try_this_zone; /* * Distribute pages in proportion to the individual -- 1.7.9.5
[PATCH v3 2/3] sched: factor out code to should_we_balance()
Now checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking at update_sg_lb_stats(), except saving one iteration for sched_group_cpus. In this patch, I factor out this checking to should_we_balance() function. And before doing actual work for load_balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quit the work immediately. With this change, we can save two memset cost and can expect better compiler optimization. Below is result of this patch. * Vanilla * textdata bss dec hex filename 344991136 116 357518ba7 kernel/sched/fair.o * Patched * textdata bss dec hex filename 342431136 116 354958aa7 kernel/sched/fair.o In addition, rename @balance to @should_balance in order to represent its purpose more clearly. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52898dc..c6732d2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4463,22 +4463,17 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) * @group: sched_group whose statistics are to be updated. * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. - * @balance: Should we balance. * @sgs: variable to hold the statistics for this group. */ static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, - int local_group, int *balance, struct sg_lb_stats *sgs) + int local_group, struct sg_lb_stats *sgs) { unsigned long nr_running, max_nr_running, min_nr_running; unsigned long load, max_cpu_load, min_cpu_load; - unsigned int balance_cpu = -1, first_idle_cpu = 0; unsigned long avg_load_per_task = 0; int i; - if (local_group) - balance_cpu = group_balance_cpu(group); - /* Tally up the load of all CPUs in the group */ max_cpu_load = 0; min_cpu_load = ~0UL; @@ -4491,15 +4486,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, nr_running = rq-nr_running; /* Bias balancing toward cpus of our domain */ - if (local_group) { - if (idle_cpu(i) !first_idle_cpu - cpumask_test_cpu(i, sched_group_mask(group))) { - first_idle_cpu = 1; - balance_cpu = i; - } - + if (local_group) load = target_load(i, load_idx); - } else { + else { load = source_load(i, load_idx); if (load max_cpu_load) max_cpu_load = load; @@ -4519,22 +4508,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs-idle_cpus++; } - /* -* First idle cpu or the first cpu(busiest) in this sched group -* is eligible for doing load balancing at this and above -* domains. In the newly idle case, we will allow all the cpu's -* to do the newly idle load balance. -*/ - if (local_group) { - if (env-idle != CPU_NEWLY_IDLE) { - if (balance_cpu != env-dst_cpu) { - *balance = 0; - return; - } - update_group_power(env-sd, env-dst_cpu); - } else if (time_after_eq(jiffies, group-sgp-next_update)) - update_group_power(env-sd, env-dst_cpu); - } + if (local_group (env-idle != CPU_NEWLY_IDLE || + time_after_eq(jiffies, group-sgp-next_update))) + update_group_power(env-sd, env-dst_cpu); /* Adjust by relative CPU power of the group */ sgs-avg_load = (sgs-group_load*SCHED_POWER_SCALE) / group-sgp-power; @@ -4613,7 +4589,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * @sds: variable to hold the statistics for this sched_domain. 
*/ static inline void update_sd_lb_stats(struct lb_env *env, - int *balance, struct sd_lb_stats *sds) + struct sd_lb_stats *sds) { struct sched_domain *child = env-sd-child; struct sched_group *sg = env-sd-groups; @@ -4630,10 +4606,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, local_group = cpumask_test_cpu(env-dst_cpu, sched_group_cpus(sg)); memset(sgs, 0, sizeof(sgs)); - update_sg_lb_stats(env, sg
[PATCH v2 mmotm 2/3] mm: move pgtable related functions to right place
pgtable related functions are mostly in pgtable-generic.c. So move remaining functions from memory.c to pgtable-generic.c. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/memory.c b/mm/memory.c index f2ab2a8..8fd4d42 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -374,30 +374,6 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ /* - * If a p?d_bad entry is found while walking page tables, report - * the error, before resetting entry to p?d_none. Usually (but - * very seldom) called out from the p?d_none_or_clear_bad macros. - */ - -void pgd_clear_bad(pgd_t *pgd) -{ - pgd_ERROR(*pgd); - pgd_clear(pgd); -} - -void pud_clear_bad(pud_t *pud) -{ - pud_ERROR(*pud); - pud_clear(pud); -} - -void pmd_clear_bad(pmd_t *pmd) -{ - pmd_ERROR(*pmd); - pmd_clear(pmd); -} - -/* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. */ diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index e1a6e4f..3929a40 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -10,6 +10,30 @@ #include asm/tlb.h #include asm-generic/pgtable.h +/* + * If a p?d_bad entry is found while walking page tables, report + * the error, before resetting entry to p?d_none. Usually (but + * very seldom) called out from the p?d_none_or_clear_bad macros. + */ + +void pgd_clear_bad(pgd_t *pgd) +{ + pgd_ERROR(*pgd); + pgd_clear(pgd); +} + +void pud_clear_bad(pud_t *pud) +{ + pud_ERROR(*pud); + pud_clear(pud); +} + +void pmd_clear_bad(pmd_t *pmd) +{ + pmd_ERROR(*pmd); + pmd_clear(pmd); +} + #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* * Only sets the access flags (dirty, accessed), as well as write -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/3] sched: remove one division operation in find_busiest_queue()
Remove one division operation in find_busiest_queue(). Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9565645..52898dc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4968,7 +4968,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, struct sched_group *group) { struct rq *busiest = NULL, *rq; - unsigned long max_load = 0; + unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE; int i; for_each_cpu(i, sched_group_cpus(group)) { @@ -4999,10 +4999,9 @@ static struct rq *find_busiest_queue(struct lb_env *env, * the load can be moved away from the cpu that is potentially * running at a lower capacity. */ - wl = (wl * SCHED_POWER_SCALE) / power; - - if (wl > max_load) { - max_load = wl; + if (wl * busiest_power > busiest_load * power) { + busiest_load = wl; + busiest_power = power; busiest = rq; } } -- 1.7.9.5
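The change above trades one division per candidate CPU for a cross-multiplied comparison. A small user-space check (the loads and powers are made-up numbers) showing that the two forms pick the same CPU here, while the cross-multiplied form also avoids the truncation of the integer division:

#include <stdio.h>

#define SCHED_POWER_SCALE 1024UL

int main(void)
{
	unsigned long wl[] = { 3000, 2900, 3100 };	/* weighted cpu loads */
	unsigned long power[] = { 1024, 900, 1100 };	/* cpu powers */
	unsigned long max_load = 0;
	unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
	int old_pick = -1, new_pick = -1, i;

	for (i = 0; i < 3; i++) {
		/* old form: scale with a division, then compare */
		unsigned long scaled = wl[i] * SCHED_POWER_SCALE / power[i];

		if (scaled > max_load) {
			max_load = scaled;
			old_pick = i;
		}

		/* new form: cross-multiply, no division needed */
		if (wl[i] * busiest_power > busiest_load * power[i]) {
			busiest_load = wl[i];
			busiest_power = power[i];
			new_pick = i;
		}
	}
	printf("old form picks cpu %d, new form picks cpu %d\n", old_pick, new_pick);
	return 0;
}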
[PATCH] mm, page_alloc: optimize batch count in free_pcppages_bulk()
If we use a division operation, we can compute a batch count more closed to ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iteration. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 26ab229..7f145cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -634,53 +634,76 @@ static inline int free_pages_check(struct page *page) static void free_pcppages_bulk(struct zone *zone, int count, struct per_cpu_pages *pcp) { - int migratetype = 0; - int batch_free = 0; - int to_free = count; + struct list_head *list; + int batch_free; + int mt; + int nr_list; + bool all = false; + + if (pcp-count == count) + all = true; spin_lock(zone-lock); zone-all_unreclaimable = 0; zone-pages_scanned = 0; - while (to_free) { - struct page *page; - struct list_head *list; +redo: + /* Count non-empty list */ + nr_list = 0; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + list = pcp-lists[mt]; + if (!list_empty(list)) + nr_list++; + } - /* -* Remove pages from lists in a round-robin fashion. A -* batch_free count is maintained that is incremented when an -* empty list is encountered. This is so more pages are freed -* off fuller lists instead of spinning excessively around empty -* lists -*/ - do { - batch_free++; - if (++migratetype == MIGRATE_PCPTYPES) - migratetype = 0; - list = pcp-lists[migratetype]; - } while (list_empty(list)); + /* +* If there is only one non-empty list, free them all. +* Otherwise, remove pages from lists in a round-robin fashion. +* batch_free is set to remove at least one list. +*/ + if (all || nr_list == 1) + batch_free = count; + else if (count = nr_list) + batch_free = 1; + else + batch_free = count / nr_list; - /* This is the only non-empty list. Free them all. */ - if (batch_free == MIGRATE_PCPTYPES) - batch_free = to_free; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + struct page *page; + int i, page_mt; - do { - int mt; /* migratetype of the to-be-freed page */ + list = pcp-lists[mt]; + for (i = 0; i batch_free; i++) { + if (list_empty(list)) + break; + + count--; page = list_entry(list-prev, struct page, lru); + /* must delete as __free_one_page list manipulates */ list_del(page-lru); - mt = get_freepage_migratetype(page); + page_mt = get_freepage_migratetype(page); /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ - __free_one_page(page, zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); - if (likely(!is_migrate_isolate_page(page))) { - __mod_zone_page_state(zone, NR_FREE_PAGES, 1); - if (is_migrate_cma(mt)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1); - } - } while (--to_free --batch_free !list_empty(list)); + __free_one_page(page, zone, 0, page_mt); + trace_mm_page_pcpu_drain(page, 0, page_mt); + + if (unlikely(is_migrate_isolate_page(page))) + continue; + + __mod_zone_page_state(zone, NR_FREE_PAGES, 1); + if (is_migrate_cma(page_mt)) + __mod_zone_page_state(zone, + NR_FREE_CMA_PAGES, 1); + } + + if (!count) + break; } + + if (count) + goto redo; + spin_unlock(zone-lock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
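As a worked example of the batch computation, here is a tiny user-space sketch (the list lengths are made up; MIGRATE_PCPTYPES is 3 on the kernels discussed here). With batch_free = count / nr_list, the whole request is satisfied within a few passes over the lists:

#include <stdio.h>

#define MIGRATE_PCPTYPES 3

int main(void)
{
	int list_len[MIGRATE_PCPTYPES] = { 12, 1, 18 };	/* pages on each pcp list */
	int count = 20;					/* pages we want to free */
	int passes = 0;

	while (count) {
		int mt, nr_list = 0, batch_free;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (list_len[mt])
				nr_list++;

		if (nr_list == 1)
			batch_free = count;		/* only one non-empty list: free them all */
		else if (count <= nr_list)
			batch_free = 1;
		else
			batch_free = count / nr_list;	/* one division per pass */

		for (mt = 0; mt < MIGRATE_PCPTYPES && count; mt++) {
			int take = batch_free;

			if (take > list_len[mt])
				take = list_len[mt];
			if (take > count)
				take = count;
			list_len[mt] -= take;
			count -= take;
		}
		passes++;
	}
	printf("freed everything in %d pass(es)\n", passes);
	return 0;
}

The existing round-robin scheme frees only a few pages from each list per round, so it typically needs more iterations to drain the same number of pages.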
[PATCH 4/4] mm, page_alloc: optimize batch count in free_pcppages_bulk()
If we use a division operation, we can compute a batch count more closed to ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iteration. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 26ab229..7f145cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -634,53 +634,76 @@ static inline int free_pages_check(struct page *page) static void free_pcppages_bulk(struct zone *zone, int count, struct per_cpu_pages *pcp) { - int migratetype = 0; - int batch_free = 0; - int to_free = count; + struct list_head *list; + int batch_free; + int mt; + int nr_list; + bool all = false; + + if (pcp-count == count) + all = true; spin_lock(zone-lock); zone-all_unreclaimable = 0; zone-pages_scanned = 0; - while (to_free) { - struct page *page; - struct list_head *list; +redo: + /* Count non-empty list */ + nr_list = 0; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + list = pcp-lists[mt]; + if (!list_empty(list)) + nr_list++; + } - /* -* Remove pages from lists in a round-robin fashion. A -* batch_free count is maintained that is incremented when an -* empty list is encountered. This is so more pages are freed -* off fuller lists instead of spinning excessively around empty -* lists -*/ - do { - batch_free++; - if (++migratetype == MIGRATE_PCPTYPES) - migratetype = 0; - list = pcp-lists[migratetype]; - } while (list_empty(list)); + /* +* If there is only one non-empty list, free them all. +* Otherwise, remove pages from lists in a round-robin fashion. +* batch_free is set to remove at least one list. +*/ + if (all || nr_list == 1) + batch_free = count; + else if (count = nr_list) + batch_free = 1; + else + batch_free = count / nr_list; - /* This is the only non-empty list. Free them all. */ - if (batch_free == MIGRATE_PCPTYPES) - batch_free = to_free; + for (mt = 0; mt MIGRATE_PCPTYPES; mt++) { + struct page *page; + int i, page_mt; - do { - int mt; /* migratetype of the to-be-freed page */ + list = pcp-lists[mt]; + for (i = 0; i batch_free; i++) { + if (list_empty(list)) + break; + + count--; page = list_entry(list-prev, struct page, lru); + /* must delete as __free_one_page list manipulates */ list_del(page-lru); - mt = get_freepage_migratetype(page); + page_mt = get_freepage_migratetype(page); /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */ - __free_one_page(page, zone, 0, mt); - trace_mm_page_pcpu_drain(page, 0, mt); - if (likely(!is_migrate_isolate_page(page))) { - __mod_zone_page_state(zone, NR_FREE_PAGES, 1); - if (is_migrate_cma(mt)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1); - } - } while (--to_free --batch_free !list_empty(list)); + __free_one_page(page, zone, 0, page_mt); + trace_mm_page_pcpu_drain(page, 0, page_mt); + + if (unlikely(is_migrate_isolate_page(page))) + continue; + + __mod_zone_page_state(zone, NR_FREE_PAGES, 1); + if (is_migrate_cma(page_mt)) + __mod_zone_page_state(zone, + NR_FREE_CMA_PAGES, 1); + } + + if (!count) + break; } + + if (count) + goto redo; + spin_unlock(zone-lock); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] mm, rmap: minimize lock hold when unlink_anon_vmas
Currently, we free the avc objects with holding a lock. To minimize lock hold time, we just move the avc objects to another list with holding a lock. Then, iterate them and free objects without holding a lock. This makes lock hold time minimized. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index 1603f64..9cfb282 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -330,6 +330,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) { struct anon_vma_chain *avc, *next; struct anon_vma *root = NULL; + LIST_HEAD(avc_list); /* * Unlink each anon_vma chained to the VMA. This list is ordered @@ -348,10 +349,14 @@ void unlink_anon_vmas(struct vm_area_struct *vma) if (RB_EMPTY_ROOT(anon_vma-rb_root)) continue; + list_move(avc-same_vma, avc_list); + } + unlock_anon_vma_root(root); + + list_for_each_entry_safe(avc, next, avc_list, same_vma) { list_del(avc-same_vma); anon_vma_chain_free(avc); } - unlock_anon_vma_root(root); /* * Iterate the list once more, it now only contains empty and unlinked @@ -363,7 +368,6 @@ void unlink_anon_vmas(struct vm_area_struct *vma) put_anon_vma(anon_vma); - list_del(avc-same_vma); anon_vma_chain_free(avc); } } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] mm, rmap: do easy-job first in anon_vma_fork
If we fail due to some errorous situation, it is better to quit without doing heavy work. So changing order of execution. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index a149e3a..c2f51cb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -278,19 +278,19 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) if (!pvma-anon_vma) return 0; + /* First, allocate required objects */ + avc = anon_vma_chain_alloc(GFP_KERNEL); + if (!avc) + goto out_error; + anon_vma = anon_vma_alloc(); + if (!anon_vma) + goto out_error_free_avc; + /* -* First, attach the new VMA to the parent VMA's anon_vmas, +* Then attach the new VMA to the parent VMA's anon_vmas, * so rmap can find non-COWed pages in child processes. */ if (anon_vma_clone(vma, pvma)) - return -ENOMEM; - - /* Then add our own anon_vma. */ - anon_vma = anon_vma_alloc(); - if (!anon_vma) - goto out_error; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) goto out_error_free_anon_vma; /* @@ -312,10 +312,11 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) return 0; - out_error_free_anon_vma: +out_error_free_anon_vma: put_anon_vma(anon_vma); - out_error: - unlink_anon_vmas(vma); +out_error_free_avc: + anon_vma_chain_free(avc); +out_error: return -ENOMEM; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] mm, rmap: allocate anon_vma_chain before starting to link anon_vma_chain
If we allocate anon_vma_chain before starting to link, we can reduce the lock hold time. This patch implement it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/rmap.c b/mm/rmap.c index c2f51cb..1603f64 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -240,18 +240,21 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; struct anon_vma *root = NULL; + LIST_HEAD(avc_list); + + list_for_each_entry(pavc, src-anon_vma_chain, same_vma) { + avc = anon_vma_chain_alloc(GFP_KERNEL); + if (unlikely(!avc)) + goto enomem_failure; + + list_add_tail(avc-same_vma, avc_list); + } list_for_each_entry_reverse(pavc, src-anon_vma_chain, same_vma) { struct anon_vma *anon_vma; - avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN); - if (unlikely(!avc)) { - unlock_anon_vma_root(root); - root = NULL; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) - goto enomem_failure; - } + avc = list_entry((avc_list)-next, typeof(*avc), same_vma); + list_del(avc-same_vma); anon_vma = pavc-anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); @@ -259,8 +262,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) unlock_anon_vma_root(root); return 0; - enomem_failure: - unlink_anon_vmas(dst); +enomem_failure: + list_for_each_entry_safe(avc, pavc, avc_list, same_vma) { + list_del(avc-same_vma); + anon_vma_chain_free(avc); + } return -ENOMEM; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
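The underlying pattern, allocate every chain object up front and keep only the linking inside the locked region, is a general one. A small pthread-based sketch of the same idea (the names are illustrative; this is not the kernel code):

/* build with: gcc -O2 -pthread prealloc.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	int val;
	struct node *next;
};

static struct node *shared_list;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static int add_many(int n)
{
	struct node *batch = NULL, *cur;
	int i;

	/* allocate everything before taking the lock, so the critical
	 * section below only does cheap pointer manipulation */
	for (i = 0; i < n; i++) {
		cur = malloc(sizeof(*cur));
		if (!cur)
			goto enomem;	/* nothing published yet, unwinding is easy */
		cur->val = i;
		cur->next = batch;
		batch = cur;
	}

	pthread_mutex_lock(&shared_lock);
	while (batch) {
		cur = batch;
		batch = batch->next;
		cur->next = shared_list;
		shared_list = cur;	/* only the linking happens under the lock */
	}
	pthread_mutex_unlock(&shared_lock);
	return 0;

enomem:
	while (batch) {
		cur = batch;
		batch = batch->next;
		free(cur);
	}
	return -1;
}

int main(void)
{
	if (add_many(5) == 0)
		printf("added 5 nodes with one short critical section\n");
	return 0;
}

The error path also becomes simpler: as in the patch, a failed allocation only has to free the not-yet-linked objects instead of unwinding partially linked state.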
Re: [PATCH 4/4] mm, page_alloc: optimize batch count in free_pcppages_bulk()
On Tue, Aug 06, 2013 at 05:43:40PM +0900, Joonsoo Kim wrote: If we use a division operation, we can compute a batch count closer to the ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iterations. In addition, batching to free more pages may be helpful to cache usage. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com Oops... Sorry. Please ignore this one. This patch was already submitted a few seconds ago :) Thanks.
Re: [PATCH 1/4] mm, rmap: do easy-job first in anon_vma_fork
Hello, Johannes. On Tue, Aug 06, 2013 at 08:58:54AM -0400, Johannes Weiner wrote: if (anon_vma_clone(vma, pvma)) - return -ENOMEM; - - /* Then add our own anon_vma. */ - anon_vma = anon_vma_alloc(); - if (!anon_vma) - goto out_error; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) goto out_error_free_anon_vma; Which heavy work? anon_vma_clone() is anon_vma_chain_alloc() in a loop. Optimizing error paths only makes sense if they are common and you actually could save something by reordering. This matches neither. Yes, you are right. I drop this one. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] mm, rmap: allocate anon_vma_chain before starting to link anon_vma_chain
On Wed, Aug 07, 2013 at 02:08:03AM -0400, Johannes Weiner wrote: list_for_each_entry_reverse(pavc, src-anon_vma_chain, same_vma) { struct anon_vma *anon_vma; - avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN); - if (unlikely(!avc)) { - unlock_anon_vma_root(root); - root = NULL; - avc = anon_vma_chain_alloc(GFP_KERNEL); - if (!avc) - goto enomem_failure; - } + avc = list_entry((avc_list)-next, typeof(*avc), same_vma); list_first_entry() please Okay. I will send next version soon. + list_del(avc-same_vma); anon_vma = pavc-anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); @@ -259,8 +262,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) unlock_anon_vma_root(root); return 0; - enomem_failure: - unlink_anon_vmas(dst); +enomem_failure: + list_for_each_entry_safe(avc, pavc, avc_list, same_vma) { + list_del(avc-same_vma); + anon_vma_chain_free(avc); + } return -ENOMEM; } Otherwise, looks good. Thank you! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] mm, rmap: minimize lock hold when unlink_anon_vmas
On Wed, Aug 07, 2013 at 02:11:38AM -0400, Johannes Weiner wrote: On Tue, Aug 06, 2013 at 05:43:39PM +0900, Joonsoo Kim wrote: Currently, we free the avc objects with holding a lock. To minimize lock hold time, we just move the avc objects to another list with holding a lock. Then, iterate them and free objects without holding a lock. This makes lock hold time minimized. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com I expect kfree() to be fast enough that we don't think we need to bother batching this outside the lock. Okay! Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
On Tue, Aug 06, 2013 at 06:38:49PM -0700, Davidlohr Bueso wrote: On Wed, 2013-08-07 at 11:03 +1000, David Gibson wrote: On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote: On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote: Any mapping that doesn't use the reserved pool, not just MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping, then fork()s, and then the mapping is instantiated in the child, that will not draw from the reserved pool. Should we guarantee that they can allocate the last hugepage? They map a region with MAP_NORESERVE, so they shouldn't assume that their requests always succeed. If the pages are available, people get cranky if it fails for no apparent reason, MAP_NORESERVE or not. They get especially cranky if it sometimes fails and sometimes doesn't due to a race condition. Hello, Hmm... Okay. I will try to implement another way to protect against the race condition. Maybe it is best to use a mutex table :) Anyway, please give me some time, guys. Thank you very much for pointing that out. So another option is to take the mutex table patchset for now as it *does* improve things a great deal, then, when ready, get rid of the instantiation lock all together. We still don't have a solid proposal for doing that. Joonsoo Kim's patchset misses cases (non reserved mappings). I'm also not certain there aren't a few edge cases which can lead to even reserved mappings failing, and if that happens the patches will lead to livelock. Exactly, which is why I suggest minimizing the lock contention until we do have such a proposal. Okay. My proposal is not complete and maybe much time is needed. And I'm not sure that my *retry* approach can eventually cover all the race situations at the moment. If you are in a hurry, I don't have a strong objection to your patches, but, IMHO, we should go slowly, because it is not just a trivial change. Hugetlb code is too subtle, so it is hard to confirm its solidness. The following is the race problem I found with those patches. I assume that nr_free_hugepage is 2. 1. parent process maps a 1-hugepage-sized region with MAP_PRIVATE 2. parent process writes something to this region, so a fault occurs. 3. fault handling. 4. fork 5. parent process writes something to this hugepage, so a cow-fault occurs. 6. while the parent allocates a new page and does copy_user_huge_page() in the fault handler, the child process writes something to this hugepage, so a cow-fault occurs. This access is not protected by the table mutex, because the mm is different. 7. child process dies, because there is no free hugepage. If we had no race, the child process would not die, because all we need is only 2 hugepages, one for the parent, and the other for the child. Thanks.
Re: [PATCH 0/2] hugepage: optimize page fault path locking
On Tue, Aug 06, 2013 at 05:08:04PM -0700, Davidlohr Bueso wrote: On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote: On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote: This patchset attempts to reduce the amount of contention we impose on the hugetlb_instantiation_mutex by replacing the global mutex with a table of mutexes, selected based on a hash. The original discussion can be found here: http://lkml.org/lkml/2013/7/12/428 Hello, Davidlohr. I recently sent a patchset which removes the hugetlb_instantiation_mutex entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex'). This patchset can be found here: https://lkml.org/lkml/2013/7/29/54 If possible, could you review it and test whether your problem disappears with it or not? This patchset applies on top of https://lkml.org/lkml/2013/7/22/96 [PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix, right? AFAIK those changes are the ones Andrew picked up a few weeks ago and are now in linux-next, right? I was able to apply those just fine, but couldn't apply your 'remove a hugetlb_instantiation_mutex' series (IIRC patch 1/18 failed). I guess you'll send out a v2 anyway, so I'll wait until then. In any case I'm not seeing an actual performance issue with the hugetlb_instantiation_mutex; all I noticed was that under large DB workloads that make use of hugepages, such as Oracle, this lock becomes quite hot during the first few minutes of startup, which makes sense since it is contended in the fault path. So I'll try out your patches, but, in this particular case, I just cannot compare the with-lock vs without-lock situations. Okay. I just want to know whether lock contention is reduced by my patches in the first few minutes of startup. I will send v2 soon. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
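For readers following along, the "table of mutexes, selected based on a hash" idea mentioned above looks roughly like the sketch below. This is an illustration under assumptions, not the actual patchset: the table size, the jhash2-based hash over the mapping pointer and page index, and the helper names are all invented here.

#include <linux/init.h>
#include <linux/jhash.h>
#include <linux/mutex.h>
#include <linux/pagemap.h>

#define NR_FAULT_MUTEXES 64			/* illustrative table size */

static struct mutex fault_mutex_table[NR_FAULT_MUTEXES];

static void __init fault_mutex_table_init(void)
{
	int i;

	for (i = 0; i < NR_FAULT_MUTEXES; i++)
		mutex_init(&fault_mutex_table[i]);
}

/* Faults on the same page hash to the same mutex; others can proceed. */
static u32 fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
{
	unsigned long key[2] = { (unsigned long)mapping, idx };

	return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) %
		NR_FAULT_MUTEXES;
}

static int handle_fault_serialized(struct address_space *mapping, pgoff_t idx)
{
	u32 hash = fault_mutex_hash(mapping, idx);
	int ret = 0;

	mutex_lock(&fault_mutex_table[hash]);
	/* ... do the actual hugetlb fault handling here ... */
	mutex_unlock(&fault_mutex_table[hash]);

	return ret;
}

The point of the design is that the serialization that prevents spurious allocation failures is kept, but only for faults that actually race on the same page of the same mapping.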
Re: [PATCH] mm, page_alloc: optimize batch count in free_pcppages_bulk()
Hello, Andrew. 2013/8/7 Andrew Morton a...@linux-foundation.org: On Tue, 6 Aug 2013 17:40:40 +0900 Joonsoo Kim iamjoonsoo@lge.com wrote: If we use a division operation, we can compute a batch count closer to the ideal value. With this value, we can finish our job within MIGRATE_PCPTYPES iterations. In addition, batching to free more pages may help cache usage. hm, maybe. The .text got 120 bytes larger so the code now will eject two of someone else's cachelines, which can't be good. I need more convincing, please ;) (bss got larger too - I don't have a clue why this happens). In my testing, it makes .text just 64 bytes larger. I don't think I can avoid such a small size increase. The current round-robin freeing algorithm accesses 'struct page' in a fairly random order, because it changes the migrate type and list on every iteration, and pages on different lists may be far from each other. If we free in larger batches, we are more likely to access adjacent 'struct page's than before, so I think this is cache-friendly. But this is just a theoretical argument, so I'm not sure whether it is useful or not :) Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
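The batch-count idea being debated is roughly the following (an illustrative sketch only, not the actual free_pcppages_bulk() patch; next_batch() is a made-up helper): instead of freeing a small fixed batch from each migratetype list per round-robin pass, divide the remaining count by the number of lists so the whole job finishes within MIGRATE_PCPTYPES passes.

/* Number of per-cpu page list types in that era's mm code. */
#define MIGRATE_PCPTYPES	3

/*
 * Pick how many pages to free from the current list this pass.
 * Rounding up guarantees that after MIGRATE_PCPTYPES passes the
 * whole 'to_free' count has been handed back to the buddy allocator.
 */
static int next_batch(int to_free, int lists_left)
{
	return (to_free + lists_left - 1) / lists_left;
}

A caller would decrement to_free by the number of pages actually freed from the current list and move to the next migratetype until to_free reaches zero, which is where the larger, more cache-friendly batches come from.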
[PATCH v3 8/9] mm, hugetlb: remove decrement_hugepage_resv_vma()
Now the check conditions of decrement_hugepage_resv_vma() and vma_has_reserves() are the same, so we can clean up this function using vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only one call site, so we can remove the function and embed it into dequeue_huge_page_vma() directly. This patch implements that. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ca15854..4b1b043 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -434,25 +434,6 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
 	return (get_vma_private_data(vma) & flag) != 0;
 }
 
-/* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct hstate *h,
-			struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & VM_NORESERVE)
-		return;
-
-	if (vma->vm_flags & VM_MAYSHARE) {
-		/* Shared mappings always use reserves */
-		h->resv_huge_pages--;
-	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
-		/*
-		 * Only the process that called mmap() has reserves for
-		 * private mappings.
-		 */
-		h->resv_huge_pages--;
-	}
-}
-
 /* Reset counters to 0 and clear all HPAGE_RESV_* flags */
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
@@ -466,10 +447,18 @@ static int vma_has_reserves(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_NORESERVE)
 		return 0;
+
+	/* Shared mappings always use reserves */
 	if (vma->vm_flags & VM_MAYSHARE)
 		return 1;
+
+	/*
+	 * Only the process that called mmap() has reserves for
+	 * private mappings.
+	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return 1;
+
 	return 0;
 }
 
@@ -564,8 +553,8 @@ retry_cpuset:
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
-				if (!avoid_reserve)
-					decrement_hugepage_resv_vma(h, vma);
+				if (!avoid_reserve && vma_has_reserves(vma))
+					h->resv_huge_pages--;
 				break;
 			}
 		}
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/9] mm, hugetlb: move up the code which check availability of free huge page
At this point we are holding hugetlb_lock, so the hstate values can't change. If we don't have any usable free huge page at this point, there is no need to proceed with the processing, so move this check up. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e2bfbf7..fc4988c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -539,10 +539,6 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	struct zoneref *z;
 	unsigned int cpuset_mems_cookie;
 
-retry_cpuset:
-	cpuset_mems_cookie = get_mems_allowed();
-	zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask, &mpol, &nodemask);
 	/*
 	 * A child process with MAP_PRIVATE mappings created by their parent
 	 * have no page reserves. This check ensures that reservations are
@@ -556,6 +552,11 @@ retry_cpuset:
 	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
 		goto err;
 
+retry_cpuset:
+	cpuset_mems_cookie = get_mems_allowed();
+	zonelist = huge_zonelist(vma, address,
+					htlb_alloc_mask, &mpol, &nodemask);
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
@@ -574,7 +575,6 @@ retry_cpuset:
 	return page;
 
 err:
-	mpol_cond_put(mpol);
 	return NULL;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 9/9] mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache
If a vma with VM_NORESERVE allocate a new page for page cache, we should check whether this area is reserved or not. If this address is already reserved by other process(in case of chg == 0), we should decrement reserve count, because this allocated page will go into page cache and currently, there is no way to know that this page comes from reserved pool or not when releasing inode. This may introduce over-counting problem to reserved count. With following example code, you can easily reproduce this situation. Assume 2MB, nr_hugepages = 100 size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } flag = MAP_SHARED | MAP_NORESERVE; q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (q == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } q[0] = 'c'; After finish the program, run 'cat /proc/meminfo'. You can see below result. HugePages_Free: 100 HugePages_Rsvd:1 To fix this, we should check our mapping type and tracked region. If our mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this imply that current allocated page will go into page cache which is already reserved region when mapping is created. In this case, we should decrease reserve count. As implementing above, this patch solve the problem. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4b1b043..b3b8252 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -443,10 +443,23 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma) } /* Returns true if the VMA has associated reserve pages */ -static int vma_has_reserves(struct vm_area_struct *vma) +static int vma_has_reserves(struct vm_area_struct *vma, long chg) { - if (vma-vm_flags VM_NORESERVE) - return 0; + if (vma-vm_flags VM_NORESERVE) { + /* +* This address is already reserved by other process(chg == 0), +* so, we should decreament reserved count. Without +* decreamenting, reserve count is remained after releasing +* inode, because this allocated page will go into page cache +* and is regarded as coming from reserved pool in releasing +* step. Currently, we don't have any other solution to deal +* with this situation properly, so add work-around here. +*/ + if (vma-vm_flags VM_MAYSHARE chg == 0) + return 1; + else + return 0; + } /* Shared mappings always use reserves */ if (vma-vm_flags VM_MAYSHARE) @@ -520,7 +533,8 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid) static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, - unsigned long address, int avoid_reserve) + unsigned long address, int avoid_reserve, + long chg) { struct page *page = NULL; struct mempolicy *mpol; @@ -535,7 +549,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, * have no page reserves. This check ensures that reservations are * not stolen. 
The child may still get SIGKILLed */ - if (!vma_has_reserves(vma) + if (!vma_has_reserves(vma, chg) h-free_huge_pages - h-resv_huge_pages == 0) goto err; @@ -553,8 +567,12 @@ retry_cpuset: if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) { page = dequeue_huge_page_node(h, zone_to_nid(zone)); if (page) { - if (!avoid_reserve vma_has_reserves(vma)) - h-resv_huge_pages--; + if (avoid_reserve) + break; + if (!vma_has_reserves(vma, chg)) + break; + + h-resv_huge_pages--; break; } } @@ -1138,7 +1156,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, return ERR_PTR(-ENOSPC); } spin_lock(hugetlb_lock); - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg); if (!page) { spin_unlock(hugetlb_lock); page = alloc_buddy_huge_page(h
[PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization
Currently, we use a page with mapped count 1 in page cache for cow optimization. If we find this condition, we don't allocate a new page and copy contents. Instead, we map this page directly. This may introduce a problem that writting to private mapping overwrite hugetlb file directly. You can find this situation with following code. size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } p[0] = 's'; fprintf(stdout, BEFORE STEAL PRIVATE WRITE: %c\n, p[0]); munmap(p, size); flag = MAP_PRIVATE; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } p[0] = 'c'; munmap(p, size); flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } fprintf(stdout, AFTER STEAL PRIVATE WRITE: %c\n, p[0]); munmap(p, size); We can see that AFTER STEAL PRIVATE WRITE: c, not AFTER STEAL PRIVATE WRITE: s. If we turn off this optimization to a page in page cache, the problem is disappeared. So, I change the trigger condition of optimization. If this page is not AnonPage, we don't do optimization. This makes this optimization turning off for a page cache. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2e52afea..1f6b3a6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2511,7 +2511,6 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, { struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; - int avoidcopy; int outside_reserve = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2521,10 +2520,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, retry_avoidcopy: /* If no-one else is actually using this page, avoid the copy * and just make the page writable */ - avoidcopy = (page_mapcount(old_page) == 1); - if (avoidcopy) { - if (PageAnon(old_page)) - page_move_anon_rmap(old_page, vma, address); + if (page_mapcount(old_page) == 1 PageAnon(old_page)) { + page_move_anon_rmap(old_page, vma, address); set_huge_ptep_writable(vma, address, ptep); return 0; } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 7/9] mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip to check reserve counting and eventually we cannot be ensured to allocate a huge page in fault time. With following example code, you can easily find this situation. Assume 2MB, nr_hugepages = 100 fd = hugetlbfs_unlinked_fd(); if (fd 0) return 1; size = 200 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); return -1; } size = 2 * MB; flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE; p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0); if (p == MAP_FAILED) { fprintf(stderr, mmap() failed: %s\n, strerror(errno)); } p[0] = '0'; sleep(10); During executing sleep(10), run 'cat /proc/meminfo' on another process. HugePages_Free: 99 HugePages_Rsvd: 100 Number of free should be higher or equal than number of reserve, but this aren't. This represent that non reserved shared mapping steal a reserved page. Non reserved shared mapping should not eat into reserve space. If we consider VM_NORESERVE in vma_has_reserve() and return 0 which mean that we don't have reserved pages, then we check that we have enough free pages in dequeue_huge_page_vma(). This prevent to steal a reserved page. With this change, above test generate a SIGBUG which is correct, because all free pages are reserved and non reserved shared mapping can't get a free page. Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1f6b3a6..ca15854 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -464,6 +464,8 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma) /* Returns true if the VMA has associated reserve pages */ static int vma_has_reserves(struct vm_area_struct *vma) { + if (vma-vm_flags VM_NORESERVE) + return 0; if (vma-vm_flags VM_MAYSHARE) return 1; if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 0/9] mm, hugetlb: clean-up and possible bug fix
The first 5 patches are almost trivial clean-up patches. The others fix three bugs. These problems are perhaps minor, because this code has been used for a long time and there are no bug reports for them. These patches are based on v3.10.0 and pass the libhugetlbfs test suite.

Changes from v2:
- No code change in any patch from v2.
- Omit patch 2 ('clean-up alloc_huge_page()') which tried to remove one goto label.
- Add more comments to the changelog, per Michal's suggestion.
- Add reviewed-by or acked-by tags.

Changes from v1:
- Split patch 1 into two patches to clarify its purpose.
- Remove useless indentation changes in 'clean-up alloc_huge_page()'.
- Fix a bug in the new iteration code.
- Add reviewed-by or acked-by tags.

Joonsoo Kim (9):
  mm, hugetlb: move up the code which check availability of free huge page
  mm, hugetlb: trivial commenting fix
  mm, hugetlb: clean-up alloc_huge_page()
  mm, hugetlb: fix and clean-up node iteration code to alloc or free
  mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()
  mm, hugetlb: do not use a page in page cache for cow optimization
  mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
  mm, hugetlb: remove decrement_hugepage_resv_vma()
  mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

 mm/hugetlb.c | 243 ++
 1 file changed, 110 insertions(+), 133 deletions(-)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 4/9] mm, hugetlb: fix and clean-up node iteration code to alloc or free
Current node iteration code have a minor problem which do one more node rotation if we can't succeed to allocate. For example, if we start to allocate at node 0, we stop to iterate at node 0. Then we start to allocate at node 1 for next allocation. I introduce new macros for_each_node_mask_to_[alloc|free] and fix and clean-up node iteration code to alloc or free. This makes code more understandable. Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 31d78c5..87d7637 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -755,33 +755,6 @@ static int hstate_next_node_to_alloc(struct hstate *h, return nid; } -static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) -{ - struct page *page; - int start_nid; - int next_nid; - int ret = 0; - - start_nid = hstate_next_node_to_alloc(h, nodes_allowed); - next_nid = start_nid; - - do { - page = alloc_fresh_huge_page_node(h, next_nid); - if (page) { - ret = 1; - break; - } - next_nid = hstate_next_node_to_alloc(h, nodes_allowed); - } while (next_nid != start_nid); - - if (ret) - count_vm_event(HTLB_BUDDY_PGALLOC); - else - count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); - - return ret; -} - /* * helper for free_pool_huge_page() - return the previously saved * node [this node] from which to free a huge page. Advance the @@ -800,6 +773,40 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) return nid; } +#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \ + for (nr_nodes = nodes_weight(*mask);\ + nr_nodes 0 \ + ((node = hstate_next_node_to_alloc(hs, mask)) || 1);\ + nr_nodes--) + +#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \ + for (nr_nodes = nodes_weight(*mask);\ + nr_nodes 0 \ + ((node = hstate_next_node_to_free(hs, mask)) || 1); \ + nr_nodes--) + +static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed) +{ + struct page *page; + int nr_nodes, node; + int ret = 0; + + for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { + page = alloc_fresh_huge_page_node(h, node); + if (page) { + ret = 1; + break; + } + } + + if (ret) + count_vm_event(HTLB_BUDDY_PGALLOC); + else + count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); + + return ret; +} + /* * Free huge page from pool from next node to free. * Attempt to keep persistent huge pages more or less @@ -809,36 +816,31 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed) static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, bool acct_surplus) { - int start_nid; - int next_nid; + int nr_nodes, node; int ret = 0; - start_nid = hstate_next_node_to_free(h, nodes_allowed); - next_nid = start_nid; - - do { + for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) { /* * If we're returning unused surplus pages, only examine * nodes with surplus pages. 
*/ - if ((!acct_surplus || h-surplus_huge_pages_node[next_nid]) - !list_empty(h-hugepage_freelists[next_nid])) { + if ((!acct_surplus || h-surplus_huge_pages_node[node]) + !list_empty(h-hugepage_freelists[node])) { struct page *page = - list_entry(h-hugepage_freelists[next_nid].next, + list_entry(h-hugepage_freelists[node].next, struct page, lru); list_del(page-lru); h-free_huge_pages--; - h-free_huge_pages_node[next_nid]--; + h-free_huge_pages_node[node]--; if (acct_surplus) { h-surplus_huge_pages--; - h-surplus_huge_pages_node[next_nid]--; + h-surplus_huge_pages_node[node]--; } update_and_free_page(h, page); ret = 1; break; } - next_nid = hstate_next_node_to_free(h, nodes_allowed); - } while
[PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock
There is a race condition if we map a same file on different processes. Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex. When we do mmap, we don't grab a hugetlb_instantiation_mutex, but, grab a mmap_sem. This doesn't prevent other process to modify region structure, so it can be modified by two processes concurrently. To solve this, I introduce a lock to resv_map and make region manipulation function grab a lock before they do actual work. This makes region tracking safe. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 2677c07..e29e28f 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -26,6 +26,7 @@ struct hugepage_subpool { struct resv_map { struct kref refs; + spinlock_t lock; struct list_head regions; }; extern struct resv_map *resv_map_alloc(void); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 24c0111..bf2ee11 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) * Region tracking -- allows tracking of reservations and instantiated pages *across the pages in a mapping. * - * The region data structures are protected by a combination of the mmap_sem - * and the hugetlb_instantiation_mutex. To access or modify a region the caller - * must either hold the mmap_sem for write, or the mmap_sem for read and - * the hugetlb_instantiation_mutex: - * - * down_write(mm-mmap_sem); - * or - * down_read(mm-mmap_sem); - * mutex_lock(hugetlb_instantiation_mutex); + * The region data structures are embedded into a resv_map and + * protected by a resv_map's lock */ struct file_region { struct list_head link; @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t) struct list_head *head = resv-regions; struct file_region *rg, *nrg, *trg; + spin_lock(resv-lock); /* Locate the region we are either in or before. */ list_for_each_entry(rg, head, link) if (f = rg-to) @@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, long t) } nrg-from = f; nrg-to = t; + spin_unlock(resv-lock); return 0; } @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t) struct file_region *rg, *nrg; long chg = 0; + spin_lock(resv-lock); /* Locate the region we are before or in. */ list_for_each_entry(rg, head, link) if (f = rg-to) @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t) * size such that we can guarantee to record the reservation. */ if (rg-link == head || t rg-from) { nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); - if (!nrg) - return -ENOMEM; + if (!nrg) { + chg = -ENOMEM; + goto out; + } + nrg-from = f; nrg-to = f; INIT_LIST_HEAD(nrg-link); list_add(nrg-link, rg-link.prev); - return t - f; + chg = t - f; + goto out; } /* Round our left edge to the current segment if it encloses us. */ @@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, long t) if (rg-link == head) break; if (rg-from t) - return chg; + goto out; /* We overlap with this area, if it extends further than * us then we must extend ourselves. Account for its @@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, long t) } chg -= rg-to - rg-from; } + +out: + spin_unlock(resv-lock); return chg; } @@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, long end) struct file_region *rg, *trg; long chg = 0; + spin_lock(resv-lock); /* Locate the region we are either in or before. 
*/ list_for_each_entry(rg, head, link) if (end = rg-to) break; if (rg-link == head) - return 0; + goto out; /* If we are in the middle of a region then adjust it. */ if (end rg-from) { @@ -265,6 +269,9 @@ static long region_truncate(struct resv_map *resv, long end) list_del(rg-link); kfree(rg); } + +out: + spin_unlock(resv-lock); return chg; } @@ -274,6 +281,7 @@ static long region_count(struct resv_map *resv, long f, long t) struct file_region *rg; long chg = 0; + spin_lock(resv-lock); /* Locate each segment we overlap
[PATCH 02/18] mm, hugetlb: change variable name reservations to resv
'reservations' is so long name as a variable and we use 'resv_map' to represent 'struct resv_map' in other place. To reduce confusion and unreadability, change it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d971233..12b6581 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h, } else { long err; pgoff_t idx = vma_hugecache_offset(h, vma, addr); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); - err = region_chg(reservations-regions, idx, idx + 1); + err = region_chg(resv-regions, idx, idx + 1); if (err 0) return err; return 0; @@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h, } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); /* Mark this page used in the map. */ - region_add(reservations-regions, idx, idx + 1); + region_add(resv-regions, idx, idx + 1); } } @@ -2168,7 +2168,7 @@ out: static void hugetlb_vm_op_open(struct vm_area_struct *vma) { - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); /* * This new VMA should share its siblings reservation map if present. @@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma) * after this open call completes. It is therefore safe to take a * new reference here without additional locking. */ - if (reservations) - kref_get(reservations-refs); + if (resv) + kref_get(resv-refs); } static void resv_map_put(struct vm_area_struct *vma) { - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); - if (!reservations) + if (!resv) return; - kref_put(reservations-refs, resv_map_release); + kref_put(resv-refs, resv_map_release); } static void hugetlb_vm_op_close(struct vm_area_struct *vma) { struct hstate *h = hstate_vma(vma); - struct resv_map *reservations = vma_resv_map(vma); + struct resv_map *resv = vma_resv_map(vma); struct hugepage_subpool *spool = subpool_vma(vma); unsigned long reserve; unsigned long start; unsigned long end; - if (reservations) { + if (resv) { start = vma_hugecache_offset(h, vma, vma-vm_start); end = vma_hugecache_offset(h, vma, vma-vm_end); reserve = (end - start) - - region_count(reservations-regions, start, end); + region_count(resv-regions, start, end); resv_map_put(vma); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 3/9] mm, hugetlb: clean-up alloc_huge_page()
This patch unifies successful allocation paths to make the code more readable. There are no functional changes. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 51564a8..31d78c5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1149,12 +1149,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, } spin_lock(hugetlb_lock); page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); - if (page) { - /* update page cgroup details */ - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), -h_cg, page); - spin_unlock(hugetlb_lock); - } else { + if (!page) { spin_unlock(hugetlb_lock); page = alloc_buddy_huge_page(h, NUMA_NO_NODE); if (!page) { @@ -1165,11 +1160,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, return ERR_PTR(-ENOSPC); } spin_lock(hugetlb_lock); - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), -h_cg, page); list_move(page-lru, h-hugepage_activelist); - spin_unlock(hugetlb_lock); + /* Fall through */ } + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); + spin_unlock(hugetlb_lock); set_page_private(page, (unsigned long)spool); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 5/9] mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()
If the list is empty, list_for_each_entry_safe() doesn't do anything, so this check is redundant. Remove it. Acked-by: Michal Hocko mho...@suse.cz Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87d7637..2e52afea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1020,11 +1020,8 @@ free:
 	spin_unlock(&hugetlb_lock);
 
 	/* Free unnecessary surplus pages to the buddy allocator */
-	if (!list_empty(&surplus_list)) {
-		list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-			put_page(page);
-		}
-	}
+	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
+		put_page(page);
 	spin_lock(&hugetlb_lock);
 
 	return ret;
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
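As a side note on why the check was redundant: list_for_each_entry_safe() expands to a for loop whose condition fails immediately on an empty list, so the body simply never runs. A minimal sketch (illustrative; drop_all() is a made-up name):

#include <linux/list.h>
#include <linux/mm.h>

static void drop_all(struct list_head *head)
{
	struct page *page, *tmp;

	/* On an empty list the loop condition is false on the first
	 * check, so no separate list_empty() test is needed. */
	list_for_each_entry_safe(page, tmp, head, lru)
		put_page(page);
}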
[PATCH 00/18] mm, hugetlb: remove a hugetlb_instantiation_mutex
Without the hugetlb_instantiation_mutex, if parallel faults occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault at the same address. This causes a reserved pool shortage just for a little while, and a faulting thread that is ensured to have enough reserved hugepages can get a SIGBUS signal. To solve this problem we already have a nice solution, namely the hugetlb_instantiation_mutex, which blocks other threads from diving into the fault handler. It solves the problem clearly, but it introduces performance degradation, because it serializes all fault handling. Now, I try to remove the hugetlb_instantiation_mutex to get rid of the performance problem reported by Davidlohr Bueso [1]. It is implemented in the following 3 steps.

Step 1. Protect region tracking via a per-region spinlock. Currently, region tracking is protected by the hugetlb_instantiation_mutex, so before removing the mutex, we should replace it with another solution.

Step 2. Decide whether we use the reserved page pool or not in a uniform way. We need graceful failure handling if there is no lock like the hugetlb_instantiation_mutex. To decide whether we need to handle a failure or not, we need to know the current status properly.

Step 3. Graceful failure handling if we fail with a reserved page or fail to allocate with use_reserve. Failure handling consists of two cases. One: if we fail while holding a reserved page, return it to the reserved pool properly; the current code doesn't recover the reserve count properly, so we need to fix it. The other: if we fail to allocate a new huge page with the use_reserve indicator set, return 0 to the fault handler instead of SIGBUS, which makes this thread retry fault handling.

With the above handling, we can handle a fault in any situation without the hugetlb_instantiation_mutex.

Patch 1: fix a minor problem
Patches 2-5: implement Step 1.
Patches 6-11: implement Step 2.
Patches 12-18: implement Step 3.

These patches are based on my previous patchset [2], which is based on v3.10. With these applied, I passed the libhugetlbfs test suite, which has allocation-instantiation race test cases. If there is something I should consider, please let me know! Thanks.

[1] http://lwn.net/Articles/558863/ [PATCH] mm/hugetlb: per-vma instantiation mutexes
[2] https://lkml.org/lkml/2013/7/22/96 [PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix

Joonsoo Kim (18):
  mm, hugetlb: protect reserved pages when softofflining requests the pages
  mm, hugetlb: change variable name reservations to resv
  mm, hugetlb: unify region structure handling
  mm, hugetlb: region manipulation functions take resv_map rather list_head
  mm, hugetlb: protect region tracking via newly introduced resv_map lock
  mm, hugetlb: remove vma_need_reservation()
  mm, hugetlb: pass has_reserve to dequeue_huge_page_vma()
  mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve
  mm, hugetlb: unify has_reserve and avoid_reserve to use_reserve
  mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()
  mm, hugetlb: move down outside_reserve check
  mm, hugetlb: remove a check for return value of alloc_huge_page()
  mm, hugetlb: grab a page_table_lock after page_cache_release
  mm, hugetlb: clean-up error handling in hugetlb_cow()
  mm, hugetlb: move up anon_vma_prepare()
  mm, hugetlb: return a reserved page to a reserved pool if failed
  mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
  mm, hugetlb: remove a hugetlb_instantiation_mutex

 fs/hugetlbfs/inode.c    |  12 +-
 include/linux/hugetlb.h |  10 ++
 mm/hugetlb.c            | 361 +--
 3 files changed, 217 insertions(+), 166 deletions(-)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/18] mm, hugetlb: region manipulation functions take resv_map rather list_head
To change a protection method for region tracking to find grained one, we pass the resv_map, instead of list_head, to region manipulation functions. This doesn't introduce any functional change, and it is just for preparing a next step. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 35f6b12..24c0111 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -150,8 +150,9 @@ struct file_region { long to; }; -static long region_add(struct list_head *head, long f, long t) +static long region_add(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg, *nrg, *trg; /* Locate the region we are either in or before. */ @@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t) return 0; } -static long region_chg(struct list_head *head, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg, *nrg; long chg = 0; @@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t) return chg; } -static long region_truncate(struct list_head *head, long end) +static long region_truncate(struct resv_map *resv, long end) { + struct list_head *head = resv-regions; struct file_region *rg, *trg; long chg = 0; @@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end) return chg; } -static long region_count(struct list_head *head, long f, long t) +static long region_count(struct resv_map *resv, long f, long t) { + struct list_head *head = resv-regions; struct file_region *rg; long chg = 0; @@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref) struct resv_map *resv_map = container_of(ref, struct resv_map, refs); /* Clear out any active regions before we release the map. */ - region_truncate(resv_map-regions, 0); + region_truncate(resv_map, 0); kfree(resv_map); } @@ -1083,7 +1087,7 @@ static long vma_needs_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = inode-i_mapping-private_data; - return region_chg(resv-regions, idx, idx + 1); + return region_chg(resv, idx, idx + 1); } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { return 1; @@ -1093,7 +1097,7 @@ static long vma_needs_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = vma_resv_map(vma); - err = region_chg(resv-regions, idx, idx + 1); + err = region_chg(resv, idx, idx + 1); if (err 0) return err; return 0; @@ -1109,14 +1113,14 @@ static void vma_commit_reservation(struct hstate *h, pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = inode-i_mapping-private_data; - region_add(resv-regions, idx, idx + 1); + region_add(resv, idx, idx + 1); } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { pgoff_t idx = vma_hugecache_offset(h, vma, addr); struct resv_map *resv = vma_resv_map(vma); /* Mark this page used in the map. 
*/ - region_add(resv-regions, idx, idx + 1); + region_add(resv, idx, idx + 1); } } @@ -2203,7 +2207,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) end = vma_hugecache_offset(h, vma, vma-vm_end); reserve = (end - start) - - region_count(resv-regions, start, end); + region_count(resv, start, end); resv_map_put(vma); @@ -3078,7 +3082,7 @@ int hugetlb_reserve_pages(struct inode *inode, if (!vma || vma-vm_flags VM_MAYSHARE) { resv_map = inode-i_mapping-private_data; - chg = region_chg(resv_map-regions, from, to); + chg = region_chg(resv_map, from, to); } else { resv_map = resv_map_alloc(); @@ -3124,7 +3128,7 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma-vm_flags VM_MAYSHARE) - region_add(resv_map-regions, from, to); + region_add(resv_map, from, to); return 0; out_err: if (vma) @@ -3140,7 +3144,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) struct hugepage_subpool *spool = subpool_inode(inode); if (resv_map) - chg = region_truncate(resv_map-regions, offset); + chg = region_truncate(resv_map, offset); spin_lock(inode-i_lock
[PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed
If we fail with a reserved page, just calling put_page() is not sufficient, because put_page() invokes free_huge_page() as its last step, and free_huge_page() doesn't know whether the page came from the reserved pool or not, so it doesn't do anything related to the reserve count. This leaves the reserve count lower than it should be, because the reserve count was already decremented in dequeue_huge_page_vma(). This patch fixes this situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb8a45f..6a9ec69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
 	return NULL;
 }
 
+static void put_huge_page(struct page *page, int use_reserve)
+{
+	struct hstate *h = page_hstate(page);
+	struct hugepage_subpool *spool =
+		(struct hugepage_subpool *)page_private(page);
+
+	if (!use_reserve) {
+		put_page(page);
+		return;
+	}
+
+	if (!put_page_testzero(page))
+		return;
+
+	set_page_private(page, 0);
+	page->mapping = NULL;
+	BUG_ON(page_count(page));
+	BUG_ON(page_mapcount(page));
+
+	spin_lock(&hugetlb_lock);
+	hugetlb_cgroup_uncharge_page(hstate_index(h),
+				pages_per_huge_page(h), page);
+	enqueue_huge_page(h, page);
+	h->resv_huge_pages++;
+	spin_unlock(&hugetlb_lock);
+	hugepage_subpool_put_pages(spool, 1);
+}
+
 static void free_huge_page(struct page *page)
 {
 	/*
@@ -2625,7 +2653,7 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-	page_cache_release(new_page);
+	put_huge_page(new_page, use_reserve);
 out_old_page:
 	page_cache_release(old_page);
 out_lock:
@@ -2725,7 +2753,7 @@ retry:
 
 		err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
 		if (err) {
-			put_page(page);
+			put_huge_page(page, use_reserve);
 			if (err == -EEXIST)
 				goto retry;
 			goto out;
@@ -2798,7 +2826,7 @@ backout:
 	spin_unlock(&mm->page_table_lock);
 backout_unlocked:
 	unlock_page(page);
-	put_page(page);
+	put_huge_page(page, use_reserve);
 	goto out;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 18/18] mm, hugetlb: remove a hugetlb_instantiation_mutex
Now we have prepared the infrastructure needed to remove this awkward mutex which serializes all faulting tasks, so remove it. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 909075b..4fab047 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2533,9 +2533,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
- * cannot race with other handlers or page migration.
- * Keep the pte_same checks anyway to make transition from the mutex easier.
+ * Called with pte_page locked so we cannot race with page migration.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pte_t pte,
@@ -2844,7 +2842,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	struct page *page = NULL;
 	struct page *pagecache_page = NULL;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
 
 	address &= huge_page_mask(h);
@@ -2864,17 +2861,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
-	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
-		goto out_mutex;
-	}
+	if (huge_pte_none(entry))
+		return hugetlb_no_page(mm, vma, address, ptep, flags);
 
 	ret = 0;
 
@@ -2887,10 +2876,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		if (vma_has_reserves(h, vma, address) < 0) {
-			ret = VM_FAULT_OOM;
-			goto out_mutex;
-		}
+		if (vma_has_reserves(h, vma, address) < 0)
+			return VM_FAULT_OOM;
 
 		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
@@ -2939,9 +2926,6 @@ out_page_table_lock:
 	unlock_page(page);
 	put_page(page);
 
-out_mutex:
-	mutex_unlock(&hugetlb_instantiation_mutex);
-
 	return ret;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()
If we fail after having allocated a hugepage, it is hard to recover properly. One such example is the reserve count; we don't have any method to recover it. Although I will introduce a function to recover the reserve count in a following patch, it is better to avoid allocating a hugepage whenever possible. So move up anon_vma_prepare(), which can fail in an OOM situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 683fd38..bb8a45f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2536,6 +2536,15 @@ retry_avoidcopy:
 	/* Drop page_table_lock as buddy allocator may be called */
 	spin_unlock(&mm->page_table_lock);
 
+	/*
+	 * When the original hugepage is shared one, it does not have
+	 * anon_vma prepared.
+	 */
+	if (unlikely(anon_vma_prepare(vma))) {
+		ret = VM_FAULT_OOM;
+		goto out_old_page;
+	}
+
 	use_reserve = vma_has_reserves(h, vma, address);
 	if (use_reserve == -ENOMEM) {
 		ret = VM_FAULT_OOM;
@@ -2590,15 +2599,6 @@ retry_avoidcopy:
 		goto out_lock;
 	}
 
-	/*
-	 * When the original hugepage is shared one, it does not have
-	 * anon_vma prepared.
-	 */
-	if (unlikely(anon_vma_prepare(vma))) {
-		ret = VM_FAULT_OOM;
-		goto out_new_page;
-	}
-
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
@@ -2625,7 +2625,6 @@ retry_avoidcopy:
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-out_new_page:
 	page_cache_release(new_page);
 out_old_page:
 	page_cache_release(old_page);
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
If parallel faults occur, we can fail to allocate a hugepage, because many threads dequeue a hugepage to handle a fault at the same address. This causes a reserved pool shortage just for a little while, and a faulting thread that is ensured to have enough reserved hugepages can get a SIGBUS signal. To solve this problem we already have a nice solution, namely the hugetlb_instantiation_mutex, which blocks other threads from diving into the fault handler. It solves the problem clearly, but it introduces performance degradation, because it serializes all fault handling. Now, I try to remove the hugetlb_instantiation_mutex to get rid of that performance degradation. A prerequisite is that other threads should not get a SIGBUS if they are ensured to have enough reserved pages. For this purpose, if we fail to allocate a new hugepage with use_reserve, we return just 0 instead of VM_FAULT_SIGBUS. use_reserve represents that this user is a legitimate one who is ensured to have enough reserved pages. This prevents these threads from getting a SIGBUS signal and makes them retry fault handling. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a9ec69..909075b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2623,7 +2623,10 @@ retry_avoidcopy:
 			WARN_ON_ONCE(1);
 		}
 
-		ret = VM_FAULT_SIGBUS;
+		if (use_reserve)
+			ret = 0;
+		else
+			ret = VM_FAULT_SIGBUS;
 		goto out_lock;
 	}
 
@@ -2741,7 +2744,10 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = VM_FAULT_SIGBUS;
+			if (use_reserve)
+				ret = 0;
+			else
+				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 14/18] mm, hugetlb: clean-up error handling in hugetlb_cow()
Current code include 'Caller expects lock to be held' in every error path. We can clean-up it as we do error handling in one place. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 255bd9e..683fd38 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2516,6 +2516,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; int use_reserve, outside_reserve = 0; + int ret = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2537,11 +2538,8 @@ retry_avoidcopy: use_reserve = vma_has_reserves(h, vma, address); if (use_reserve == -ENOMEM) { - page_cache_release(old_page); - - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto out_old_page; } /* @@ -2588,9 +2586,8 @@ retry_avoidcopy: WARN_ON_ONCE(1); } - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_SIGBUS; + ret = VM_FAULT_SIGBUS; + goto out_lock; } /* @@ -2598,11 +2595,8 @@ retry_avoidcopy: * anon_vma prepared. */ if (unlikely(anon_vma_prepare(vma))) { - page_cache_release(new_page); - page_cache_release(old_page); - /* Caller expects lock to be held */ - spin_lock(mm-page_table_lock); - return VM_FAULT_OOM; + ret = VM_FAULT_OOM; + goto out_new_page; } copy_user_huge_page(new_page, old_page, address, vma, @@ -2630,12 +2624,15 @@ retry_avoidcopy: } spin_unlock(mm-page_table_lock); mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); + +out_new_page: page_cache_release(new_page); +out_old_page: page_cache_release(old_page); - +out_lock: /* Caller expects lock to be held */ spin_lock(mm-page_table_lock); - return 0; + return ret; } /* Return the pagecache page at a given address within a VMA */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages
alloc_huge_page_node() uses dequeue_huge_page_node() without any validation check, so it can steal a reserved page unconditionally. To fix it, check the number of free huge pages in alloc_huge_page_node(). Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6782b41..d971233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
  */
 struct page *alloc_huge_page_node(struct hstate *h, int nid)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	spin_lock(&hugetlb_lock);
-	page = dequeue_huge_page_node(h, nid);
+	if (h->free_huge_pages - h->resv_huge_pages > 0)
+		page = dequeue_huge_page_node(h, nid);
 	spin_unlock(&hugetlb_lock);
 
 	if (!page)
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 12/18] mm, hugetlb: remove a check for return value of alloc_huge_page()
Now, alloc_huge_page() only returns -ENOSPC on failure, so we don't need to worry about other return values. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 94173e0..35ccdad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2562,7 +2562,6 @@ retry_avoidcopy:
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
 	if (IS_ERR(new_page)) {
-		long err = PTR_ERR(new_page);
 		page_cache_release(old_page);
 
 		/*
@@ -2591,10 +2590,7 @@ retry_avoidcopy:
 
 		/* Caller expects lock to be held */
 		spin_lock(&mm->page_table_lock);
-		if (err == -ENOMEM)
-			return VM_FAULT_OOM;
-		else
-			return VM_FAULT_SIGBUS;
+		return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -2720,11 +2716,7 @@ retry:
 
 		page = alloc_huge_page(vma, address, use_reserve);
 		if (IS_ERR(page)) {
-			ret = PTR_ERR(page);
-			if (ret == -ENOMEM)
-				ret = VM_FAULT_OOM;
-			else
-				ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 13/18] mm, hugetlb: grab a page_table_lock after page_cache_release
We don't need to hold the page_table_lock when we release a page, so defer grabbing the page_table_lock until after the release. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 35ccdad..255bd9e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2630,10 +2630,11 @@ retry_avoidcopy:
 	}
 	spin_unlock(&mm->page_table_lock);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	/* Caller expects lock to be held */
-	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
+
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	return 0;
 }
 
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 11/18] mm, hugetlb: move down outside_reserve check
Just move the outside_reserve check down. This makes the code more readable. There is no functional change. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f31ca5..94173e0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2530,20 +2530,6 @@ retry_avoidcopy:
 		return 0;
 	}
 
-	/*
-	 * If the process that created a MAP_PRIVATE mapping is about to
-	 * perform a COW due to a shared page count, attempt to satisfy
-	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
-	 * consumed or not. If reserves were used, a partial faulted mapping
-	 * at the time of fork() could consume its reserves on COW instead
-	 * of the full address range.
-	 */
-	if (!(vma->vm_flags & VM_MAYSHARE) &&
-			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-			old_page != pagecache_page)
-		outside_reserve = 1;
-
 	page_cache_get(old_page);
 
 	/* Drop page_table_lock as buddy allocator may be called */
@@ -2557,6 +2543,20 @@ retry_avoidcopy:
 		spin_lock(&mm->page_table_lock);
 		return VM_FAULT_OOM;
 	}
+
+	/*
+	 * If the process that created a MAP_PRIVATE mapping is about to
+	 * perform a COW due to a shared page count, attempt to satisfy
+	 * the allocation without using the existing reserves. The pagecache
+	 * page is used to determine if the reserve at this address was
+	 * consumed or not. If reserves were used, a partial faulted mapping
+	 * at the time of fork() could consume its reserves on COW instead
+	 * of the full address range.
+	 */
+	if (!(vma->vm_flags & VM_MAYSHARE) &&
+			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+			old_page != pagecache_page)
+		outside_reserve = 1;
 	use_reserve = use_reserve && !outside_reserve;
 
 	new_page = alloc_huge_page(vma, address, use_reserve);
-- 
1.7.9.5

-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()
To implement a graceful failure handling, we need to know whether allocation request is for reserved pool or not, on higher level. In this patch, we just move up vma_has_reseve() to caller function in order to know it. There is no functional change. Following patches implement a grace failure handling and remove a hugetlb_instantiation_mutex. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a66226e..5f31ca5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h, } static struct page *alloc_huge_page(struct vm_area_struct *vma, - unsigned long addr, int avoid_reserve) + unsigned long addr, int use_reserve) { struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct page *page; - int ret, idx, use_reserve; + int ret, idx; struct hugetlb_cgroup *h_cg; idx = hstate_index(h); @@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, * need pages and subpool limit allocated allocated if no reserve * mapping overlaps. */ - use_reserve = vma_has_reserves(h, vma, addr); - if (use_reserve 0) - return ERR_PTR(-ENOMEM); - - use_reserve = use_reserve !avoid_reserve; if (!use_reserve (hugepage_subpool_get_pages(spool, 1) 0)) return ERR_PTR(-ENOSPC); @@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, { struct hstate *h = hstate_vma(vma); struct page *old_page, *new_page; - int outside_reserve = 0; + int use_reserve, outside_reserve = 0; unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ @@ -2553,7 +2548,18 @@ retry_avoidcopy: /* Drop page_table_lock as buddy allocator may be called */ spin_unlock(mm-page_table_lock); - new_page = alloc_huge_page(vma, address, outside_reserve); + + use_reserve = vma_has_reserves(h, vma, address); + if (use_reserve == -ENOMEM) { + page_cache_release(old_page); + + /* Caller expects lock to be held */ + spin_lock(mm-page_table_lock); + return VM_FAULT_OOM; + } + use_reserve = use_reserve !outside_reserve; + + new_page = alloc_huge_page(vma, address, use_reserve); if (IS_ERR(new_page)) { long err = PTR_ERR(new_page); @@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page; struct address_space *mapping; pte_t new_pte; + int use_reserve = 0; /* * Currently, we are forced to kill the process in the event the @@ -2704,7 +2711,14 @@ retry: size = i_size_read(mapping-host) huge_page_shift(h); if (idx = size) goto out; - page = alloc_huge_page(vma, address, 0); + + use_reserve = vma_has_reserves(h, vma, address); + if (use_reserve == -ENOMEM) { + ret = VM_FAULT_OOM; + goto out; + } + + page = alloc_huge_page(vma, address, use_reserve); if (IS_ERR(page)) { ret = PTR_ERR(page); if (ret == -ENOMEM) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/