[Devel] [PATCH] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 11:17:43 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  hmm, if we'll do that, I think we need to do that under pte_lock in
  mem_cgroup_move_charge_pte_range(). But, we can't do 
  wait_on_page_writeback()
  under pte_lock, right? Or, we need re-organize current move-charge 
  implementation.
  
 Nice catch. I think releasing pte_lock() is okay. (and it should be released)
 
 IIUC, task's css_set() points to new cgroup when move is called. Then,
 it's not necessary to take pte_lock, I guess.
 (And taking pte_lock too long is not appreciated..)
 
 I'll write a sample patch today.
 
Here.
==
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, at task migration among cgroups, the memory cgroup scans the page table and moves
accounting if the flags are properly set.

The core code, mem_cgroup_move_charge_pte_range() does

pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();

for the pte entries on a 3rd (or 2nd) level page table.

This pte_offset_map_lock is held for a bit too long. This patch modifies the routine as

pte_offset_map_lock()
for 32 pages:
  find_and_get a page
  record it
pte_offset_map_unlock()
for all recorded pages
  isolate it from LRU.
  move charge
  putback to LRU
for all recorded pages
  put_page()

Note: pages newly charged while we move accounts are charged to the new group.
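
For clarity, a minimal sketch of the batched routine inside mem_cgroup_move_charge_pte_range()
(simplified from the hunks below; the precharge bookkeeping, the MC_TARGET_SWAP case and the
retry path are omitted, and isolate_lru_page()/putback_lru_page()/lookup_page_cgroup() are
existing kernel helpers, not part of this patch):
==
	/* Phase 1: hold the pte lock only long enough to grab page references. */
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
		pte_t ptent = *(pte++);

		if (is_target_pte_for_mc(vma, addr, ptent, &target[num]))
			num++;		/* reference taken, entry recorded */
	}
	pte_offset_map_unlock(pte - 1, ptl);

	/* Phase 2: LRU isolation and charge moving run without the pte lock. */
	for (index = 0; index < num; index++) {
		struct page *page = target[index].val.page;

		if (target[index].type != MC_TARGET_PAGE)
			continue;
		if (!isolate_lru_page(page)) {
			mem_cgroup_move_account(lookup_page_cgroup(page),
						mc.from, mc.to, false);
			putback_lru_page(page);
		}
		put_page(page);		/* drop the reference from phase 1 */
	}
==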

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memcontrol.c |   92 
 1 file changed, 60 insertions(+), 32 deletions(-)

Index: mmotm-0928/mm/memcontrol.c
===
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -4475,17 +4475,22 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-   struct page *page;
-   swp_entry_t ent;
-};
 
 enum mc_target_type {
-   MC_TARGET_NONE, /* not used */
+   MC_TARGET_NONE, /* used as failure code(0) */
MC_TARGET_PAGE,
MC_TARGET_SWAP,
 };
 
+struct mc_target {
+   enum mc_target_type type;
+   union {
+   struct page *page;
+   swp_entry_t ent;
+   } val;
+};
+
+
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
 {
@@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-   unsigned long addr, pte_t ptent, union mc_target *target)
+   unsigned long addr, pte_t ptent, struct mc_target *target)
 {
struct page *page = NULL;
struct page_cgroup *pc;
@@ -4587,7 +4592,7 @@ static int is_target_pte_for_mc(struct v
if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
-   target->page = page;
+   target->val.page = page;
}
if (!ret || !target)
put_page(page);
@@ -4597,8 +4602,10 @@ static int is_target_pte_for_mc(struct v
css_id(mc.from->css) == lookup_swap_cgroup(ent)) {
ret = MC_TARGET_SWAP;
if (target)
-   target->ent = ent;
+   target->val.ent = ent;
}
+   if (target)
+   target->type = ret;
return ret;
 }
 
@@ -4751,6 +4758,9 @@ static void mem_cgroup_cancel_attach(str
mem_cgroup_clear_mc();
 }
 
+
+#define MC_MOVE_ONCE   (32)
+
 static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -4759,26 +4769,47 @@ static int mem_cgroup_move_charge_pte_ra
struct vm_area_struct *vma = walk->private;
pte_t *pte;
spinlock_t *ptl;
+   struct mc_target *target;
+   int index, num;
+
+   target = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
+   if (!target)
+   return -ENOMEM;
 
 retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-   for (; addr != end; addr += PAGE_SIZE) {
+   for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
-   union mc_target target;
-   int type;
+   ret = is_target_pte_for_mc(vma, addr, ptent, &target[num]);
+   if (!ret)
+   

[Devel] [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
But set_page_writeback() requires it...
(Maybe adding a special function for clear_page_writeback() is better than
 adding more complexity to the switch() in update_page_stat())

==
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, at page information accounting, we do lock_page_cgroup() if pc->mem_cgroup
points to a cgroup which someone is moving charges from.

When supporting dirty-page accounting, one of the troubles is the writeback bit.
In general, writeback can be cleared from IRQ context. To update the writeback bit
under lock_page_cgroup() in a safe way, we would have to disable IRQs, or do
something similar.

This patch waits for completion of writeback under lock_page() and then does
lock_page_cgroup() in a safe way. (We will not get end_io via IRQ context after that.)

By this, writeback accounting will never race with account_move() and
it can always trust pc->mem_cgroup _without_ any lock.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memcontrol.c |   18 ++
 1 file changed, 18 insertions(+)

Index: mmotm-0928/mm/memcontrol.c
===
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
 /*
  * check whether the @pc is valid for moving account and call
  * __mem_cgroup_move_account()
+ * Don't call this under pte_lock etc...we'll do lock_page() and wait for
+ * the end of I/O.
  */
 static int mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
int ret = -EINVAL;
+
+   /*
+* We move several flags and accounting information here. So we need to
+* avoid races with the update_stat routines. For most of those routines,
+* lock_page_cgroup() is enough for avoiding the race. But we need to take
+* care of IRQ context. If a flag update comes from IRQ context, this
+* move account will be racy (and cause a deadlock in lock_page_cgroup()).
+*
+* Now, the only race we have is the Writeback flag. We wait for it to be
+* cleared before starting our job.
+*/
+
+   lock_page(pc->page);
+   wait_on_page_writeback(pc->page);
+
lock_page_cgroup(pc);
if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
unlock_page_cgroup(pc);
+   unlock_page(pc->page);
/*
 * check events
 */



[Devel] Re: [PATCH] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 16:28:11 +0900
Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

 On Thu, 7 Oct 2010 15:21:11 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Thu, 7 Oct 2010 11:17:43 +0900
  KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
   
hmm, if we'll do that, I think we need to do that under pte_lock in
mem_cgroup_move_charge_pte_range(). But, we can't do 
wait_on_page_writeback()
under pte_lock, right? Or, we need re-organize current move-charge 
implementation.

   Nice catch. I think releasing pte_lock() is okay. (and it should be
   released)
   
   IIUC, task's css_set() points to new cgroup when move is called. Then,
   it's not necessary to take pte_lock, I guess.
   (And taking pte_lock too long is not appreciated..)
   
   I'll write a sample patch today.
   
  Here.
 Great!
 
  ==
  From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  
  Now, at task migration among cgroup, memory cgroup scans page table and 
  moving
  account if flags are properly set.
  
  The core code, mem_cgroup_move_charge_pte_range() does
  
  pte_offset_map_lock();
  for all ptes in a page table:
  1. look into page table, find_and_get a page
  2. remove it from LRU.
  3. move charge.
  4. putback to LRU. put_page()
  pte_offset_map_unlock();
  
  for pte entries on a 3rd(2nd) level page table.
  
  This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
  
  pte_offset_map_lock()
  for 32 pages:
find_and_get a page
record it
  pte_offset_map_unlock()
  for all recorded pages
isolate it from LRU.
move charge
putback to LRU
  for all recorded pages
put_page()
  
  Note: newly-charged pages while we move account are charged to the new 
  group.
  
  Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  ---
   mm/memcontrol.c |   92 
  
   1 file changed, 60 insertions(+), 32 deletions(-)
  
  Index: mmotm-0928/mm/memcontrol.c
  ===
  --- mmotm-0928.orig/mm/memcontrol.c
  +++ mmotm-0928/mm/memcontrol.c
  @@ -4475,17 +4475,22 @@ one_by_one:
*
* Called with pte lock held.
*/
  -union mc_target {
  -   struct page *page;
  -   swp_entry_t ent;
  -};
   
   enum mc_target_type {
  -   MC_TARGET_NONE, /* not used */
  +   MC_TARGET_NONE, /* used as failure code(0) */
  MC_TARGET_PAGE,
  MC_TARGET_SWAP,
   };
   
  +struct mc_target {
  +   enum mc_target_type type;
  +   union {
  +   struct page *page;
  +   swp_entry_t ent;
  +   } val;
  +};
  +
  +
   static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
  unsigned long addr, pte_t ptent)
   {
  @@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
   }
   
   static int is_target_pte_for_mc(struct vm_area_struct *vma,
  -   unsigned long addr, pte_t ptent, union mc_target *target)
  +   unsigned long addr, pte_t ptent, struct mc_target *target)
   {
  struct page *page = NULL;
  struct page_cgroup *pc;
  @@ -4587,7 +4592,7 @@ static int is_target_pte_for_mc(struct v
  if (PageCgroupUsed(pc)  pc-mem_cgroup == mc.from) {
  ret = MC_TARGET_PAGE;
  if (target)
  -   target-page = page;
  +   target-val.page = page;
  }
  if (!ret || !target)
  put_page(page);
  @@ -4597,8 +4602,10 @@ static int is_target_pte_for_mc(struct v
  css_id(mc.from-css) == lookup_swap_cgroup(ent)) {
  ret = MC_TARGET_SWAP;
  if (target)
  -   target-ent = ent;
  +   target-val.ent = ent;
  }
  +   if (target)
  +   target-type = ret;
  return ret;
   }
   
  @@ -4751,6 +4758,9 @@ static void mem_cgroup_cancel_attach(str
  mem_cgroup_clear_mc();
   }
   
  +
  +#define MC_MOVE_ONCE   (32)
  +
   static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
  unsigned long addr, unsigned long end,
  struct mm_walk *walk)
  @@ -4759,26 +4769,47 @@ static int mem_cgroup_move_charge_pte_ra
  struct vm_area_struct *vma = walk-private;
  pte_t *pte;
  spinlock_t *ptl;
  +   struct mc_target *target;
  +   int index, num;
  +
  +   target = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
 hmm? I can't see it freed anywhere.
 
leaked..


 Considering you reset target[]->type to MC_TARGET_NONE, did you intend to
 reuse target[] while walking the page table ?
yes.

 If so, how about adding a new member(struct mc_target *target) to 
 

[Devel] [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 16:42:04 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
  If so, how about adding a new member(struct mc_target *target) to 
  move_charge_struct,
  and allocate/free it at mem_cgroup_move_charge() ?
  
 Hmm, sounds nice.
 

Here.
==
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, at task migration among cgroups, the memory cgroup scans the page table and moves
accounting if the flags are properly set.

The core code, mem_cgroup_move_charge_pte_range() does

pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();

for pte entries on a 3rd level? page table.

This pte_offset_map_lock is held for a bit too long. This patch modifies the routine as

for 32 pages: pte_offset_map_lock()
  find_and_get a page
  record it
  pte_offset_map_unlock()
for all recorded pages
  isolate it from LRU.
  move charge
  putback to LRU
for all recorded pages
  put_page()

Changelog: v1->v2
 - removed kzalloc() of mc_target. preallocate it on mc

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 mm/memcontrol.c |   95 ++--
 1 file changed, 59 insertions(+), 36 deletions(-)

Index: mmotm-0928/mm/memcontrol.c
===
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -276,6 +276,21 @@ enum move_type {
NR_MOVE_TYPE,
 };
 
+enum mc_target_type {
+   MC_TARGET_NONE, /* used as failure code(0) */
+   MC_TARGET_PAGE,
+   MC_TARGET_SWAP,
+};
+
+struct mc_target {
+   enum mc_target_type type;
+   union {
+   struct page *page;
+   swp_entry_t ent;
+   } val;
+};
+#define MC_MOVE_ONCE   (32)
+
 /* mc and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
spinlock_t lock; /* for from, to, moving_task */
@@ -284,6 +299,7 @@ static struct move_charge_struct {
unsigned long precharge;
unsigned long moved_charge;
unsigned long moved_swap;
+   struct mc_target target[MC_MOVE_ONCE];
struct task_struct *moving_task;/* a task moving charges */
wait_queue_head_t waitq;/* a waitq for other context */
 } mc = {
@@ -291,6 +307,7 @@ static struct move_charge_struct {
.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
 };
 
+
 static bool move_anon(void)
 {
return test_bit(MOVE_CHARGE_TYPE_ANON,
@@ -4475,16 +4492,7 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-   struct page *page;
-   swp_entry_t ent;
-};
 
-enum mc_target_type {
-   MC_TARGET_NONE, /* not used */
-   MC_TARGET_PAGE,
-   MC_TARGET_SWAP,
-};
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
@@ -4561,7 +4569,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-   unsigned long addr, pte_t ptent, union mc_target *target)
+   unsigned long addr, pte_t ptent, struct mc_target *target)
 {
struct page *page = NULL;
struct page_cgroup *pc;
@@ -4587,7 +4595,7 @@ static int is_target_pte_for_mc(struct v
if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
-   target->page = page;
+   target->val.page = page;
}
if (!ret || !target)
put_page(page);
@@ -4597,8 +4605,10 @@ static int is_target_pte_for_mc(struct v
css_id(mc.from->css) == lookup_swap_cgroup(ent)) {
ret = MC_TARGET_SWAP;
if (target)
-   target->ent = ent;
+   target->val.ent = ent;
}
+   if (target)
+   target->type = ret;
return ret;
 }
 
@@ -4759,26 +4769,42 @@ static int mem_cgroup_move_charge_pte_ra
struct vm_area_struct *vma = walk->private;
pte_t *pte;
spinlock_t *ptl;
+   int index, num;
 
 retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-   for (; addr != end; addr += PAGE_SIZE) {
+   for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
-   union mc_target target;
-   int type;
+   ret = is_target_pte_for_mc(vma, addr, ptent, &mc.target[num]);
+   if (!ret)
+  

[Devel] Re: [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 15:24:22 +0900
KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

 Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
 But set_page_writeback() requires it...
 (Maybe adding a special function for clear_page_writeback() is better than
  adding more complexity to the switch() in update_page_stat())
 

I'm testing code like this.
==
	/* pc->mem_cgroup is unstable ? */
	if (unlikely(mem_cgroup_stealed(mem))) {
		/* take a lock to safely access pc->mem_cgroup */
		if (!in_interrupt()) {
			lock_page_cgroup(pc);
			need_unlock = true;
			mem = pc->mem_cgroup;
			if (!mem || !PageCgroupUsed(pc))
				goto out;
		} else if (idx == MEMCG_NR_FILE_WRITEBACK && (val < 0)) {
			/* clearing Writeback from IRQ context is allowed without the lock */
		} else
			BUG();
	}
==
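
As a rough, hypothetical sketch of the "special function" idea (not from any posted
patch; the helper name is invented, and MEMCG_NR_FILE_WRITEBACK plus the per-cpu stat
layout are assumed from the dirty-limit series): once account_move() waits for
PG_writeback under lock_page(), the clear side can trust pc->mem_cgroup without
lock_page_cgroup(), even from IRQ context.
==
/* Hypothetical helper for the writeback-clear path only. */
static void mem_cgroup_dec_writeback_stat(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	struct mem_cgroup *mem;

	if (unlikely(!pc))
		return;
	/* account_move() waits for PG_writeback, so mem_cgroup is stable here. */
	mem = pc->mem_cgroup;
	if (!mem || !PageCgroupUsed(pc))
		return;
	this_cpu_dec(mem->stat->count[MEMCG_NR_FILE_WRITEBACK]);
}
==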
Thanks,
-Kame



[Devel] Re: lxc performance?

2010-10-07 Thread Pavel Labushev
06.10.2010 23:41, MALATTAR wrote:

 the container dora1, where I launch an instance of my IDS, does not use
 more than 70 MB of memory even though the memory limit for it is much
 bigger than this value,

How do you measure memory usage? What's in memory.max_usage_in_bytes of
the container's cgroup?
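
(A minimal userspace sketch for checking that counter; the mount point and the
container group name are assumptions, adjust them to your setup:)
==
#include <stdio.h>

int main(void)
{
	/* Assumed path; depends on where the memory cgroup hierarchy is mounted. */
	const char *path = "/cgroup/dora1/memory.max_usage_in_bytes";
	unsigned long long bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &bytes) == 1)
		printf("max usage: %llu bytes (%llu MB)\n", bytes, bytes >> 20);
	fclose(f);
	return 0;
}
==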


[Devel] [PATCH 2/2] Kconfig : default all the namespaces to 'yes'

2010-10-07 Thread Daniel Lezcano
As the different namespaces depend on 'CONFIG_NAMESPACES', it is
logical to enable all the namespaces when we enable NAMESPACES.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a52124e..a7fe61e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -739,6 +739,7 @@ config NAMESPACES
 config UTS_NS
bool "UTS namespace"
depends on NAMESPACES
+   default y
help
  In this namespace tasks see different info provided with the
  uname() system call
@@ -746,6 +747,7 @@ config UTS_NS
 config IPC_NS
bool "IPC namespace"
depends on NAMESPACES && (SYSVIPC || POSIX_MQUEUE)
+   default y
help
  In this namespace tasks work with IPC ids which correspond to
  different IPC objects in different namespaces.
@@ -753,6 +755,7 @@ config IPC_NS
 config USER_NS
bool "User namespace (EXPERIMENTAL)"
depends on NAMESPACES && EXPERIMENTAL
+   default y
help
  This allows containers, i.e. vservers, to use user namespaces
  to provide different user info for different servers.
@@ -760,8 +763,8 @@ config USER_NS
 
 config PID_NS
bool "PID Namespaces"
-   default n
depends on NAMESPACES
+   default y
help
  Support process id namespaces.  This allows having multiple
  processes with the same pid as long as they are in different
@@ -769,8 +772,8 @@ config PID_NS
 
 config NET_NS
bool "Network namespace"
-   default n
depends on NAMESPACES && NET
+   default y
help
  Allow user space to create what appear to be multiple instances
  of the network stack.
-- 
1.7.0.4



[Devel] [PATCH 1/2] Kconfig : remove pid_ns and net_ns experimental

2010-10-07 Thread Daniel Lezcano
The pid namespace has been in the kernel since 2.6.27 and the net_ns
since 2.6.29. They are enabled in distros by default and used by
userspace components. They are mature enough to remove the 'experimental'
label.

Signed-off-by: Daniel Lezcano daniel.lezc...@free.fr
---
 init/Kconfig |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a175935..a52124e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -759,21 +759,18 @@ config USER_NS
  If unsure, say N.
 
 config PID_NS
-   bool "PID Namespaces (EXPERIMENTAL)"
+   bool "PID Namespaces"
default n
-   depends on NAMESPACES && EXPERIMENTAL
+   depends on NAMESPACES
help
  Support process id namespaces.  This allows having multiple
  processes with the same pid as long as they are in different
  pid namespaces.  This is a building block of containers.
 
- Unless you want to work with an experimental feature
- say N here.
-
 config NET_NS
bool "Network namespace"
default n
-   depends on NAMESPACES && EXPERIMENTAL && NET
+   depends on NAMESPACES && NET
help
  Allow user space to create what appear to be multiple instances
  of the network stack.
-- 
1.7.0.4



[Devel] Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits

2010-10-07 Thread Greg Thelen
Ciju Rajan K c...@linux.vnet.ibm.com writes:

 Greg Thelen wrote:
 Add cgroupfs interface to memcg dirty page limits:
   Direct write-out is controlled with:
   - memory.dirty_ratio
   - memory.dirty_bytes

   Background write-out is controlled with:
   - memory.dirty_background_ratio
   - memory.dirty_background_bytes

 Signed-off-by: Andrea Righi ari...@develer.com
 Signed-off-by: Greg Thelen gthe...@google.com
 ---
  mm/memcontrol.c |   89 
 +++
  1 files changed, 89 insertions(+), 0 deletions(-)

 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6ec2625..2d45a0a 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
  MEM_CGROUP_STAT_NSTATS,
  };

 +enum {
 +MEM_CGROUP_DIRTY_RATIO,
 +MEM_CGROUP_DIRTY_BYTES,
 +MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
 +MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
 +};
 +
  struct mem_cgroup_stat_cpu {
  s64 count[MEM_CGROUP_STAT_NSTATS];
  };
 @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup 
 *cgrp,
  return 0;
  }

 +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
 +{
 +struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
 +bool root;
 +
 +root = mem_cgroup_is_root(mem);
 +
 +switch (cft->private) {
 +case MEM_CGROUP_DIRTY_RATIO:
 +return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
 +case MEM_CGROUP_DIRTY_BYTES:
 +return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
 +case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 +return root ? dirty_background_ratio :
 +mem->dirty_param.dirty_background_ratio;
 +case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
 +return root ? dirty_background_bytes :
 +mem->dirty_param.dirty_background_bytes;
 +default:
 +BUG();
 +}
 +}
 +
 +static int
 +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 +{
 +struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 +int type = cft->private;
 +
 +if (cgrp->parent == NULL)
 +return -EINVAL;
 +if ((type == MEM_CGROUP_DIRTY_RATIO ||
 + type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
 +return -EINVAL;
 +switch (type) {
 +case MEM_CGROUP_DIRTY_RATIO:
 +memcg->dirty_param.dirty_ratio = val;
 +memcg->dirty_param.dirty_bytes = 0;
 +break;
 +case MEM_CGROUP_DIRTY_BYTES:
 +memcg->dirty_param.dirty_bytes = val;
 +memcg->dirty_param.dirty_ratio  = 0;
 +break;
 +case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 +memcg->dirty_param.dirty_background_ratio = val;
 +memcg->dirty_param.dirty_background_bytes = 0;
 +break;
 +case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
 +memcg->dirty_param.dirty_background_bytes = val;
 +memcg->dirty_param.dirty_background_ratio = 0;
 +break;
 +default:
 +BUG();
 +break;
 +}
 +return 0;
 +}
 +
  static struct cftype mem_cgroup_files[] = {
  {
  .name = "usage_in_bytes",
 @@ -4355,6 +4420,30 @@ static struct cftype mem_cgroup_files[] = {
  .unregister_event = mem_cgroup_oom_unregister_event,
  .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
  },
 +{
 +.name = "dirty_ratio",
 +.read_u64 = mem_cgroup_dirty_read,
 +.write_u64 = mem_cgroup_dirty_write,
 +.private = MEM_CGROUP_DIRTY_RATIO,
 +},
 +{
 +.name = "dirty_bytes",
 +.read_u64 = mem_cgroup_dirty_read,
 +.write_u64 = mem_cgroup_dirty_write,
 +.private = MEM_CGROUP_DIRTY_BYTES,
 +},
 +{
   
 Is it a good idea to rename dirty_bytes to dirty_limit_in_bytes ?
 So that it matches the naming convention of the other memcg tunables.
 We already have memory.memsw.limit_in_bytes, memory.limit_in_bytes,
 memory.soft_limit_in_bytes, etc.

I see your point in trying to be more internally consistent with the other
memcg counters.

It's a trade-off: either use names consistent with /proc/sys/vm, or use
names similar to the other memory.* control files.  I prefer your suggestion
and will rename as you suggested, unless I hear strong objection.

 +.name = "dirty_background_ratio",
 +.read_u64 = mem_cgroup_dirty_read,
 +.write_u64 = mem_cgroup_dirty_write,
 +.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
 +},
 +{
 +.name = "dirty_background_bytes",
 +.read_u64 = mem_cgroup_dirty_read,
 +.write_u64 = mem_cgroup_dirty_write,
 +.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
   
 Similarly dirty_background_bytes to dirty_background_limit_in_bytes ?
 +},
  };

  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
   

PS: I am collecting performance data on patch series 
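
For reference, a minimal userspace sketch of driving the new files (not part of the
patch; the mount point and the group name "A" are assumptions). Writing a ratio
clears the corresponding *_bytes value, as mem_cgroup_dirty_write() above shows:
==
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Assumed location of the memcg hierarchy and of the group. */
	const char *dir = "/cgroup/memory/A";
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.dirty_ratio", dir);
	if (write_str(path, "10"))	/* limit dirty memory in the group to 10% */
		perror(path);

	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("memory.dirty_ratio = %s", buf);
	if (f)
		fclose(f);
	return 0;
}
==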

[Devel] Re: [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread Minchan Kim
Hi Kame,

On Thu, Oct 7, 2010 at 3:24 PM, KAMEZAWA Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 Greg, I think clear_page_writeback() will not require _any_ locks with this 
 patch.
 But set_page_writeback() requires it...
 (Maybe adding a special function for clear_page_writeback() is better rather 
 than
  adding some complex to switch() in update_page_stat())

 ==
 From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

 Now, at page information accounting, we do lock_page_cgroup() if 
 pc-mem_cgroup
 points to a cgroup where someone is moving charges from.

 At supporing dirty-page accounting, one of troubles is writeback bit.
 In general, writeback can be cleared via IRQ context. To update writeback bit
 with lock_page_cgroup() in safe way, we'll have to disable IRQ.
 or do something.

 This patch waits for completion of writeback under lock_page() and do
 lock_page_cgroup() in safe way. (We never got end_io via IRQ context.)

 By this, writeback-accounting will never see race with account_move() and
 it can trust pc-mem_cgroup always _without_ any lock.

 Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 ---
  mm/memcontrol.c |   18 ++
  1 file changed, 18 insertions(+)

 Index: mmotm-0928/mm/memcontrol.c
 ===
 --- mmotm-0928.orig/mm/memcontrol.c
 +++ mmotm-0928/mm/memcontrol.c
 @@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
  /*
  * check whether the @pc is valid for moving account and call
  * __mem_cgroup_move_account()
 + * Don't call this under pte_lock etc...we'll do lock_page() and wait for
 + * the end of I/O.
  */
  static int mem_cgroup_move_account(struct page_cgroup *pc,
                struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
  {
        int ret = -EINVAL;
 +
 +       /*
 +        * We move severl flags and accounting information here. So we need to
 +        * avoid the races with update_stat routines. For most of routines,
 +        * lock_page_cgroup() is enough for avoiding race. But we need to take
 +        * care of IRQ context. If flag updates comes from IRQ context, This
 +        * move account will be racy (and cause deadlock in 
 lock_page_cgroup())
 +        *
 +        * Now, the only race we have is Writeback flag. We wait for it 
 cleared
 +        * before starting our jobs.
 +        */
 +
 +       lock_page(pc-page);
 +       wait_on_page_writeback(pc-page);
 +
        lock_page_cgroup(pc);
        if (PageCgroupUsed(pc)  pc-mem_cgroup == from) {
                __mem_cgroup_move_account(pc, from, to, uncharge);
                ret = 0;
        }
        unlock_page_cgroup(pc);
 +       unlock_page(pc-page);
        /*
         * check events
         */



Looks good to me.
But let me ask a question.
Why does only move_account need this logic?
Is this the only deadlock candidate?
How about mem_cgroup_prepare_migration?

unmap_and_move
lock_page
mem_cgroup_prepare_migration
lock_page_cgroup
...
softirq happen
lock_page_cgroup


If the race happens only between move_account and writeback, please describe
it in a comment.
It would help reviewing the code in the future.

-- 
Kind regards,
Minchan Kim


[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread Daisuke Nishimura
On Thu, 7 Oct 2010 16:14:54 -0700
Andrew Morton a...@linux-foundation.org wrote:

 On Thu, 7 Oct 2010 17:04:05 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  Now, at task migration among cgroup, memory cgroup scans page table and 
  moving
  account if flags are properly set.
  
  The core code, mem_cgroup_move_charge_pte_range() does
  
  pte_offset_map_lock();
  for all ptes in a page table:
  1. look into page table, find_and_get a page
  2. remove it from LRU.
  3. move charge.
  4. putback to LRU. put_page()
  pte_offset_map_unlock();
  
  for pte entries on a 3rd level? page table.
  
  This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
  
  for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
  for all recorded pages
isolate it from LRU.
move charge
putback to LRU
  for all recorded pages
put_page()
 
 The patch makes the code larger, more complex and slower!
 
Before this patch:
   text    data     bss     dec     hex filename
  27163   11782    4100   43045    a825 mm/memcontrol.o

After this patch:
   text    data     bss     dec     hex filename
  27307   12294    4100   43701    aab5 mm/memcontrol.o

hmm, allocating mc.target[] statically might be bad, but I'm now wondering
whether I could allocate mc itself dynamically (I'll try).

 I do think we're owed a more complete description of its benefits than
 seems a bit long.  Have problems been observed?  Any measurements
 taken?
 
IIUC, this patch is necessary for the later [PATCH] memcg: lock-free clear page writeback,
but I agree we should describe it.

Thanks,
Daisuke Nishimura.


[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 16:14:54 -0700
Andrew Morton a...@linux-foundation.org wrote:

 On Thu, 7 Oct 2010 17:04:05 +0900
 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
 
  Now, at task migration among cgroup, memory cgroup scans page table and 
  moving
  account if flags are properly set.
  
  The core code, mem_cgroup_move_charge_pte_range() does
  
  pte_offset_map_lock();
  for all ptes in a page table:
  1. look into page table, find_and_get a page
  2. remove it from LRU.
  3. move charge.
  4. putback to LRU. put_page()
  pte_offset_map_unlock();
  
  for pte entries on a 3rd level? page table.
  
  This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
  
  for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
  for all recorded pages
isolate it from LRU.
move charge
putback to LRU
  for all recorded pages
put_page()
 
 The patch makes the code larger, more complex and slower!
 

Slower ?

 I do think we're owed a more complete description of its benefits than
 seems a bit long.  Have problems been observed?  Any measurements
 taken?
 

I'll rewrite the whole patch against today's mmotm.

Thanks,
-Kame

 
 



[Devel] Re: [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Fri, 8 Oct 2010 08:35:30 +0900
Minchan Kim minchan@gmail.com wrote:

 Hi Kame,
 
 On Thu, Oct 7, 2010 at 3:24 PM, KAMEZAWA Hiroyuki
 kamezawa.hir...@jp.fujitsu.com wrote:
  Greg, I think clear_page_writeback() will not require _any_ locks with this 
  patch.
  But set_page_writeback() requires it...
  (Maybe adding a special function for clear_page_writeback() is better 
  rather than
   adding some complex to switch() in update_page_stat())
 
  ==
  From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
  Now, at page information accounting, we do lock_page_cgroup() if 
  pc-mem_cgroup
  points to a cgroup where someone is moving charges from.
 
  At supporing dirty-page accounting, one of troubles is writeback bit.
  In general, writeback can be cleared via IRQ context. To update writeback 
  bit
  with lock_page_cgroup() in safe way, we'll have to disable IRQ.
  or do something.
 
  This patch waits for completion of writeback under lock_page() and do
  lock_page_cgroup() in safe way. (We never got end_io via IRQ context.)
 
  By this, writeback-accounting will never see race with account_move() and
  it can trust pc-mem_cgroup always _without_ any lock.
 
  Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
  ---
   mm/memcontrol.c |   18 ++
   1 file changed, 18 insertions(+)
 
  Index: mmotm-0928/mm/memcontrol.c
  ===
  --- mmotm-0928.orig/mm/memcontrol.c
  +++ mmotm-0928/mm/memcontrol.c
  @@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
   /*
   * check whether the @pc is valid for moving account and call
   * __mem_cgroup_move_account()
  + * Don't call this under pte_lock etc...we'll do lock_page() and wait for
  + * the end of I/O.
   */
   static int mem_cgroup_move_account(struct page_cgroup *pc,
                 struct mem_cgroup *from, struct mem_cgroup *to, bool 
  uncharge)
   {
         int ret = -EINVAL;
  +
  +       /*
  +        * We move severl flags and accounting information here. So we need 
  to
  +        * avoid the races with update_stat routines. For most of routines,
  +        * lock_page_cgroup() is enough for avoiding race. But we need to 
  take
  +        * care of IRQ context. If flag updates comes from IRQ context, This
  +        * move account will be racy (and cause deadlock in 
  lock_page_cgroup())
  +        *
  +        * Now, the only race we have is Writeback flag. We wait for it 
  cleared
  +        * before starting our jobs.
  +        */
  +
  +       lock_page(pc-page);
  +       wait_on_page_writeback(pc-page);
  +
         lock_page_cgroup(pc);
         if (PageCgroupUsed(pc)  pc-mem_cgroup == from) {
                 __mem_cgroup_move_account(pc, from, to, uncharge);
                 ret = 0;
         }
         unlock_page_cgroup(pc);
  +       unlock_page(pc-page);
         /*
          * check events
          */
 
 
 
 Looks good to me.
 But let me ask a question.
 Why does only move_account need this logic?

Because charge/uncharge (add/remove to the radix-tree or swapcache)
never happens while a page has PG_writeback set.

 Is this the only deadlock candidate?
yes.

 How about mem_cgroup_prepare_migration?
 
 unmap_and_move
 lock_page
 mem_cgroup_prepare_migration
 lock_page_cgroup
 ...
 softirq happen
 lock_page_cgroup
 
 
Nice catch. I'll move prepare_migration after wait_on_page_writeback().

 If the race happens only between move_account and writeback, please describe
 it in a comment.
 It would help reviewing the code in the future.
 

Sure, updates are necessary.

Thanks,
-Kame



[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread Andrew Morton
On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki 
kamezawa.hir...@jp.fujitsu.com wrote:

 On Thu, 7 Oct 2010 16:14:54 -0700
 Andrew Morton a...@linux-foundation.org wrote:
 
  On Thu, 7 Oct 2010 17:04:05 +0900
  KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
  
   Now, at task migration among cgroup, memory cgroup scans page table and 
   moving
   account if flags are properly set.
   
   The core code, mem_cgroup_move_charge_pte_range() does
   
 pte_offset_map_lock();
 for all ptes in a page table:
 1. look into page table, find_and_get a page
 2. remove it from LRU.
 3. move charge.
 4. putback to LRU. put_page()
 pte_offset_map_unlock();
   
   for pte entries on a 3rd level? page table.
   
   This pte_offset_map_lock seems a bit long. This patch modifies a rountine 
   as
   
 for 32 pages: pte_offset_map_lock()
   find_and_get a page
   record it
   pte_offset_map_unlock()
 for all recorded pages
   isolate it from LRU.
   move charge
   putback to LRU
 for all recorded pages
   put_page()
  
  The patch makes the code larger, more complex and slower!
  
 
 Slower ?

Sure.  It walks the same data three times, potentially causing
thrashing in the L1 cache.  It takes and releases locks at a higher
frequency.  It increases the text size.



[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()

2010-10-07 Thread KAMEZAWA Hiroyuki
On Thu, 7 Oct 2010 21:55:56 -0700
Andrew Morton a...@linux-foundation.org wrote:

 On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki 
 kamezawa.hir...@jp.fujitsu.com wrote:
 
  On Thu, 7 Oct 2010 16:14:54 -0700
  Andrew Morton a...@linux-foundation.org wrote:
  
   On Thu, 7 Oct 2010 17:04:05 +0900
   KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
   
Now, at task migration among cgroup, memory cgroup scans page table and 
moving
account if flags are properly set.

The core code, mem_cgroup_move_charge_pte_range() does

pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();

for pte entries on a 3rd level? page table.

This pte_offset_map_lock seems a bit long. This patch modifies a 
rountine as

for 32 pages: pte_offset_map_lock()
  find_and_get a page
  record it
  pte_offset_map_unlock()
for all recorded pages
  isolate it from LRU.
  move charge
  putback to LRU
for all recorded pages
  put_page()
   
   The patch makes the code larger, more complex and slower!
   
  
  Slower ?
 
 Sure.  It walks the same data three times, potentially causing
 thrashing in the L1 cache.

Hmm, I'll make this 2 passes, at least.

 It takes and releases locks at a higher frequency.  It increases the text 
 size.
 

But I don't think page_table_lock is a lock which someone should hold for as long as
1. find_get_page
2. spin_lock(zone->lock)
3. remove it from LRU
4. lock_page_cgroup()
5. move charge (This means page 
6. putback to LRU
for 4096/8=512 pages.

I will try to make the routine smarter.

But I want to get rid of the page_table_lock -> lock_page_cgroup() nesting.

Thanks,
-Kame

