Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko
On Fri 08-02-13 17:29:18, Michal Hocko wrote:
[...]
> OK, I have checked the allocator slow path and you are right even
> GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
> OOM killed task blocked on down_write(mmap_sem) while the page fault
> handler holding mmap_sem for reading and allocating a new page without
> any progress.

Now that I think about it some more, this particular deadlock shouldn't
be possible: the allocation would fail because the allocator would see
TIF_MEMDIE (the OOM killer kills all threads that share the same mm).
There may be other locks that are dangerous, but I think the risk is
pretty low.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko
On Thu 07-02-13 20:27:00, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> >> 
> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> >> > [...]
> >> >> >> Just to be sure - am i supposed to apply this two patches?
> >> >> >> http://watchdog.sk/lkml/patches/
> >> >> >
> >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >> >> > mentioned in a follow up email. Here is the full patch:
> >> >> > ---
> >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> >> > From: Michal Hocko 
> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >> >
> >> >> > memcg oom killer might deadlock if the process which falls down to
> >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> >> > terminate because it is blocked on the very same lock.
> >> >> > This can happen when a write system call needs to allocate a page but
> >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> >> > have been reclaimed already) and the process selected by memcg OOM
> >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >> >
> >> >> > Process A
> >> >> > [] do_truncate+0x58/0xa0  # takes i_mutex
> >> >> > [] do_last+0x250/0xa30
> >> >> > [] path_openat+0xd7/0x440
> >> >> > [] do_filp_open+0x49/0xa0
> >> >> > [] do_sys_open+0x106/0x240
> >> >> > [] sys_open+0x20/0x30
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >> >
> >> >> > Process B
> >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
> >> >> > [] T.1146+0x5ab/0x5c0
> >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> > [] add_to_page_cache_locked+0x4c/0x140
> >> >> > [] add_to_page_cache_lru+0x22/0x50
> >> >> > [] grab_cache_page_write_begin+0x8b/0xe0
> >> >> > [] ext3_write_begin+0x88/0x270
> >> >> > [] generic_file_buffered_write+0x116/0x290
> >> >> > [] __generic_file_aio_write+0x27c/0x480
> >> >> > [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
> >> >> > [] do_sync_write+0xea/0x130
> >> >> > [] vfs_write+0xf3/0x1f0
> >> >> > [] sys_write+0x51/0x90
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >> 
> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> >> think that this deadlock is also possible in the page allocator even
> >> >> before getting to add_to_page_cache_lru.  no?
> >> >
> >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> >> > and it shouldn't be called from the pageout path so __page_cache_alloc
> >> > should be safe.
> >> 
> >> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> >> My concern is that __page_cache_alloc() will invoke the oom killer and
> >> select a victim which wants i_mutex.  This victim will deadlock because
> >> the oom killer caller already holds i_mutex.  
> >
> > That would be true for the memcg oom because that one is blocking but
> > the global oom just puts the allocator into sleep for a while and then
> > the allocator should back off eventually (unless this is NOFAIL
> > allocation). I would need to look closer whether this is really the case
> > - I haven't seen that allocator code path for a while...
> 
> I think the page allocator can loop forever waiting for an oom victim to
> terminate even without NOFAIL.  Especially if the oom victim wants a
> resource exclusively held by the allocating thread (e.g. i_mutex).  It
> looks like the same deadlock you describe is also possible (though more
> rare) without memcg.

OK, I have checked the allocator slow path and you are right even
GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
OOM killed task blocked on down_write(mmap_sem) while the page fault
handler holding mmap_sem for reading and allocating a new page without
any progress.
Luckily there are memory reserves that the allocator eventually falls
back to, so the allocation should be able to get some memory and
release the lock. There is still a theoretical chance that this would
block, but it sounds like a corner case so I wouldn't care about it
very much.
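
The dependency described above (the allocating task waits for the OOM
victim to exit, while the victim waits for a lock the allocating task
holds) is a cycle in a wait-for graph. A minimal userspace sketch in C,
not kernel code: the array encoding, NOT_WAITING and waits_in_cycle()
are names made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Marker for "this task is not blocked on anybody" (hypothetical). */
#define NOT_WAITING ((size_t)-1)

/*
 * waits_on[i] is the index of the task that task i is blocked on, or
 * NOT_WAITING. Returns true if following the chain from start leads
 * back to start, i.e. start can never make progress.
 */
bool waits_in_cycle(const size_t *waits_on, size_t n, size_t start)
{
	size_t cur = start;

	assert(waits_on && start < n);
	for (size_t step = 0; step < n; step++) {
		cur = waits_on[cur];
		if (cur == NOT_WAITING)
			return false;	/* chain ends, somebody can run */
		assert(cur < n);
		if (cur == start)
			return true;	/* came back to where we started */
	}
	return false;			/* start only feeds into another cycle */
}
```

With task 0 as the allocating task and task 1 as the OOM victim, the
array {1, 0} models the deadlock, and {1, NOT_WAITING} models the case
where the victim is free to exit.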

> If the looping thread is an eligible oom victim (i.e. not oom disabled,
> not an kernel thread, etc) then the page allocator can return NULL in so
> long as NOFAIL is not used.  So any allocator which is able to call the
> oom killer and is not oom disabled (kernel thread, etc) is already
> exposed to the possibility of page allocator failure.  So if the page
> allocator 

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-08 Thread Michal Hocko
On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote:
> (2013/02/07 20:01), Kamezawa Hiroyuki wrote:
[...]
> >Hmm. do we need to increase the "limit" virtually at memcg oom until
> >the oom-killed process dies ?
> 
> Here is my naive idea...

and the next step would be
http://en.wikipedia.org/wiki/Credit_default_swap :P

But seriously now. The idea is not bad at all. This implementation
would need some tweaks to work though (e.g. you would need to wake the
oom sleepers when you get a loan, because those are the ones which can
block the resource). We should also give the borrowed charges only to
those who would oom, to prevent stealing.
I think that it should be mem_cgroup_out_of_memory that establishes the
loan, and it can have a look at how much memory the killed task frees -
e.g. some portion of get_mm_rss(), or a more precise but much more
expensive traversal of the private vmas that checks whether they charged
memory from the target memcg hierarchy (this is a slow path anyway).

But who knows, maybe a fixed 2MB would work out as well.
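
To make the sizing discussion concrete, here is a minimal userspace
sketch in C of a loan bounded by both the victim's RSS and a fixed 2MB
cap. It is not kernel code: struct memcg_model, make_loan() and
uncharge() are hypothetical names, and taking half of the RSS as "some
portion" is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define LOAN_MAX_BYTES (2UL * 1024 * 1024)	/* the fixed 2MB cap */

struct memcg_model {		/* hypothetical stand-in for a memcg */
	size_t usage;		/* bytes currently charged */
	size_t loan;		/* bytes lent and not yet repaid */
};

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Lend headroom sized from the victim's RSS; no second loan while one is open. */
void make_loan(struct memcg_model *cg, size_t victim_rss_bytes)
{
	size_t amount;

	assert(cg);
	if (cg->loan)
		return;
	amount = min_size(LOAN_MAX_BYTES, victim_rss_bytes / 2);
	amount = min_size(amount, cg->usage);
	cg->loan = amount;
	cg->usage -= amount;	/* usage drops virtually, making room */
}

/* Uncharge val bytes; the loan is repaid first. Returns bytes really uncharged. */
size_t uncharge(struct memcg_model *cg, size_t val)
{
	size_t repay;

	assert(cg);
	repay = min_size(cg->loan, val);
	cg->loan -= repay;
	val -= repay;
	assert(val <= cg->usage);
	cg->usage -= val;
	return val;
}
```

For a victim with 1MB of RSS the loan is 512KB; when the victim later
uncharges its 1MB, the first 512KB only repays the loan and the rest
lowers the usage for real.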

Thanks!

> ==
> From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki 
> Date: Fri, 8 Feb 2013 10:43:52 +0900
> Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.
> 
> When an OOM happens, a task is killed and resources will be freed.
> 
> A problem here is that a task, which is oom-killed, may wait for
> some other resource in which memory resource is required. Some thread
> waits for free memory may holds some mutex and oom-killed process
> wait for the mutex.
> 
> To avoid this, relaxing charged memory by giving virtual resource
> can be a help. The system can get back it at uncharge().
> This is a sample native implementation.
> 
> Signed-off-by: KAMEZAWA Hiroyuki 
> ---
>  mm/memcontrol.c |   79 ++-
>  1 file changed, 73 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..4dea49a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -301,6 +301,9 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> +	/* extra resource at emergency situation */
> +	unsigned long	loan;
> +	spinlock_t	loan_lock;
>  	/* protect arrays of thresholds */
>  	struct mutex thresholds_lock;
>  
> @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }
> +/*
> + * When a memcg is in OOM situation, this lack of resource may cause deadlock
> + * because of complicated lock dependency(i_mutex...). To avoid that, we
> + * need extra resource or avoid charging.
> + *
> + * A memcg can request resource in an emergency state. We call it as loan.
> + * A memcg will return a loan when it does uncharge resource. We disallow
> + * double-loan and moving task to other groups until the loan is fully
> + * returned.
> + *
> + * Note: the problem here is that we cannot know what amount resouce should
> + * be necessary to exiting an emergency state.
> + */
> +#define LOAN_MAX (2 * 1024 * 1024)
> +
> +static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
> +{
> +	u64 usage;
> +	unsigned long amount;
> +
> +	amount = LOAN_MAX;
> +
> +	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	if (amount > usage / 2)
> +		amount = usage / 2;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		spin_unlock(&memcg->loan_lock);
> +		return;
> +	}
> +	memcg->loan = amount;
> +	res_counter_uncharge(&memcg->res, amount);
> +	if (do_swap_account)
> +		res_counter_uncharge(&memcg->memsw, amount);
> +	spin_unlock(&memcg->loan_lock);
> +}
> +
> +/* return amount of free resource which can be uncharged */
> +static unsigned long
> +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val)
> +{
> +	unsigned long tmp;
> +	/* we don't care small race here */
> +	if (unlikely(!memcg->loan))
> +		return val;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		tmp = min(memcg->loan, val);
> +		memcg->loan -= tmp;
> +		val -= tmp;
> +	}
> +	spin_unlock(&memcg->loan_lock);
> +	return val;
> +}
> +
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
>  	if (need_to_kill) {
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, mask, order);
> +		mem_cgroup_make_loan(memcg);
>  	} else {
>  		schedule();
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>   if

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Greg Thelen
On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 10:09:57, Greg Thelen wrote:
>> On Tue, Feb 05 2013, Michal Hocko wrote:
>> 
>> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
>> >> On Tue, Feb 05 2013, Michal Hocko wrote:
>> >> 
>> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
>> >> > [...]
>> >> >> Just to be sure - am i supposed to apply this two patches?
>> >> >> http://watchdog.sk/lkml/patches/
>> >> >
>> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>> >> > mentioned in a follow up email. Here is the full patch:
>> >> > ---
>> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
>> >> > From: Michal Hocko 
>> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
>> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>> >> >
>> >> > memcg oom killer might deadlock if the process which falls down to
>> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
>> >> > terminate because it is blocked on the very same lock.
>> >> > This can happen when a write system call needs to allocate a page but
>> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
>> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
>> >> > have been reclaimed already) and the process selected by memcg OOM
>> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
>> >> >
>> >> > Process A
>> >> > [] do_truncate+0x58/0xa0  # takes i_mutex
>> >> > [] do_last+0x250/0xa30
>> >> > [] path_openat+0xd7/0x440
>> >> > [] do_filp_open+0x49/0xa0
>> >> > [] do_sys_open+0x106/0x240
>> >> > [] sys_open+0x20/0x30
>> >> > [] system_call_fastpath+0x18/0x1d
>> >> > [] 0x
>> >> >
>> >> > Process B
>> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
>> >> > [] T.1146+0x5ab/0x5c0
>> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
>> >> > [] add_to_page_cache_locked+0x4c/0x140
>> >> > [] add_to_page_cache_lru+0x22/0x50
>> >> > [] grab_cache_page_write_begin+0x8b/0xe0
>> >> > [] ext3_write_begin+0x88/0x270
>> >> > [] generic_file_buffered_write+0x116/0x290
>> >> > [] __generic_file_aio_write+0x27c/0x480
>> >> > [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
>> >> > [] do_sync_write+0xea/0x130
>> >> > [] vfs_write+0xf3/0x1f0
>> >> > [] sys_write+0x51/0x90
>> >> > [] system_call_fastpath+0x18/0x1d
>> >> > [] 0x
>> >> 
>> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
>> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
>> >> think that this deadlock is also possible in the page allocator even
>> >> before getting to add_to_page_cache_lru.  no?
>> >
>> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
>> > and it shouldn't be called from the pageout path so __page_cache_alloc
>> > should be safe.
>> 
>> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
>> My concern is that __page_cache_alloc() will invoke the oom killer and
>> select a victim which wants i_mutex.  This victim will deadlock because
>> the oom killer caller already holds i_mutex.  
>
> That would be true for the memcg oom because that one is blocking but
> the global oom just puts the allocator into sleep for a while and then
> the allocator should back off eventually (unless this is NOFAIL
> allocation). I would need to look closer whether this is really the case
> - I haven't seen that allocator code path for a while...

I think the page allocator can loop forever waiting for an oom victim to
terminate even without NOFAIL.  Especially if the oom victim wants a
resource exclusively held by the allocating thread (e.g. i_mutex).  It
looks like the same deadlock you describe is also possible (though more
rare) without memcg.

If the looping thread is an eligible oom victim (i.e. not oom disabled,
not a kernel thread, etc.) then the page allocator can return NULL so
long as NOFAIL is not used.  So any allocator which is able to call the
oom killer and is not oom disabled (kernel thread, etc.) is already
exposed to the possibility of page allocator failure.  So if the page
allocator could detect the deadlock, then it could safely return NULL.
Maybe after looping N times without forward progress the page allocator
should consider failing unless NOFAIL is given.

Switching back to the memcg oom situation, can we similarly return NULL
if memcg oom kill has been tried a reasonable number of times?  Simply
failing the memcg charge with ENOMEM seems easier to support than
exceeding the limit (Kame's loan patch).
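
The "fail after N rounds without forward progress unless NOFAIL" idea
can be sketched in userspace C as below. This is only an illustration
under stated assumptions, not the page allocator or memcg charge path:
MAX_STALLED_ROUNDS, reclaim_round_fn, charge_with_bounded_retry() and
the two toy reclaim hooks are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tunable: consecutive no-progress rounds tolerated. */
#define MAX_STALLED_ROUNDS 16

/* Hypothetical hook: one reclaim/OOM-kill round, returns pages freed. */
typedef size_t (*reclaim_round_fn)(void *ctx);

/*
 * Try to charge nr_pages against limit. Returns true on success and
 * false (think -ENOMEM) after MAX_STALLED_ROUNDS rounds in a row that
 * freed nothing. With nofail set it keeps retrying, as NOFAIL implies.
 */
bool charge_with_bounded_retry(size_t *usage, size_t limit, size_t nr_pages,
			       bool nofail, reclaim_round_fn reclaim, void *ctx)
{
	unsigned int stalled = 0;

	assert(usage && reclaim);
	for (;;) {
		size_t freed;

		if (*usage <= limit && nr_pages <= limit - *usage) {
			*usage += nr_pages;
			return true;
		}
		freed = reclaim(ctx);
		if (freed > *usage)
			freed = *usage;
		*usage -= freed;
		if (freed)
			stalled = 0;	/* forward progress, start over */
		else if (!nofail && ++stalled >= MAX_STALLED_ROUNDS)
			return false;	/* give up instead of looping forever */
	}
}

/* Two toy reclaim hooks for trying the function out. */
size_t reclaim_nothing(void *ctx) { (void)ctx; return 0; }
size_t reclaim_one_page(void *ctx) { (void)ctx; return 1; }
```

With a group already at its limit, reclaim_nothing makes the charge fail
after 16 stalled rounds, while reclaim_one_page lets it succeed on the
second pass.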


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Kamezawa Hiroyuki

(2013/02/07 21:31), Michal Hocko wrote:

On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:

(2013/02/06 23:01), Michal Hocko wrote:

On Wed 06-02-13 02:17:21, azurIt wrote:

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow up email. Here is the full patch:



Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
http://www.watchdog.sk/lkml/oom_mysqld6


[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] warn_slowpath_common+0x7a/0xb0
  [] warn_slowpath_fmt+0x46/0x50
  [] ? mem_cgroup_margin+0x73/0xa0
  [] T.1149+0x2d9/0x610
  [] ? blk_finish_plug+0x18/0x50
  [] mem_cgroup_cache_charge+0xc4/0xf0
  [] add_to_page_cache_locked+0x4f/0x140
  [] add_to_page_cache_lru+0x22/0x50
  [] filemap_fault+0x252/0x4f0
  [] __do_fault+0x78/0x5a0
  [] handle_pte_fault+0x84/0x940
  [] ? vma_prio_tree_insert+0x30/0x50
  [] ? vma_link+0x88/0xe0
  [] handle_mm_fault+0x138/0x260
  [] do_page_fault+0x13d/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] dump_header+0x7e/0x1e0
  [] ? find_lock_task_mm+0x2f/0x70
  [] oom_kill_process+0x85/0x2a0
  [] out_of_memory+0xe5/0x200
  [] pagefault_out_of_memory+0xbd/0x110
  [] mm_fault_error+0xb6/0x1a0
  [] do_page_fault+0x3ee/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier.
However, the fs fault handler (filemap_fault here) can fallback to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get page to the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.



Hmm. Do we need to increase the "limit" virtually at memcg oom until
the oom-killed process dies? It may be doable by increasing stock->cache
of each cpu. I think the kernel can offer extra virtual charge up to the
oom-killed process's memory usage.


If we can guarantee that the overflow charges do not exceed the memory
usage of the killed process then this would work. The question is, how
do we find out how much we can overflow. immigrate_on_move will play
some role as well as the amount of the shared memory. I am afraid this
would get too complex. Nevertheless the idea is nice.


Yes, that's the problem. If we don't do it in the correct way, resource
usage underflow can happen. I guess we can count it per task_struct when
charging page-faulted anon pages.

_Or_, as another option, we could for example charge 1MB per thread
regardless of its memory usage, and use it as a security deposit at
OOM-killing. Implementation will be easy but explanation may be difficult..
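
One possible reading of the "1MB per thread" deposit, as a minimal
userspace sketch in C. It is an assumption-laden illustration, not
kernel code: struct deposit_cgroup, thread_attach() and
oom_release_deposit() are hypothetical names, and charging the deposit
at attach time is a guess at how the scheme could work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define THREAD_DEPOSIT_BYTES (1024UL * 1024)	/* 1MB per thread */

struct deposit_cgroup {		/* hypothetical stand-in for a memcg */
	size_t usage;		/* bytes charged, deposits included */
	size_t limit;		/* hard limit in bytes */
};

/* Charge the deposit when a thread joins; refuse if it does not fit. */
bool thread_attach(struct deposit_cgroup *cg)
{
	assert(cg && cg->usage <= cg->limit);
	if (cg->limit - cg->usage < THREAD_DEPOSIT_BYTES)
		return false;
	cg->usage += THREAD_DEPOSIT_BYTES;
	return true;
}

/* At OOM kill, release the victim's deposits; returns the freed headroom. */
size_t oom_release_deposit(struct deposit_cgroup *cg, size_t victim_threads)
{
	size_t freed = victim_threads * THREAD_DEPOSIT_BYTES;

	assert(cg && freed <= cg->usage);
	cg->usage -= freed;
	return freed;
}
```

The point of the deposit is that the headroom released at kill time is
known in advance (threads times 1MB), so nothing has to be estimated
from the victim's actual memory usage.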

Thanks,
-Kame









Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Kamezawa Hiroyuki

(2013/02/07 20:01), Kamezawa Hiroyuki wrote:

(2013/02/06 23:01), Michal Hocko wrote:

On Wed 06-02-13 02:17:21, azurIt wrote:

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow up email. Here is the full patch:



Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
http://www.watchdog.sk/lkml/oom_mysqld6


[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] warn_slowpath_common+0x7a/0xb0
  [] warn_slowpath_fmt+0x46/0x50
  [] ? mem_cgroup_margin+0x73/0xa0
  [] T.1149+0x2d9/0x610
  [] ? blk_finish_plug+0x18/0x50
  [] mem_cgroup_cache_charge+0xc4/0xf0
  [] add_to_page_cache_locked+0x4f/0x140
  [] add_to_page_cache_lru+0x22/0x50
  [] filemap_fault+0x252/0x4f0
  [] __do_fault+0x78/0x5a0
  [] handle_pte_fault+0x84/0x940
  [] ? vma_prio_tree_insert+0x30/0x50
  [] ? vma_link+0x88/0xe0
  [] handle_mm_fault+0x138/0x260
  [] do_page_fault+0x13d/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] dump_header+0x7e/0x1e0
  [] ? find_lock_task_mm+0x2f/0x70
  [] oom_kill_process+0x85/0x2a0
  [] out_of_memory+0xe5/0x200
  [] pagefault_out_of_memory+0xbd/0x110
  [] mm_fault_error+0xb6/0x1a0
  [] do_page_fault+0x3ee/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier.
However, the fs fault handler (filemap_fault here) can fallback to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get page to the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.



Hmm. do we need to increase the "limit" virtually at memcg oom until
the oom-killed process dies ?


Here is my naive idea...
==
From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki 
Date: Fri, 8 Feb 2013 10:43:52 +0900
Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.

When an OOM happens, a task is killed and resources will be freed.

A problem here is that a task, which is oom-killed, may wait for
some other resource in which memory resource is required. Some thread
waits for free memory may holds some mutex and oom-killed process
wait for the mutex.

To avoid this, relaxing charged memory by giving virtual resource
can be a help. The system can get back it at uncharge().
This is a sample native implementation.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 mm/memcontrol.c |   79 ++-
 1 file changed, 73 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..4dea49a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -301,6 +301,9 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
+	/* extra resource at emergency situation */
+	unsigned long	loan;
+	spinlock_t	loan_lock;
 	/* protect arrays of thresholds */
 	struct mutex thresholds_lock;
 
@@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	mem_cgroup_iter_break(root_memcg, victim);
 	return total;
 }
+/*
+ * When a memcg is in OOM situation, this lack of resource may cause deadlock
+ * because of complicated lock dependency(i_mutex...). To avoid that, we
+ * need extra resource or avoid charging.
+ *
+ * A memcg can request resource in an emergency state. We call it as loan.
+ * A memcg will return a loan when it does uncharge resource. We disallow
+ * double-loan and moving task to other groups until the loan is fully
+ * returned.
+ *
+ * Note: the problem here is that we cannot know what amount resouce should
+ * be necessary to exiting an emergency state.
+ */
+#define LOAN_MAX   (2 * 1024 * 1024)
+
+static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
+{
+	u64 usage;
+	unsigned long amount;
+
+	amount = LOAN_MAX;
+
+	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	if (amount > usage / 2)
+		amount = usage / 2;
+	spin_lock(&memcg->loan_lock);
+	if (memcg->loan) {
+		spin_unlock(&memcg->loan_lock);
+		return;
+	}
+

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Michal Hocko
On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
> (2013/02/06 23:01), Michal Hocko wrote:
> >On Wed 06-02-13 02:17:21, azurIt wrote:
> >>>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >>>mentioned in a follow up email. Here is the full patch:
> >>
> >>
> >>Here is the log where OOM, again, killed MySQL server [search for 
> >>"(mysqld)"]:
> >>http://www.watchdog.sk/lkml/oom_mysqld6
> >
> >[...]
> >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> >Hardware name: S5000VSA
> >gfp_mask:4304 nr_pages:1 oom:0 ret:2
> >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
> >Call Trace:
> >  [] warn_slowpath_common+0x7a/0xb0
> >  [] warn_slowpath_fmt+0x46/0x50
> >  [] ? mem_cgroup_margin+0x73/0xa0
> >  [] T.1149+0x2d9/0x610
> >  [] ? blk_finish_plug+0x18/0x50
> >  [] mem_cgroup_cache_charge+0xc4/0xf0
> >  [] add_to_page_cache_locked+0x4f/0x140
> >  [] add_to_page_cache_lru+0x22/0x50
> >  [] filemap_fault+0x252/0x4f0
> >  [] __do_fault+0x78/0x5a0
> >  [] handle_pte_fault+0x84/0x940
> >  [] ? vma_prio_tree_insert+0x30/0x50
> >  [] ? vma_link+0x88/0xe0
> >  [] handle_mm_fault+0x138/0x260
> >  [] do_page_fault+0x13d/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >---[ end trace 8817670349022007 ]---
> >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >apache2 cpuset=uid mems_allowed=0
> >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
> >Call Trace:
> >  [] dump_header+0x7e/0x1e0
> >  [] ? find_lock_task_mm+0x2f/0x70
> >  [] oom_kill_process+0x85/0x2a0
> >  [] out_of_memory+0xe5/0x200
> >  [] pagefault_out_of_memory+0xbd/0x110
> >  [] mm_fault_error+0xb6/0x1a0
> >  [] do_page_fault+0x3ee/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >
> >The first trace comes from the debugging WARN and it clearly points to
> >a file fault path. __do_fault pre-charges a page in case we need to
> >do CoW (copy-on-write) for the returned page. This one falls back to
> >memcg OOM and never returns ENOMEM as I have mentioned earlier.
> >However, the fs fault handler (filemap_fault here) can fallback to
> >page_cache_read if the readahead (do_sync_mmap_readahead) fails
> >to get page to the page cache. And we can see this happening in
> >the first trace. page_cache_read then calls add_to_page_cache_lru
> >and eventually gets to add_to_page_cache_locked which calls
> >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> >happen. This ENOMEM gets to the fault handler and kaboom.
> >
> 
> Hmm. Do we need to increase the "limit" virtually at memcg oom until
> the oom-killed process dies? It may be doable by increasing stock->cache
> of each cpu. I think the kernel can offer extra virtual charge up to the
> oom-killed process's memory usage.

If we can guarantee that the overflow charges do not exceed the memory
usage of the killed process then this would work. The question is how
we find out how much we can overflow: move_charge_at_immigrate will play
some role, as will the amount of shared memory. I am afraid this
would get too complex. Nevertheless the idea is nice.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-07 Thread Kamezawa Hiroyuki

(2013/02/06 23:01), Michal Hocko wrote:

On Wed 06-02-13 02:17:21, azurIt wrote:

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow up email. Here is the full patch:



Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
http://www.watchdog.sk/lkml/oom_mysqld6


[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] warn_slowpath_common+0x7a/0xb0
  [] warn_slowpath_fmt+0x46/0x50
  [] ? mem_cgroup_margin+0x73/0xa0
  [] T.1149+0x2d9/0x610
  [] ? blk_finish_plug+0x18/0x50
  [] mem_cgroup_cache_charge+0xc4/0xf0
  [] add_to_page_cache_locked+0x4f/0x140
  [] add_to_page_cache_lru+0x22/0x50
  [] filemap_fault+0x252/0x4f0
  [] __do_fault+0x78/0x5a0
  [] handle_pte_fault+0x84/0x940
  [] ? vma_prio_tree_insert+0x30/0x50
  [] ? vma_link+0x88/0xe0
  [] handle_mm_fault+0x138/0x260
  [] do_page_fault+0x13d/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
Call Trace:
  [] dump_header+0x7e/0x1e0
  [] ? find_lock_task_mm+0x2f/0x70
  [] oom_kill_process+0x85/0x2a0
  [] out_of_memory+0xe5/0x200
  [] pagefault_out_of_memory+0xbd/0x110
  [] mm_fault_error+0xb6/0x1a0
  [] do_page_fault+0x3ee/0x460
  [] ? do_mmap_pgoff+0x3dc/0x430
  [] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier.
However, the fs fault handler (filemap_fault here) can fallback to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get page to the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.



Hmm. Do we need to increase the "limit" virtually at memcg oom until
the oom-killed process dies? It may be doable by increasing stock->cache
of each cpu. I think the kernel can offer extra virtual charge up to the
oom-killed process's memory usage.

Thanks,
-Kame



Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-06 Thread Michal Hocko
On Wed 06-02-13 15:01:19, Michal Hocko wrote:
> On Wed 06-02-13 02:17:21, azurIt wrote:
> > >5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> > >mentioned in a follow up email. Here is the full patch:
> > 
> > 
> > Here is the log where OOM, again, killed MySQL server [search for 
> > "(mysqld)"]:
> > http://www.watchdog.sk/lkml/oom_mysqld6
> 
> [...]
> WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> Hardware name: S5000VSA
> gfp_mask:4304 nr_pages:1 oom:0 ret:2
> Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
> Call Trace:
>  [] warn_slowpath_common+0x7a/0xb0
>  [] warn_slowpath_fmt+0x46/0x50
>  [] ? mem_cgroup_margin+0x73/0xa0
>  [] T.1149+0x2d9/0x610
>  [] ? blk_finish_plug+0x18/0x50
>  [] mem_cgroup_cache_charge+0xc4/0xf0
>  [] add_to_page_cache_locked+0x4f/0x140
>  [] add_to_page_cache_lru+0x22/0x50
>  [] filemap_fault+0x252/0x4f0
>  [] __do_fault+0x78/0x5a0
>  [] handle_pte_fault+0x84/0x940
>  [] ? vma_prio_tree_insert+0x30/0x50
>  [] ? vma_link+0x88/0xe0
>  [] handle_mm_fault+0x138/0x260
>  [] do_page_fault+0x13d/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
> ---[ end trace 8817670349022007 ]---
> apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> apache2 cpuset=uid mems_allowed=0
> Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
> Call Trace:
>  [] dump_header+0x7e/0x1e0
>  [] ? find_lock_task_mm+0x2f/0x70
>  [] oom_kill_process+0x85/0x2a0
>  [] out_of_memory+0xe5/0x200
>  [] pagefault_out_of_memory+0xbd/0x110
>  [] mm_fault_error+0xb6/0x1a0
>  [] do_page_fault+0x3ee/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
> 
> The first trace comes from the debugging WARN and it clearly points to
> a file fault path. __do_fault pre-charges a page in case we need to
> do CoW (copy-on-write) for the returned page. This one falls back to
> memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> However, the fs fault handler (filemap_fault here) can fallback to
> page_cache_read if the readahead (do_sync_mmap_readahead) fails
> to get page to the page cache. And we can see this happening in
> the first trace. page_cache_read then calls add_to_page_cache_lru
> and eventually gets to add_to_page_cache_locked which calls
> mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> happen. This ENOMEM gets to the fault handler and kaboom.
> 
> So the fix is really much more complex than I thought. Although
> add_to_page_cache_locked sounded like a good place it turned out to be
> not in fact.
> 
> We need something more clever apparently. One way would be not misusing
> __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
> bits for those flags in gfp_t so there should be some room there. 
> Or we could do this per task flag, same we do for NO_IO in the current
> -mm tree.
> The latter one seems easier wrt. gfp_mask passing horror - e.g.
> __generic_file_aio_write doesn't pass flags and it can be called from
> unlocked contexts as well.

Ouch, PF_ flags space seems to be drained already because
task_struct::flags is just unsigned int so there is just one bit left. I
am not sure this is the best use for it. This will be a real pain!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-06 Thread Michal Hocko
On Wed 06-02-13 02:17:21, azurIt wrote:
> >5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >mentioned in a follow up email. Here is the full patch:
> 
> 
> Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> http://www.watchdog.sk/lkml/oom_mysqld6

[...]
WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
Hardware name: S5000VSA
gfp_mask:4304 nr_pages:1 oom:0 ret:2
Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
Call Trace:
 [] warn_slowpath_common+0x7a/0xb0
 [] warn_slowpath_fmt+0x46/0x50
 [] ? mem_cgroup_margin+0x73/0xa0
 [] T.1149+0x2d9/0x610
 [] ? blk_finish_plug+0x18/0x50
 [] mem_cgroup_cache_charge+0xc4/0xf0
 [] add_to_page_cache_locked+0x4f/0x140
 [] add_to_page_cache_lru+0x22/0x50
 [] filemap_fault+0x252/0x4f0
 [] __do_fault+0x78/0x5a0
 [] handle_pte_fault+0x84/0x940
 [] ? vma_prio_tree_insert+0x30/0x50
 [] ? vma_link+0x88/0xe0
 [] handle_mm_fault+0x138/0x260
 [] do_page_fault+0x13d/0x460
 [] ? do_mmap_pgoff+0x3dc/0x430
 [] page_fault+0x1f/0x30
---[ end trace 8817670349022007 ]---
apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
apache2 cpuset=uid mems_allowed=0
Pid: 3545, comm: apache2 Tainted: G        W    3.2.37-grsec #1
Call Trace:
 [] dump_header+0x7e/0x1e0
 [] ? find_lock_task_mm+0x2f/0x70
 [] oom_kill_process+0x85/0x2a0
 [] out_of_memory+0xe5/0x200
 [] pagefault_out_of_memory+0xbd/0x110
 [] mm_fault_error+0xb6/0x1a0
 [] do_page_fault+0x3ee/0x460
 [] ? do_mmap_pgoff+0x3dc/0x430
 [] page_fault+0x1f/0x30

The first trace comes from the debugging WARN and it clearly points to
a file fault path. __do_fault pre-charges a page in case we need to
do CoW (copy-on-write) for the returned page. This one falls back to
memcg OOM and never returns ENOMEM as I have mentioned earlier. 
However, the fs fault handler (filemap_fault here) can fall back to
page_cache_read if the readahead (do_sync_mmap_readahead) fails
to get the page into the page cache. And we can see this happening in
the first trace. page_cache_read then calls add_to_page_cache_lru
and eventually gets to add_to_page_cache_locked which calls
mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
happen. This ENOMEM gets to the fault handler and kaboom.
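
A rough userspace model of the path described above may help. It is only
a sketch: the demo_* names, the flag value and the counters are assumptions
made for illustration and are not kernel code. It shows a page cache charge
that is not allowed to invoke the memcg OOM killer returning -ENOMEM, and a
file fault handler turning that error into an OOM fault result.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the patch's GFP_MEMCG_NO_OOM; the value is made up. */
#define DEMO_GFP_MEMCG_NO_OOM 0x1u

enum demo_fault_result { DEMO_FAULT_OK, DEMO_FAULT_OOM };

/* Hypothetical, heavily simplified memcg state. */
struct demo_memcg {
	unsigned long usage;        /* pages currently charged */
	unsigned long limit;        /* hard limit in pages */
	bool memcg_oom_invoked;     /* set when the memcg OOM path is taken */
};

/* Model of a page cache charge of one page. */
int demo_cache_charge(struct demo_memcg *cg, unsigned int gfp)
{
	if (cg->usage < cg->limit) {
		cg->usage++;
		return 0;
	}
	/* Over the limit and nothing to reclaim in this model. */
	if (gfp & DEMO_GFP_MEMCG_NO_OOM)
		return -ENOMEM;         /* fail instead of invoking memcg OOM */
	cg->memcg_oom_invoked = true;   /* the blocking memcg OOM path */
	return 0;
}

/* Model of the file fault fallback that adds a page to the page cache. */
enum demo_fault_result demo_file_fault(struct demo_memcg *cg)
{
	int err = demo_cache_charge(cg, DEMO_GFP_MEMCG_NO_OOM);

	if (err == -ENOMEM)
		return DEMO_FAULT_OOM;  /* the caller then reports OOM */
	return DEMO_FAULT_OK;
}
```

With usage already at the limit, demo_file_fault() reports DEMO_FAULT_OOM
without the memcg OOM path ever being entered, which corresponds to the
pagefault_out_of_memory call in the second trace.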

So the fix is really much more complex than I thought. Although
add_to_page_cache_locked sounded like a good place, it turned out not
to be one.

We need something more clever, apparently. One way would be to stop misusing
__GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32
bits for those flags in gfp_t so there should be some room there.
Or we could do this with a per-task flag, the same way we do for NO_IO in
the current -mm tree.
The latter one seems easier wrt. the gfp_mask passing horror - e.g.
__generic_file_aio_write doesn't pass flags and it can be called from
unlocked contexts as well.

I have to think about it some more.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread azurIt
>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>mentioned in a follow up email. Here is the full patch:


Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
http://www.watchdog.sk/lkml/oom_mysqld6

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko
On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> 
> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> > [...]
> >> >> Just to be sure - am I supposed to apply these two patches?
> >> >> http://watchdog.sk/lkml/patches/
> >> >
> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >> > mentioned in a follow up email. Here is the full patch:
> >> > ---
> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> > From: Michal Hocko 
> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >
> >> > memcg oom killer might deadlock if the process which falls down to
> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> > terminate because it is blocked on the very same lock.
> >> > This can happen when a write system call needs to allocate a page but
> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> > have been reclaimed already) and the process selected by memcg OOM
> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >
> >> > Process A
> >> > [] do_truncate+0x58/0xa0   # takes i_mutex
> >> > [] do_last+0x250/0xa30
> >> > [] path_openat+0xd7/0x440
> >> > [] do_filp_open+0x49/0xa0
> >> > [] do_sys_open+0x106/0x240
> >> > [] sys_open+0x20/0x30
> >> > [] system_call_fastpath+0x18/0x1d
> >> > [] 0x
> >> >
> >> > Process B
> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
> >> > [] T.1146+0x5ab/0x5c0
> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
> >> > [] add_to_page_cache_locked+0x4c/0x140
> >> > [] add_to_page_cache_lru+0x22/0x50
> >> > [] grab_cache_page_write_begin+0x8b/0xe0
> >> > [] ext3_write_begin+0x88/0x270
> >> > [] generic_file_buffered_write+0x116/0x290
> >> > [] __generic_file_aio_write+0x27c/0x480
> >> > [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
> >> > [] do_sync_write+0xea/0x130
> >> > [] vfs_write+0xf3/0x1f0
> >> > [] sys_write+0x51/0x90
> >> > [] system_call_fastpath+0x18/0x1d
> >> > [] 0x
> >> 
> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> >> think that this deadlock is also possible in the page allocator even
> >> before getting to add_to_page_cache_lru.  no?
> >
> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> > and it shouldn't be called from the pageout path so __page_cache_alloc
> > should be safe.
> 
> I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
> My concern is that __page_cache_alloc() will invoke the oom killer and
> select a victim which wants i_mutex.  This victim will deadlock because
> the oom killer caller already holds i_mutex.  

That would be true for the memcg oom because that one is blocking but
the global oom just puts the allocator into sleep for a while and then
the allocator should back off eventually (unless this is NOFAIL
allocation). I would need to look closer whether this is really the case
- I haven't seen that allocator code path for a while...

> The wild accusation I am making is that anyone who invokes the oom
> killer and waits on the victim to die is essentially grabbing all of
> the locks that any of the oom killer victims may grab (e.g. i_mutex).

True.

> To avoid deadlock the oom killer can only be called while holding
> no locks that the oom victim demands.  I think some locks are grabbed
> in a way that allows the lock request to fail if the task has a fatal
> signal pending, so they are safe.  But any lock acquisitions that
> cannot fail (e.g. mutex_lock) will deadlock with the oom killing
> process.  So the oom killing process cannot hold any such locks which
> the victim will attempt to grab.  Hopefully I'm missing something.

Agreed.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Greg Thelen
On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 08:48:23, Greg Thelen wrote:
>> On Tue, Feb 05 2013, Michal Hocko wrote:
>> 
>> > On Tue 05-02-13 15:49:47, azurIt wrote:
>> > [...]
>> >> Just to be sure - am I supposed to apply these two patches?
>> >> http://watchdog.sk/lkml/patches/
>> >
>> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>> > mentioned in a follow up email. Here is the full patch:
>> > ---
>> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
>> > From: Michal Hocko 
>> > Date: Mon, 26 Nov 2012 11:47:57 +0100
>> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>> >
>> > memcg oom killer might deadlock if the process which falls down to
>> > mem_cgroup_handle_oom holds a lock which prevents other task to
>> > terminate because it is blocked on the very same lock.
>> > This can happen when a write system call needs to allocate a page but
>> > the allocation hits the memcg hard limit and there is nothing to reclaim
>> > (e.g. there is no swap or swap limit is hit as well and all cache pages
>> > have been reclaimed already) and the process selected by memcg OOM
>> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
>> >
>> > Process A
>> > [] do_truncate+0x58/0xa0 # takes i_mutex
>> > [] do_last+0x250/0xa30
>> > [] path_openat+0xd7/0x440
>> > [] do_filp_open+0x49/0xa0
>> > [] do_sys_open+0x106/0x240
>> > [] sys_open+0x20/0x30
>> > [] system_call_fastpath+0x18/0x1d
>> > [] 0x
>> >
>> > Process B
>> > [] mem_cgroup_handle_oom+0x241/0x3b0
>> > [] T.1146+0x5ab/0x5c0
>> > [] mem_cgroup_cache_charge+0xbe/0xe0
>> > [] add_to_page_cache_locked+0x4c/0x140
>> > [] add_to_page_cache_lru+0x22/0x50
>> > [] grab_cache_page_write_begin+0x8b/0xe0
>> > [] ext3_write_begin+0x88/0x270
>> > [] generic_file_buffered_write+0x116/0x290
>> > [] __generic_file_aio_write+0x27c/0x480
>> > [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
>> > [] do_sync_write+0xea/0x130
>> > [] vfs_write+0xf3/0x1f0
>> > [] sys_write+0x51/0x90
>> > [] system_call_fastpath+0x18/0x1d
>> > [] 0x
>> 
>> It looks like grab_cache_page_write_begin() passes __GFP_FS into
>> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
>> think that this deadlock is also possible in the page allocator even
>> before getting to add_to_page_cache_lru.  no?
>
> I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> and it shouldn't be called from the pageout path so __page_cache_alloc
> should be safe.

I wasn't clear, sorry.  My concern is not that pageout() grabs i_mutex.
My concern is that __page_cache_alloc() will invoke the oom killer and
select a victim which wants i_mutex.  This victim will deadlock because
the oom killer caller already holds i_mutex.  The wild accusation I am
making is that anyone who invokes the oom killer and waits on the victim
to die is essentially grabbing all of the locks that any of the oom
killer victims may grab (e.g. i_mutex).  To avoid deadlock the oom
killer can only be called while holding no locks that the oom victim
demands.  I think some locks are grabbed in a way that allows the lock
request to fail if the task has a fatal signal pending, so they are
safe.  But any lock acquisitions that cannot fail (e.g. mutex_lock)
will deadlock with the oom killing process.  So the oom killing process
cannot hold any such locks which the victim will attempt to grab.
Hopefully I'm missing something.
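
The dependency described above can be sketched as a tiny wait-for graph.
This is a hypothetical userspace model, not kernel code: the task names,
the waits_for array and both helper functions are assumptions made for
illustration. Task A is the OOM victim sleeping on i_mutex; task B holds
i_mutex and, if its charge may invoke the memcg OOM killer, waits for A
to exit.

```c
#include <stdbool.h>

/* Hypothetical model: two tasks and "waits for" edges between them. */
enum { TASK_A_TRUNCATE, TASK_B_WRITER, NR_TASKS };
#define WAITS_FOR_NOBODY (-1)

/*
 * Follow the waits_for edges from 'start' for at most nr_tasks steps and
 * report whether the walk leads back to 'start'.
 */
bool has_wait_cycle(const int waits_for[], int nr_tasks, int start)
{
	int cur = start;

	for (int steps = 0; steps < nr_tasks; steps++) {
		cur = waits_for[cur];
		if (cur == WAITS_FOR_NOBODY)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}

/* Build the graph for the truncate vs. write scenario from the report. */
bool memcg_oom_deadlocks(bool writer_may_oom)
{
	int waits_for[NR_TASKS];

	/* A (the OOM victim) sleeps in mutex_lock(i_mutex), held by B. */
	waits_for[TASK_A_TRUNCATE] = TASK_B_WRITER;

	/*
	 * B waits for A to die only if its charge may invoke the memcg OOM
	 * killer. Otherwise the charge fails and B can drop i_mutex.
	 */
	waits_for[TASK_B_WRITER] =
		writer_may_oom ? TASK_A_TRUNCATE : WAITS_FOR_NOBODY;

	return has_wait_cycle(waits_for, NR_TASKS, TASK_A_TRUNCATE);
}
```

In this model memcg_oom_deadlocks(true) finds the A -> B -> A cycle, while
memcg_oom_deadlocks(false) does not, because a failing charge lets the
writer release the lock.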


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko
On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
> 
> > On Tue 05-02-13 15:49:47, azurIt wrote:
> > [...]
> >> Just to be sure - am I supposed to apply these two patches?
> >> http://watchdog.sk/lkml/patches/
> >
> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> > mentioned in a follow up email. Here is the full patch:
> > ---
> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >
> > memcg oom killer might deadlock if the process which falls down to
> > mem_cgroup_handle_oom holds a lock which prevents other task to
> > terminate because it is blocked on the very same lock.
> > This can happen when a write system call needs to allocate a page but
> > the allocation hits the memcg hard limit and there is nothing to reclaim
> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> > have been reclaimed already) and the process selected by memcg OOM
> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >
> > Process A
> > [] do_truncate+0x58/0xa0  # takes i_mutex
> > [] do_last+0x250/0xa30
> > [] path_openat+0xd7/0x440
> > [] do_filp_open+0x49/0xa0
> > [] do_sys_open+0x106/0x240
> > [] sys_open+0x20/0x30
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> >
> > Process B
> > [] mem_cgroup_handle_oom+0x241/0x3b0
> > [] T.1146+0x5ab/0x5c0
> > [] mem_cgroup_cache_charge+0xbe/0xe0
> > [] add_to_page_cache_locked+0x4c/0x140
> > [] add_to_page_cache_lru+0x22/0x50
> > [] grab_cache_page_write_begin+0x8b/0xe0
> > [] ext3_write_begin+0x88/0x270
> > [] generic_file_buffered_write+0x116/0x290
> > [] __generic_file_aio_write+0x27c/0x480
> > [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
> > [] do_sync_write+0xea/0x130
> > [] vfs_write+0xf3/0x1f0
> > [] sys_write+0x51/0x90
> > [] system_call_fastpath+0x18/0x1d
> > [] 0x
> 
> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> __page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
> think that this deadlock is also possible in the page allocator even
> before getting to add_to_page_cache_lru.  no?

I am not that familiar with VFS but i_mutex is a high level lock AFAIR
and it shouldn't be called from the pageout path so __page_cache_alloc
should be safe.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Greg Thelen
On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 15:49:47, azurIt wrote:
> [...]
>> Just to be sure - am I supposed to apply these two patches?
>> http://watchdog.sk/lkml/patches/
>
> 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> mentioned in a follow up email. Here is the full patch:
> ---
> From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents other task to
> terminate because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
>
> Process A
> [] do_truncate+0x58/0xa0   # takes i_mutex
> [] do_last+0x250/0xa30
> [] path_openat+0xd7/0x440
> [] do_filp_open+0x49/0xa0
> [] do_sys_open+0x106/0x240
> [] sys_open+0x20/0x30
> [] system_call_fastpath+0x18/0x1d
> [] 0x
>
> Process B
> [] mem_cgroup_handle_oom+0x241/0x3b0
> [] T.1146+0x5ab/0x5c0
> [] mem_cgroup_cache_charge+0xbe/0xe0
> [] add_to_page_cache_locked+0x4c/0x140
> [] add_to_page_cache_lru+0x22/0x50
> [] grab_cache_page_write_begin+0x8b/0xe0
> [] ext3_write_begin+0x88/0x270
> [] generic_file_buffered_write+0x116/0x290
> [] __generic_file_aio_write+0x27c/0x480
> [] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
> [] do_sync_write+0xea/0x130
> [] vfs_write+0xf3/0x1f0
> [] sys_write+0x51/0x90
> [] system_call_fastpath+0x18/0x1d
> [] 0x

It looks like grab_cache_page_write_begin() passes __GFP_FS into
__page_cache_alloc() and mem_cgroup_cache_charge().  Which makes me
think that this deadlock is also possible in the page allocator even
before getting to add_to_page_cache_lru.  no?

Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the
page allocator?  If __GFP_FS was avoided, then I think memcg user page
charging would need a !__GFP_FS check to avoid invoking oom killer, but
at least then we'd avoid both deadlocks and cover both page allocation
and memcg page charging in similar fashion.

Example from memcg_charge_kmem:
may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);
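
The predicate above can be tried out in isolation with a small userspace
sketch. The DEMO_* bit values and the demo_may_oom name are assumptions
made for illustration; the real flag definitions live in
include/linux/gfp.h and their values differ.

```c
#include <stdbool.h>

/* Stand-in bit values, not the kernel's real __GFP_* definitions. */
#define DEMO_GFP_FS      0x1u
#define DEMO_GFP_NORETRY 0x2u

/*
 * Same shape as the may_oom expression: OOM is allowed only when the
 * caller may enter the filesystem and has not asked for "no retry".
 */
bool demo_may_oom(unsigned int gfp)
{
	return (gfp & DEMO_GFP_FS) && !(gfp & DEMO_GFP_NORETRY);
}
```

Only a mask with the FS bit set and the NORETRY bit clear allows the OOM
killer here; a caller holding fs locks would have to clear the FS bit to
opt out.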


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread azurIt
>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>mentioned in a follow up email.

Oh, it wasn't complete? I used it in my last test. Sorry, I'm a little confused
by all those patches. I will try it tonight and report back.


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko
On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> I have another old problem which is maybe also related to this. I
> wasn't connecting it with this before but now i'm not sure. Two of our
> servers, which are affected by this cgroup problem, are also randomly
> freezing completely (few times per month). These are the symptoms:
>  - servers are answering to ping
>  - it is possible to connect via SSH but the connection freezes after
>  sending the password
>  - it is possible to log in via console but it freezes after typing
>  the login
> These symptoms are very similar to HDD problems or HDD overload (but
> there is no overload for sure). The only way to fix it is, probably,
> hard rebooting the server (didn't find any other way). What do you
> think? Can this be related?

This is hard to tell without further information.

> Maybe the HDDs are locked in a similar way to the cgroups - we already
> found out that cgroup freezing is also related to HDD activity. Maybe
> there is a little chance that the whole HDD subsystem ends in
> deadlock?

"HDD subsystem" whatever that means cannot be blocked by memcg being
stuck. Access to certain files might be an issue because those
could have locks held but I do not see other relations.

I would start by checking the HW, trying to focus on reducing elements
that could contribute - aka try to nail down to the minimum set which
reproduces the issue. I cannot help you much with that I am afraid.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko
On Tue 05-02-13 15:49:47, azurIt wrote:
[...]
> Just to be sure - am I supposed to apply these two patches?
> http://watchdog.sk/lkml/patches/

5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
mentioned in a follow up email. Here is the full patch:
---
From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). A mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask, which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. Besides fixing the bug, no OOM from this path also makes some
sense as we really do not want to cause an OOM because of page cache
usage.
As a possibly visible result, add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg-specific flag because no user
accounted allocation uses this flag except for THP, which has memcg oom
disabled already.

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +
 mm/filemap.c               |    8 +++-
 mm/memcontrol.c            |   10 ++
 4 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32  __GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM   __GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLoc

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread azurIt
>Sorry, to get back to this that late but I was busy as hell since the
>beginning of the year.


Thank you for your time!


>Has the issue repeated since then?


Yes, it's happening all the time, but meanwhile I wrote a script which
monitors the problem and kills frozen processes when it occurs. But I
don't like it much; it's not a solution for me :( I also noticed that the
problem always affects the whole server, though not as much as the frozen
cgroup. Depending on the number of frozen processes, sometimes it has almost
no impact on the rest of the server, and sometimes the whole server lags a lot.

I have another old problem which may also be related to this. I wasn't
connecting it with this before but now I'm not sure. Two of our servers, which
are affected by this cgroup problem, also randomly freeze completely (a few
times per month). These are the symptoms:
 - the servers respond to ping
 - it is possible to connect via SSH but the connection freezes after sending
the password
 - it is possible to log in via console but it freezes after typing the login
These symptoms are very similar to HDD problems or HDD overload (but there is
definitely no overload). The only way to fix it is, probably, hard rebooting the
server (I didn't find any other way). What do you think? Can this be related?
Maybe the HDDs are locked in a similar way to the cgroups - we already found out
that cgroup freezing is also related to HDD activity. Maybe there is a small
chance that the whole HDD subsystem ends up in a deadlock?


>You said you didn't apply other than the above mentioned patch. Could
>you apply also debugging part of the patches I have sent?
>In case you don't have it handy then it should be this one:


Just to be sure - am I supposed to apply these two patches?
http://watchdog.sk/lkml/patches/


azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-02-05 Thread Michal Hocko
On Fri 25-01-13 17:31:30, Michal Hocko wrote:
> On Fri 25-01-13 16:07:23, azurIt wrote:
> > Any news? Thnx!
> 
> Sorry, but I didn't get to this one yet.

Sorry to get back to this so late, but I have been busy as hell since the
beginning of the year.

Has the issue repeated since then?

You said you didn't apply anything other than the above-mentioned patch. Could
you also apply the debugging part of the patches I have sent?
In case you don't have it handy, it should be this one:
---
From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 3 Dec 2012 16:16:01 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |6 +++---
 mm/memcontrol.c  |1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
return 0;
 nomem:
*ptr = NULL;
+   __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
return -ENOMEM;
 bypass:
*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-01-25 Thread Michal Hocko
On Fri 25-01-13 16:07:23, azurIt wrote:
> Any news? Thnx!

Sorry, but I didn't get to this one yet.

> 
> azur
> 
> 
> 
> __
> > From: "Michal Hocko"
> > To: azurIt
> > Date: 30.12.2012 12:08
> > Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from
> > add_to_page_cache_locked
> >
> > CC: linux-kernel@vger.kernel.org, linux...@kvack.org, "cgroups mailinglist",
> > "KAMEZAWA Hiroyuki", "Johannes Weiner"
> >On Sun 30-12-12 02:09:47, azurIt wrote:
> >> >which suggests that the patch is incomplete and that I am blind :/
> >> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >> >follow-up patch on top of the one you already have (which should catch
> >> >all the remaining cases).
> >> >Sorry about that...
> >> 
> >> 
> >> This was, again, killing my MySQL server (search for "(mysqld)"):
> >> http://www.watchdog.sk/lkml/oom_mysqld5
> >
> >grep "Kill process" oom_mysqld5 
> >Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: 
> >Kill process 5512 (apache2) score 716 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: 
> >Kill process 5517 (apache2) score 718 or sacrifice child
> >Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: 
> >Kill process 5513 (apache2) score 721 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: 
> >Kill process 5516 (apache2) score 726 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: 
> >Kill process 5520 (apache2) score 733 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 
> >1778 (mysqld) score 39 or sacrifice child
> >Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: 
> >Kill process 5519 (apache2) score 754 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: 
> >Kill process 5583 (apache2) score 762 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 
> >5506 (apache2) score 18 or sacrifice child
> >Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: 
> >Kill process 5523 (apache2) score 759 or sacrifice child
> >
> >So your mysqld has been killed by the global OOM not memcg. But why when
> >you seem to be perfectly fine regarding memory? I guess the following
> >backtrace is relevant:
> >Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 
> >1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
> >Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 
> >6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 
> >2523636kB
> >Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 
> >4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 
> >4*2048kB 1855*4096kB = 8134036kB
> >Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
> >Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
> >Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, 
> >delete 0, find 0/0
> >Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
> >Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
> >Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: 
> >gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid 
> >mems_allowed=0
> >Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not 
> >tainted 3.2.35-grsec #1
> >Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
> >Dec 30 01:53:36 server01 kernel: [  368.598396]  [] 
> >dump_header+0x7e/0x1e0
> >Dec 30 01:53:36 server01 kernel: [  368.598516]  [] ? 
> >find_lock_task_mm+0x2f/0x70
> >Dec 30 01:53:36 server01 kernel: [  368.598638]  [] 
> >oom_kill_process+0x85/0x2a0
> >Dec 30 01:53:36 server01 kernel: [  368.598759]  [] 
> >out_of_memory+0xe5/0x200
> >Dec 30 01:53:36 server01 kernel: [  368.598880]  [] 
> >pagefault_out_of_memory+0xbd/0x110
> >Dec 30 01:53:36 server01 kernel: [  368.

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2013-01-25 Thread azurIt
Any news? Thnx!

azur



__
> From: "Michal Hocko"
> To: azurIt
> Date: 30.12.2012 12:08
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from
> add_to_page_cache_locked
>
> CC: linux-kernel@vger.kernel.org, linux...@kvack.org, "cgroups mailinglist",
> "KAMEZAWA Hiroyuki", "Johannes Weiner"
>On Sun 30-12-12 02:09:47, azurIt wrote:
>> >which suggests that the patch is incomplete and that I am blind :/
>> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>> >follow-up patch on top of the one you already have (which should catch
>> >all the remaining cases).
>> >Sorry about that...
>> 
>> 
>> This was, again, killing my MySQL server (search for "(mysqld)"):
>> http://www.watchdog.sk/lkml/oom_mysqld5
>
>grep "Kill process" oom_mysqld5 
>Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: 
>Kill process 5512 (apache2) score 716 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: 
>Kill process 5517 (apache2) score 718 or sacrifice child
>Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: 
>Kill process 5513 (apache2) score 721 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: 
>Kill process 5516 (apache2) score 726 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: 
>Kill process 5520 (apache2) score 733 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 
>1778 (mysqld) score 39 or sacrifice child
>Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: 
>Kill process 5519 (apache2) score 754 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: 
>Kill process 5583 (apache2) score 762 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 
>5506 (apache2) score 18 or sacrifice child
>Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: 
>Kill process 5523 (apache2) score 759 or sacrifice child
>
>So your mysqld has been killed by the global OOM not memcg. But why when
>you seem to be perfectly fine regarding memory? I guess the following
>backtrace is relevant:
>Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 
>2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
>Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 
>6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
>Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 
>4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 
>1855*4096kB = 8134036kB
>Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
>Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
>Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, 
>delete 0, find 0/0
>Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
>Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
>Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: 
>gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid 
>mems_allowed=0
>Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not 
>tainted 3.2.35-grsec #1
>Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
>Dec 30 01:53:36 server01 kernel: [  368.598396]  [] 
>dump_header+0x7e/0x1e0
>Dec 30 01:53:36 server01 kernel: [  368.598516]  [] ? 
>find_lock_task_mm+0x2f/0x70
>Dec 30 01:53:36 server01 kernel: [  368.598638]  [] 
>oom_kill_process+0x85/0x2a0
>Dec 30 01:53:36 server01 kernel: [  368.598759]  [] 
>out_of_memory+0xe5/0x200
>Dec 30 01:53:36 server01 kernel: [  368.598880]  [] 
>pagefault_out_of_memory+0xbd/0x110
>Dec 30 01:53:36 server01 kernel: [  368.599006]  [] 
>mm_fault_error+0xb6/0x1a0
>Dec 30 01:53:36 server01 kernel: [  368.599127]  [] 
>do_page_fault+0x3ee/0x460
>Dec 30 01:53:36 server01 kernel: [  368.599250]  [] ? 
>mntput+0x1f/0x30
>Dec 30 01:53:36 server01 kernel: [  368.599371]  [] ? 
>fput+0x156/0x200
>Dec 30 01:53:36 server01 kernel: [  368.599496]  [] 
>page_fault+0x1f/0x30
>
>This would suggest that an unexpected ENOMEM leaked during page fault
>path. I do not see which one could that be because you said THP
>

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-30 Thread Michal Hocko
On Sun 30-12-12 02:09:47, azurIt wrote:
> >which suggests that the patch is incomplete and that I am blind :/
> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >follow-up patch on top of the one you already have (which should catch
> >all the remaining cases).
> >Sorry about that...
> 
> 
> This was, again, killing my MySQL server (search for "(mysqld)"):
> http://www.watchdog.sk/lkml/oom_mysqld5

grep "Kill process" oom_mysqld5 
Dec 30 01:53:34 server01 kernel: [  367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
Dec 30 01:53:35 server01 kernel: [  367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
Dec 30 01:53:36 server01 kernel: [  369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
Dec 30 01:53:37 server01 kernel: [  369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child
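
The two message prefixes in this output are what separate memcg kills from global ones. A minimal sketch of that classification, assuming each log record is available as one unwrapped line; the enum and function names are hypothetical and not part of any kernel or tool:

```c
#include <string.h>

/* Hypothetical classification of an OOM "Kill process" log line. */
enum sketch_oom_kind {
	SKETCH_OOM_NONE,   /* not a kill message */
	SKETCH_OOM_MEMCG,  /* "Memory cgroup out of memory: Kill process ..." */
	SKETCH_OOM_GLOBAL  /* "Out of memory: Kill process ..." */
};

static enum sketch_oom_kind sketch_classify(const char *line)
{
	/* Check the memcg prefix first, then the plain global prefix. */
	if (strstr(line, "Memory cgroup out of memory: Kill process"))
		return SKETCH_OOM_MEMCG;
	if (strstr(line, "Out of memory: Kill process"))
		return SKETCH_OOM_GLOBAL;
	return SKETCH_OOM_NONE;
}
```

Fed the lines above, only the mysqld (pid 1778) and apache2 (pid 5506) records would be classified as global kills.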

So your mysqld has been killed by the global OOM not memcg. But why when
you seem to be perfectly fine regarding memory? I guess the following
backtrace is relevant:
Dec 30 01:53:36 server01 kernel: [  368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
Dec 30 01:53:36 server01 kernel: [  368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
Dec 30 01:53:36 server01 kernel: [  368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
Dec 30 01:53:36 server01 kernel: [  368.571906] 308964 total pagecache pages
Dec 30 01:53:36 server01 kernel: [  368.572023] 0 pages in swap cache
Dec 30 01:53:36 server01 kernel: [  368.572140] Swap cache stats: add 0, delete 0, find 0/0
Dec 30 01:53:36 server01 kernel: [  368.572260] Free swap  = 0kB
Dec 30 01:53:36 server01 kernel: [  368.572375] Total swap = 0kB
Dec 30 01:53:36 server01 kernel: [  368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 30 01:53:36 server01 kernel: [  368.598034] apache2 cpuset=uid mems_allowed=0
Dec 30 01:53:36 server01 kernel: [  368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
Dec 30 01:53:36 server01 kernel: [  368.598273] Call Trace:
Dec 30 01:53:36 server01 kernel: [  368.598396]  [] dump_header+0x7e/0x1e0
Dec 30 01:53:36 server01 kernel: [  368.598516]  [] ? find_lock_task_mm+0x2f/0x70
Dec 30 01:53:36 server01 kernel: [  368.598638]  [] oom_kill_process+0x85/0x2a0
Dec 30 01:53:36 server01 kernel: [  368.598759]  [] out_of_memory+0xe5/0x200
Dec 30 01:53:36 server01 kernel: [  368.598880]  [] pagefault_out_of_memory+0xbd/0x110
Dec 30 01:53:36 server01 kernel: [  368.599006]  [] mm_fault_error+0xb6/0x1a0
Dec 30 01:53:36 server01 kernel: [  368.599127]  [] do_page_fault+0x3ee/0x460
Dec 30 01:53:36 server01 kernel: [  368.599250]  [] ? mntput+0x1f/0x30
Dec 30 01:53:36 server01 kernel: [  368.599371]  [] ? fput+0x156/0x200
Dec 30 01:53:36 server01 kernel: [  368.599496]  [] page_fault+0x1f/0x30
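
The per-zone free lists in this report can be checked by hand: each entry is a count of free blocks of a given size, from 4kB up to 4096kB. A minimal sketch of that arithmetic, assuming 11 size classes that double at each step as printed above; the function name is hypothetical:

```c
#include <stddef.h>

/*
 * Sum a free list as printed in the OOM report: counts[i] free blocks
 * of (4 << i) kB each, i.e. 4kB, 8kB, ..., 4096kB for i = 0..10.
 */
static unsigned long sketch_free_kb(const unsigned long *counts, size_t n_orders)
{
	unsigned long total = 0;

	for (size_t order = 0; order < n_orders; order++)
		total += counts[order] * (4UL << order);
	return total;
}
```

With the counts from the DMA, DMA32 and Normal lines this gives 15912kB, 2523636kB and 8134036kB, matching the printed totals. That is roughly 10GB of free memory at the time the global OOM killer was invoked.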

This would suggest that an unexpected ENOMEM leaked during the page fault
path. I do not see which one that could be, because you said THP
(CONFIG_TRANSPARENT_HUGEPAGE) is disabled (and the other patch I have
mentioned in the thread should fix that issue - btw. the patch is
already scheduled for the stable tree).
__do_fault, do_anonymous_page and do_wp_page call
mem_cgroup_newpage_charge with GFP_KERNEL, which means that
we do memcg OOM and never return ENOMEM. do_swap_page calls
mem_cgroup_try_charge_swapin with GFP_KERNEL as well.

I might have missed something but I will not get to look closer before
2nd January.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-29 Thread azurIt
>which suggests that the patch is incomplete and that I am blind :/
>mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>follow-up patch on top of the one you already have (which should catch
>all the remaining cases).
>Sorry about that...


This was, again, killing my MySQL server (search for "(mysqld)"):
http://www.watchdog.sk/lkml/oom_mysqld5


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-28 Thread Michal Hocko
On Mon 24-12-12 14:38:50, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometimes early next year when
> >all the things settle a bit.
> 
> 
> Btw, I noticed one more thing when the problem is happening (= when any
> cgroup is stuck); I forgot to mention it before, sorry :( . It's
> related to the HDDs: something is slowing them down in a strange way. All
> services are working normally and I really cannot notice any slowness;
> the only thing I noticed being affected is our backup software (
> www.Bacula.org ). When the problem occurs at night, while the
> backup is running, the backup is extremely slow and usually doesn't finish
> until I kill the processes inside the affected cgroup (= until I resolve
> the problem). The backup software is NOT using much HDD bandwidth, BUT
> it's doing quite a huge number of disk operations (it needs to stat every
> file and directory). I believe that only the speed of disk operations is
> affected, and they are very slow.

I would bet that this is caused by the blocked processes in the memcg oom
handler which hold i_mutex, while the backup process wants to access the
same inode with an operation which requires the lock.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-28 Thread Michal Hocko
On Mon 24-12-12 14:25:26, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometimes early next year when
> >all the things settle a bit.
> 
> 
> Michal, the problem, unfortunately, happened again :( twice. When it
> happened the first time (two days ago) I didn't want to believe it, so I
> recompiled the kernel and booted it again to be sure I really used your
> patch. Today it happened again, here is the report:
> http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Hmm, 1356352982/1507/stack says
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1147+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4f/0x140
[] add_to_page_cache_lru+0x22/0x50
[] find_or_create_page+0x73/0xb0
[] __getblk+0xea/0x2c0
[] ext3_getblk+0xeb/0x240
[] ext3_bread+0x19/0x90
[] ext3_dx_find_entry+0x83/0x1e0
[] ext3_find_entry+0x2e4/0x480
[] ext3_lookup+0x4d/0x120
[] d_alloc_and_lookup+0x45/0x90
[] do_lookup+0x278/0x390
[] path_lookupat+0xae/0x7e0
[] do_path_lookup+0x35/0xe0
[] user_path_at_empty+0x59/0xb0
[] user_path_at+0x11/0x20
[] vfs_fstatat+0x47/0x80
[] vfs_lstat+0x1e/0x20
[] sys_newlstat+0x24/0x50
[] system_call_fastpath+0x18/0x1d
[] 0x

which suggests that the patch is incomplete and that I am blind :/
mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
follow-up patch on top of the one you already have (which should catch
all the remaining cases).
Sorry about that...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89997ac..559a54d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
 {
+   bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
struct mem_cgroup *memcg = NULL;
int ret;
 
@@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
mm = &init_mm;
 
if (page_is_file_cache(page)) {
-   ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+   ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
if (ret || !memcg)
return ret;
 
@@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 struct page *page,
 gfp_t mask, struct mem_cgroup **ptr)
 {
+   bool oom = !(mask & GFP_MEMCG_NO_OOM);
struct mem_cgroup *memcg;
int ret;
 
@@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
if (!memcg)
goto charge_cur_mm;
*ptr = memcg;
-   ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+   ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
css_put(&memcg->css);
return ret;
 charge_cur_mm:
if (unlikely(!mm))
mm = &init_mm;
-   return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+   return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-24 Thread azurIt
>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometimes early next year when
>all the things settle a bit.


Btw, I noticed one more thing when the problem is happening (= when any cgroup
is stuck); I forgot to mention it before, sorry :( . It's related to the HDDs:
something is slowing them down in a strange way. All services are working
normally and I really cannot notice any slowness; the only thing I noticed
being affected is our backup software ( www.Bacula.org ). When the problem
occurs at night, while the backup is running, the backup is extremely slow and
usually doesn't finish until I kill the processes inside the affected cgroup
(= until I resolve the problem). The backup software is NOT using much HDD
bandwidth, BUT it's doing quite a huge number of disk operations (it needs to
stat every file and directory). I believe that only the speed of disk
operations is affected, and they are very slow.

Merry christmas!


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-24 Thread azurIt
>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometimes early next year when
>all the things settle a bit.


Michal, the problem, unfortunately, happened again :( twice. When it happened
the first time (two days ago) I didn't want to believe it, so I recompiled the
kernel and booted it again to be sure I really used your patch. Today it
happened again, here is the report:
http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Here is the patch which I used (kernel 3.2.35, I didn't use any of your other
patches):
http://watchdog.sk/lkml/5-memcg-fix.patch

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-18 Thread Michal Hocko
On Tue 18-12-12 15:22:23, azurIt wrote:
> >It should mitigate the problem. The real fix shouldn't be that specific
> >(as per discussion in other thread). The chance this will get upstream
> >is not big and that means that it will not get to the stable tree
> >either.
> 
> 
> OOM is no longer killing processes outside target cgroups, so
> everything looks fine so far. Will report back when I have more
> info. Thanks!

OK, good to hear and fingers crossed. I will try to get back to the
original problem and a better solution sometime early next year when
things settle a bit.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-18 Thread azurIt
>It should mitigate the problem. The real fix shouldn't be that specific
>(as per discussion in other thread). The chance this will get upstream
>is not big and that means that it will not get to the stable tree
>either.


OOM is no longer killing processes outside target cgroups, so everything looks
fine so far. Will report back when I have more info. Thanks!

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-17 Thread Michal Hocko
On Mon 17-12-12 19:23:01, azurIt wrote:
> >[Ohh, I am really an idiot. I screwed the first patch]
> >-   bool oom = true;
> >+   bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
> >
> >Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM).
> >  No idea how I could have missed that. I am really sorry about that.
> 
> 
> :D no problem :) so, now it should really work as expected and
> completely fix my original problem?

It should mitigate the problem. The real fix shouldn't be that specific
(as per discussion in other thread). The chance this will get upstream
is not big and that means that it will not get to the stable tree
either.

> is it safe to apply it on 3.2.35?

I didn't check what the differences are, but I do not think there is
anything that would conflict with it.

> Thank you very much!

HTH

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-17 Thread azurIt
>[Ohh, I am really an idiot. I screwed the first patch]
>-   bool oom = true;
>+   bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
>
>Which obviously doesn't work. It should read !(gfp_mask &GFP_MEMCG_NO_OOM).
>  No idea how I could have missed that. I am really sorry about that.


:D no problem :) so, now it should really work as expected and completely fix 
my original problem? is it safe to apply it on 3.2.35? Thank you very much!

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-17 Thread Michal Hocko
On Mon 17-12-12 02:34:30, azurIt wrote:
> >I would try to limit changes to minimum. So the original kernel you were
> >using + the first patch to prevent OOM from the write path + 2 debugging
> >patches.
> 
> 
> It didn't take down the whole system this time (but I was
> prepared to record a video of the console ;) ), here it is:
> http://www.watchdog.sk/lkml/oom_mysqld4

[...]
[ 1248.059429] ------------[ cut here ]------------
[ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610()
[ 1248.059723] Hardware name: S5000VSA
[ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2

This is GFP_KERNEL allocation which is expected. It is also a simple
page which is not that expected because we shouldn't return ENOMEM on
those unless this was GFP_ATOMIC allocation (which it wasn't) or the
caller told us to not trigger OOM which is the case only for THP pages
(see mem_cgroup_charge_common). So the big question is how have we ended
up with oom=false here...
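
The `gfp_mask:208` in the warning can be decoded by hand. A minimal sketch, assuming the flag bit values from 3.2-era include/linux/gfp.h (copied here under hypothetical `SKETCH_` names, so treat the numbers as an assumption to verify against your tree):

```c
#include <stdbool.h>

/* Assumed 3.2-era bit values; names are hypothetical stand-ins. */
#define SKETCH_GFP_WAIT   0x10u  /* allocation may sleep */
#define SKETCH_GFP_HIGH   0x20u  /* may use emergency pools */
#define SKETCH_GFP_IO     0x40u  /* may start physical I/O */
#define SKETCH_GFP_FS     0x80u  /* may call into the filesystem */
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)
#define SKETCH_GFP_ATOMIC (SKETCH_GFP_HIGH)

/* True if the mask is exactly the GFP_KERNEL combination. */
static bool sketch_is_gfp_kernel(unsigned int gfp_mask)
{
	return gfp_mask == SKETCH_GFP_KERNEL;
}

/* True if the allocation is allowed to sleep (so not atomic). */
static bool sketch_may_sleep(unsigned int gfp_mask)
{
	return (gfp_mask & SKETCH_GFP_WAIT) != 0;
}
```

Under these assumptions 208 is 0xd0, which is the GFP_KERNEL combination and a sleeping allocation, consistent with the reading above that this was not GFP_ATOMIC.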

[Ohh, I am really an idiot. I screwed the first patch]
-   bool oom = true;
+   bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);

Which obviously doesn't work: ORing in a non-zero flag always gives a
non-zero value, so oom was always false. It should read
!(gfp_mask & GFP_MEMCG_NO_OOM). No idea how I could have missed that.
I am really sorry about that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c04676d..1f35a74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
struct mem_cgroup *memcg = NULL;
unsigned int nr_pages = 1;
struct page_cgroup *pc;
-   bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
+   bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
int ret;
 
if (PageTransHuge(page)) {
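
The difference between the two expressions is easy to reproduce in user space. A minimal sketch, assuming an arbitrary non-zero bit for the flag; the real value of GFP_MEMCG_NO_OOM comes from the earlier patch and is not shown in this thread, so `SKETCH_NO_OOM` is purely hypothetical:

```c
#include <stdbool.h>

/* Hypothetical stand-in for GFP_MEMCG_NO_OOM; any non-zero bit behaves the same. */
#define SKETCH_NO_OOM 0x1000000u

/* Buggy variant: the OR is never zero, so this is false for every mask. */
static bool sketch_oom_buggy(unsigned int gfp_mask)
{
	return !(gfp_mask | SKETCH_NO_OOM);
}

/* Fixed variant: false only when the caller actually set the flag. */
static bool sketch_oom_fixed(unsigned int gfp_mask)
{
	return !(gfp_mask & SKETCH_NO_OOM);
}
```

With a plain mask of 208 the buggy variant yields false, which matches the `oom:0` seen in the warning, while the fixed variant yields true and only turns false once the flag is set.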
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-16 Thread azurIt
>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.


It didn't take down the whole system this time (but I was prepared to record a
video of the console ;) ), here it is:
http://www.watchdog.sk/lkml/oom_mysqld4


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-10 Thread azurIt
>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.


ok.


>But was it at least related to the debugging from the patch or it was
>rather a totally unrelated thing?


I wasn't reading it closely, but I think it looks like the traces I was sending
you before.


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-10 Thread Michal Hocko
On Mon 10-12-12 11:18:17, azurIt wrote:
> >Hmm, this is _really_ surprising. The latest patch didn't add any new
> >logging actually. It just enhanced messages which were already printed
> >out previously + changed a few functions to be not inlined so they show up
> >in the traces. So the only explanation is that the workload has changed
> >or the patches got misapplied.
> 
> 
> This time i installed 3.2.35, maybe some changes between .34 and .35
> did this? Should i try .34?

I would try to limit changes to a minimum. So the original kernel you were
using + the first patch to prevent OOM from the write path + 2 debugging
patches.
 
> >> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 
> >> 141.105.120.152: bruteforce prevention initiated for the next 30 minutes 
> >> or until service restarted, stalling each fork 30 seconds.  Please 
> >> investigate the crash report for 
> >> /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 
> >> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] 
> >> uid/euid:0/0 gid/egid:0/0
> >
> >This explains why you have seen your machine hung. I am not familiar
> >with grsec but stalling each fork 30s sounds really bad.
> 
> 
> Btw, I have never seen such a message from grsecurity before. Will write to
> the grsec mailing list for an explanation.
> 
> 
> >Anyway this will not help me much. Do you happen to still have any of
> >those logged traces from the last run?
> 
> 
> Unfortunately not, it didn't log anything and tons of messages were
> printed only to the console (I was logged in via IP-KVM). It looked like
> the printing was infinite; I rebooted it after a few minutes.

But was it at least related to the debugging from the patch or it was
rather a totally unrelated thing?

> >Apart from that. If my current understanding is correct then this is
> >related to transparent huge pages (and leaking charge to the page fault
> >handler). Do you see the same problem if you disable THP before you
> >start your workload? (echo never > 
> >/sys/kernel/mm/transparent_hugepage/enabled)
> 
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

Weee. Then it cannot be related to THP at all. Which makes this an even
bigger mystery.
We really need to find out who is leaking that charge.
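
The same check can be done programmatically: if the sysfs control file is absent, the running kernel does not expose THP at all, which is the inference drawn above. A minimal sketch using POSIX `access()`; the helper names are hypothetical and only the path comes from this thread:

```c
#include <stdbool.h>
#include <unistd.h>

/* Path quoted earlier in the thread. */
static const char *const SKETCH_THP_ENABLED =
	"/sys/kernel/mm/transparent_hugepage/enabled";

/* True if the given path exists (F_OK tests for existence only). */
static bool sketch_path_exists(const char *path)
{
	return access(path, F_OK) == 0;
}

/* True if the running kernel exposes the THP control file. */
static bool sketch_thp_exposed(void)
{
	return sketch_path_exists(SKETCH_THP_ENABLED);
}
```

On the reporter's machine `sketch_thp_exposed()` would return false, matching the "No such file or directory" from cat.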

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-10 Thread azurIt
>Hmm, this is _really_ surprising. The latest patch didn't add any new
>logging actually. It just enhanced messages which were already printed
>out previously + changed a few functions to be not inlined so they show up
>in the traces. So the only explanation is that the workload has changed
>or the patches got misapplied.


This time i installed 3.2.35, maybe some changes between .34 and .35 did this? 
Should i try .34?


>> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: 
>> bruteforce prevention initiated for the next 30 minutes or until service 
>> restarted, stalling each fork 30 seconds.  Please investigate the crash 
>> report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 
>> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] 
>> uid/euid:0/0 gid/egid:0/0
>
>This explains why you have seen your machine hung. I am not familiar
>with grsec but stalling each fork 30s sounds really bad.


Btw, i never ever saw such a message from grsecurity yet. Will write to grsec 
mailing list about explanation.


>Anyway this will not help me much. Do you happen to still have any of
>those logged traces from the last run?


Unfortunately not, it didn't log anything and tons of messages were printed 
only to the console (i was logged in via IP-KVM). It looked like the printing 
would never end, so i rebooted it after a few minutes.


>Apart from that. If my current understanding is correct then this is
>related to transparent huge pages (and leaking charge to the page fault
>handler). Do you see the same problem if you disable THP before you
>start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

# ls -la /sys/kernel/mm 
total 0
drwx-- 3 root root 0 Dec 10 11:11 .
drwx-- 5 root root 0 Dec 10 02:06 ..
drwx-- 2 root root 0 Dec 10 11:11 cleancache


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-10 Thread Michal Hocko
On Mon 10-12-12 02:20:38, azurIt wrote:
[...]
> Michal,

Hi,
 
> this was printing so many debug messages to console that the whole
> server hangs

Hmm, this is _really_ surprising. The latest patch didn't add any new
logging actually. It just enhanced messages which were already printed
out previously + changed a few functions to be not inlined so they show up
in the traces. So the only explanation is that the workload has changed
or the patches got misapplied.

> and i had to hard reset it after several minutes :( Sorry
> but i cannot test such things in production. There's no problem with
> one soft reset which takes 4 minutes but this hard reset creates about
> a 20-minute outage (mainly cos of disk quotas checking).

Understood.

> Last logged message:
> 
> Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: 
> bruteforce prevention initiated for the next 30 minutes or until service 
> restarted, stalling each fork 30 seconds.  Please investigate the crash 
> report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 
> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] 
> uid/euid:0/0 gid/egid:0/0

This explains why you have seen your machine hung. I am not familiar
with grsec but stalling each fork 30s sounds really bad.

Anyway this will not help me much. Do you happen to still have any of
those logged traces from the last run?

Apart from that. If my current understanding is correct then this is
related to transparent huge pages (and leaking charge to the page fault
handler). Do you see the same problem if you disable THP before you
start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-09 Thread azurIt
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).


Michal,

this was printing so many debug messages to the console that the whole server
hung and i had to hard reset it after several minutes :( Sorry but i cannot test
such things in production. There's no problem with one soft reset which takes
4 minutes but this hard reset creates about a 20-minute outage (mainly cos of
disk quotas checking). Last logged message:

Dec 10 02:03:29 server01 kernel: [  220.366486] grsec: From 141.105.120.152: 
bruteforce prevention initiated for the next 30 minutes or until service 
restarted, stalling each fork 30 seconds.  Please investigate the crash report 
for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 
gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] 
uid/euid:0/0 gid/egid:0/0


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-06 Thread Michal Hocko
On Thu 06-12-12 11:12:49, azurIt wrote:
> >Dohh. The very same stack mem_cgroup_newpage_charge called from the page
> >fault. The heavy inlining is not particularly helping here... So there
> >must be some other THP charge leaking out.
> >[/me is diving into the code again]
> >
> >* do_huge_pmd_anonymous_page falls back to handle_pte_fault
> >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
> >  charge the huge page
> >* do_huge_pmd_wp_page splits the huge page and retries with fallback to
> >  handle_pte_fault
> >* collapse_huge_page is not called in the page fault path
> >* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
> >  so the memcg charging cannot return ENOMEM
> >
> >There are no other callers AFAICS so I am getting clueless. Maybe more
> >debugging will tell us something (the inlining has been reduced for thp
> >paths which can reduce performance in thp page fault heavy workloads but
> >this will give us better traces - I hope).
> 
> 
> Should i apply all patches together? (fix for this bug, more log
> messages, backported fix from 3.5 and this new one)

Yes please
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-06 Thread azurIt
>Dohh. The very same stack mem_cgroup_newpage_charge called from the page
>fault. The heavy inlining is not particularly helping here... So there
>must be some other THP charge leaking out.
>[/me is diving into the code again]
>
>* do_huge_pmd_anonymous_page falls back to handle_pte_fault
>* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
>  charge the huge page
>* do_huge_pmd_wp_page splits the huge page and retries with fallback to
>  handle_pte_fault
>* collapse_huge_page is not called in the page fault path
>* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
>  so the memcg charging cannot return ENOMEM
>
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).


Should i apply all patches together? (fix for this bug, more log messages, 
backported fix from 3.5 and this new one)


>Anyway do you see the same problem if transparent huge pages are
>disabled?
>(echo never > /sys/kernel/mm/transparent_hugepage/enabled)


# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-06 Thread Michal Hocko
On Thu 06-12-12 01:29:24, azurIt wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
> >should be OK because the page fault should fall back to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
> 
> 
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack mem_cgroup_newpage_charge called from the page
fault. The heavy inlining is not particularly helping here... So there
must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault  operate on a single page
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS so I am getting clueless. Maybe more
debugging will tell us something (the inlining has been reduced for thp
paths which can reduce performance in thp page fault heavy workloads but
this will give us better traces - I hope).

Anyway do you see the same problem if transparent huge pages are
disabled?
(echo never > /sys/kernel/mm/transparent_hugepage/enabled)
---
From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |    6 +++---
 mm/memcontrol.c  |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
   unsigned long address, pmd_t *pmd,
   unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
return 0;
 nomem:
*ptr = NULL;
-   __WARN();
+   __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
return -ENOMEM;
 bypass:
*ptr = NULL;
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-05 Thread azurIt
>OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
>This can only happen if this was an atomic allocation request
>(!__GFP_WAIT) or if oom is not allowed which is the case only for
>transparent huge page allocation.
>The first case can be excluded (in the clean 3.2 stable kernel) because
>all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
>should be OK because the page fault should fall back to a regular page if
>THP allocation/charge fails.
>[/me goes to double check]
>Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
>VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
>instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
>hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
>patch applies to 3.2 without any further modifications. I didn't have
>time to test it but if it helps you we should push this to the stable
>tree.


This, unfortunately, didn't fix the problem :(
http://www.watchdog.sk/lkml/oom_mysqld3


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-05 Thread Michal Hocko
On Wed 05-12-12 02:36:44, azurIt wrote:
> >The following should print the traces when we hand over ENOMEM to the
> >caller. It should catch all charge paths (migration is not covered but
> >that one is not important here). If we don't see any traces from here
> >and there is still global OOM striking then there must be something else
> >to trigger this.
> >Could you test this with the patch which aims at fixing your deadlock,
> >please? I realise that this is a production environment but I do not see
> >anything relevant in the code.
> 
> 
> Michal,
> 
> i think/hope this is what you wanted:
> http://www.watchdog.sk/lkml/oom_mysqld2

Dec  5 02:20:48 server01 kernel: [  380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0()
Dec  5 02:20:48 server01 kernel: [  380.995950] Hardware name: S5000VSA
Dec  5 02:20:48 server01 kernel: [  380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1
Dec  5 02:20:48 server01 kernel: [  380.995954] Call Trace:
Dec  5 02:20:48 server01 kernel: [  380.995960]  [] warn_slowpath_common+0x7a/0xb0
Dec  5 02:20:48 server01 kernel: [  380.995963]  [] warn_slowpath_null+0x1a/0x20
Dec  5 02:20:48 server01 kernel: [  380.995965]  [] T.1146+0x2c1/0x5d0
Dec  5 02:20:48 server01 kernel: [  380.995967]  [] mem_cgroup_charge_common+0x53/0x90
Dec  5 02:20:48 server01 kernel: [  380.995970]  [] mem_cgroup_newpage_charge+0x45/0x50
Dec  5 02:20:48 server01 kernel: [  380.995974]  [] handle_pte_fault+0x609/0x940
Dec  5 02:20:48 server01 kernel: [  380.995978]  [] ? pte_alloc_one+0x3f/0x50
Dec  5 02:20:48 server01 kernel: [  380.995981]  [] handle_mm_fault+0x138/0x260
Dec  5 02:20:48 server01 kernel: [  380.995983]  [] do_page_fault+0x13d/0x460
Dec  5 02:20:48 server01 kernel: [  380.995986]  [] ? do_mmap_pgoff+0x3dc/0x430
Dec  5 02:20:48 server01 kernel: [  380.995988]  [] ? remove_vma+0x5d/0x80
Dec  5 02:20:48 server01 kernel: [  380.995992]  [] page_fault+0x1f/0x30
Dec  5 02:20:48 server01 kernel: [  380.995994] ---[ end trace 25bbb3e634c25b7f ]---
Dec  5 02:20:48 server01 kernel: [  380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec  5 02:20:48 server01 kernel: [  380.996377] apache2 cpuset=uid mems_allowed=0
Dec  5 02:20:48 server01 kernel: [  380.996379] Pid: 5351, comm: apache2 Tainted: G        W    3.2.34-grsec #1
Dec  5 02:20:48 server01 kernel: [  380.996380] Call Trace:
Dec  5 02:20:48 server01 kernel: [  380.996384]  [] dump_header+0x7e/0x1e0
Dec  5 02:20:48 server01 kernel: [  380.996387]  [] ? find_lock_task_mm+0x2f/0x70
Dec  5 02:20:48 server01 kernel: [  380.996389]  [] oom_kill_process+0x85/0x2a0
Dec  5 02:20:48 server01 kernel: [  380.996392]  [] out_of_memory+0xe5/0x200
Dec  5 02:20:48 server01 kernel: [  380.996394]  [] ? pte_alloc_one+0x3f/0x50
Dec  5 02:20:48 server01 kernel: [  380.996397]  [] pagefault_out_of_memory+0xbd/0x110
Dec  5 02:20:48 server01 kernel: [  380.996399]  [] mm_fault_error+0xb6/0x1a0
Dec  5 02:20:48 server01 kernel: [  380.996401]  [] do_page_fault+0x3ee/0x460
Dec  5 02:20:48 server01 kernel: [  380.996403]  [] ? do_mmap_pgoff+0x3dc/0x430
Dec  5 02:20:48 server01 kernel: [  380.996405]  [] ? remove_vma+0x5d/0x80
Dec  5 02:20:48 server01 kernel: [  380.996408]  [] page_fault+0x1f/0x30

OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
This can only happen if this was an atomic allocation request
(!__GFP_WAIT) or if oom is not allowed, which is the case only for
transparent huge page allocation.
The first case can be excluded (in the clean 3.2 stable kernel) because
all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
should be OK because the page fault should fall back to a regular page if
THP allocation/charge fails.
[/me goes to double check]
Hmm, do_huge_pmd_wp_page seems to charge a huge page and fails with
VM_FAULT_OOM without any fallback. We should use do_huge_pmd_wp_page_fallback
instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
patch applies to 3.2 without any further modifications. I didn't have
time to test it but if it helps you we should push this to the stable
tree.
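
The reasoning above boils down to a two-condition check. The sketch below
is a toy model in Python, not kernel code: the flag values are assumed to
match include/linux/gfp.h in 3.2 and should be verified against the tree,
and charge_may_return_enomem is a made-up name used only for illustration.

```python
# Assumed flag values (to be checked against include/linux/gfp.h in 3.2).
GFP_WAIT = 0x10
GFP_IO = 0x40
GFP_FS = 0x80
GFP_KERNEL = GFP_WAIT | GFP_IO | GFP_FS


def charge_may_return_enomem(gfp_mask, oom_allowed):
    """Model the two cases in which a memcg charge hands ENOMEM to the caller.

    Case 1: the request is atomic (no __GFP_WAIT), so it cannot reclaim.
    Case 2: memcg OOM is not allowed for this charge (the THP case).
    """
    atomic = not (gfp_mask & GFP_WAIT)
    return atomic or not oom_allowed


if __name__ == "__main__":
    # Regular page fault: GFP_KERNEL with OOM allowed.
    print(charge_may_return_enomem(GFP_KERNEL, oom_allowed=True))
    # THP charge: GFP_KERNEL but OOM disabled.
    print(charge_may_return_enomem(GFP_KERNEL, oom_allowed=False))
```

With GFP_KERNEL and OOM allowed the model returns False, which is why a
failing single-page charge from the page fault path was unexpected.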
---
From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001
From: David Rientjes 
Date: Tue, 29 May 2012 15:06:23 -0700
Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow

On COW, a new hugepage is allocated and charged to the memcg.  If the
system is oom or the charge to the memcg fails, however, the fault
handler will return VM_FAULT_OOM which results in an oom kill.

Instead, it's possible to fallback to splitting the hugepage so that the
COW results only in an order-0 page being allocated and charged to the
memcg which has a higher liklihood to succeed.  This is expensive
because the hugepage must be split in the page fault handler, but it is
much better than unnecessarily oom killing a process.

Signed-off-by: David Rientjes 
Cc:

Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-04 Thread azurIt
>The following should print the traces when we hand over ENOMEM to the
>caller. It should catch all charge paths (migration is not covered but
>that one is not important here). If we don't see any traces from here
>and there is still global OOM striking then there must be something else
>to trigger this.
>Could you test this with the patch which aims at fixing your deadlock,
>please? I realise that this is a production environment but I do not see
>anything relevant in the code.


Michal,

i think/hope this is what you wanted:
http://www.watchdog.sk/lkml/oom_mysqld2


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-12-03 Thread Michal Hocko
On Fri 30-11-12 17:19:23, Michal Hocko wrote:
[...]
> The important question is why you see VM_FAULT_OOM and whether memcg
> charging failure can trigger that. I do not see how this could happen
> right now because __GFP_NORETRY is not used for user pages (except for
> THP, which disables memcg OOM already), and file backed page faults (aka
> __do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM.
> This is a real head scratcher.

The following should print the traces when we hand over ENOMEM to the
caller. It should catch all charge paths (migration is not covered but
that one is not important here). If we don't see any traces from here
and there is still global OOM striking then there must be something else
to trigger this.
Could you test this with the patch which aims at fixing your deadlock,
please? I realise that this is a production environment but I do not see
anything relevant in the code.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..9e5b56b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
return 0;
 nomem:
*ptr = NULL;
+   __WARN();
return -ENOMEM;
 bypass:
*ptr = NULL;

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>The only strange thing I noticed is that some groups have 0 limit. Is
>this intentional?
>grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
>  3 memory.limit_in_bytes:0


These are users who are not allowed to run anything.


azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
> 
> 
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have 0 limit. Is
this intentional?
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
  3 memory.limit_in_bytes:0
254 memory.limit_in_bytes:104857600
107 memory.limit_in_bytes:157286400
 68 memory.limit_in_bytes:209715200
 10 memory.limit_in_bytes:262144000
 28 memory.limit_in_bytes:314572800
  1 memory.limit_in_bytes:346030080
  1 memory.limit_in_bytes:524288000
  2 memory.limit_in_bytes:9223372036854775807
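
For reference, roughly the same tally can be produced without the shell
pipeline, with the byte values turned into something readable. This is only
a sketch: it assumes the dump consists of `path:value` lines as produced by
`grep . -r`, and the function names and sample paths are hypothetical.

```python
from collections import Counter

# Value shown above for the two groups without an effective limit.
UNLIMITED = 9223372036854775807


def tally_limits(lines, exclude="uid"):
    """Count cgroups per memory.limit_in_bytes value, like sort | uniq -c."""
    counts = Counter()
    for line in lines:
        if exclude and exclude in line:  # mirrors `grep -v uid`
            continue
        _, sep, value = line.strip().rpartition("memory.limit_in_bytes:")
        if sep and value.isdigit():
            counts[int(value)] += 1
    return counts


def format_limit(value):
    """Render a limit in MiB, or 'unlimited' for the maximum value."""
    return "unlimited" if value >= UNLIMITED else f"{value // 2**20} MiB"


if __name__ == "__main__":
    sample = [  # hypothetical paths, values taken from the listing above
        "cgroups/a/memory.limit_in_bytes:0",
        "cgroups/b/memory.limit_in_bytes:104857600",
        "cgroups/c/memory.limit_in_bytes:104857600",
        "cgroups/d/memory.limit_in_bytes:9223372036854775807",
    ]
    for value, count in sorted(tally_limits(sample).items()):
        print(f"{count:4d} {format_limit(value)}")
```

Read this way, the limits in the listing are 0, 100, 150, 200, 250, 300, 330
and 500 MiB, plus two groups with no limit.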
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>Could you also post your complete containers configuration, maybe there
>is something strange in there (basically grep . -r YOUR_CGROUP_MNT
>except for tasks files which are of no use right now).


Here it is:
http://www.watchdog.sk/lkml/cgroups.gz


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 16:59:37, azurIt wrote:
> >> Here is the full boot log:
> >> www.watchdog.sk/lkml/kern.log
> >
> >The log is not complete. Could you paste the complete dmesg output? Or
> >even better, do you have logs from the previous run?
> 
> 
> What is missing there? All kernel messages are logged to
> /var/log/kern.log (it's the same as dmesg); dmesg itself was already
> overwritten by other messages. I think that's everything the kernel printed.

Early boot messages are missing - so exactly the BIOS memory map I was
asking for. As NUMA has been excluded it is probably not that
relevant anymore.
The important question is why you see VM_FAULT_OOM and whether memcg
charging failure can trigger that. I do not see how this could happen
right now because __GFP_NORETRY is not used for user pages (except for
THP, which disables memcg OOM already), and file backed page faults (aka
__do_fault) use mem_cgroup_newpage_charge, which doesn't disable OOM.
This is a real head scratcher.

Could you also post your complete containers configuration, maybe there
is something strange in there (basically grep . -r YOUR_CGROUP_MNT
except for tasks files which are of no use right now).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>> Here is the full boot log:
>> www.watchdog.sk/lkml/kern.log
>
>The log is not complete. Could you paste the complete dmesg output? Or
>even better, do you have logs from the previous run?


What is missing there? All kernel messages are logged to /var/log/kern.log 
(it's the same as dmesg); dmesg itself was already overwritten by other messages. 
I think that's everything the kernel printed.


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 16:08:11, azurIt wrote:
> >The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
> >of the memory above 4G or you have a numa machine and the rest of the
> >memory is at other node. Could you post your memory map printed during
> >the boot? (e820: BIOS-provided physical RAM map: and following lines)
> 
> 
> Here is the full boot log:
> www.watchdog.sk/lkml/kern.log

The log is not complete. Could you paste the complete dmesg output? Or
even better, do you have logs from the previous run?

> >You have mentioned that you are comounting with cpuset. If this happens
> >to be a NUMA machine have you made the access to all nodes available?
> >Also what does /proc/sys/vm/zone_reclaim_mode say?
> 
> 
> Don't really know what NUMA means and which nodes you are talking
> about, sorry :(

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
 
> # cat /proc/sys/vm/zone_reclaim_mode
> cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory

OK, so NUMA is not enabled.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 16:03:47, Michal Hocko wrote:
[...]
> Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation
> from the page fault? Huh this shouldn't happen - ever.

OK, it starts making sense now. The message came from
pagefault_out_of_memory which doesn't have gfp nor the required node
information any longer. This suggests that VM_FAULT_OOM has been
returned by the fault handler. So this hasn't been triggered by the page
fault allocator.
I am wondering whether this could be caused by the patch but the effect
of that one should be limited to the write path (unlike the later version
for the -mm tree which hooks into shmem as well).

Will have to think about it some more.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
>of the memory above 4G or you have a numa machine and the rest of the
>memory is at other node. Could you post your memory map printed during
>the boot? (e820: BIOS-provided physical RAM map: and following lines)


Here is the full boot log:
www.watchdog.sk/lkml/kern.log


>You have mentioned that you are comounting with cpuset. If this happens
>to be a NUMA machine have you made the access to all nodes available?
>Also what does /proc/sys/vm/zone_reclaim_mode say?


Don't really know what NUMA means and which nodes you are talking about, sorry :(

# cat /proc/sys/vm/zone_reclaim_mode
cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory



azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 15:44:31, Michal Hocko wrote:
> On Fri 30-11-12 14:44:27, azurIt wrote:
> > >Anyway your system is under both global and local memory pressure. You
> > >didn't see apache going down previously because it was probably the one
> > >which was stuck and could be killed.
> > >Anyway you need to setup your system more carefully.
> > 
> > 
> > There is, also, an evidence that system has enough of memory! :) Just
> > take column 'rss' from process list in OOM message and sum it - you
> > will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> > 14.
> 
> Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
> is hardly touched:
> Nov 30 02:53:56 server01 kernel: [  818.241291] DMA32 free:2523636kB 
> min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB 
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
> isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB 
> mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB 
> pages_scanned:0 all_unreclaimable? no
> 
> The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
> of the memory above 4G or you have a numa machine and the rest of the
> memory is at other node. Could you post your memory map printed during
> the boot? (e820: BIOS-provided physical RAM map: and following lines)
> 
> There is also ZONE_NORMAL which is also not used much
> Nov 30 02:53:56 server01 kernel: [  818.242163] Normal free:6924716kB 
> min:12512kB low:15640kB high:18768kB active_anon:1463128kB 
> inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB 
> unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB 
> mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB 
> slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB 
> pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 
> all_unreclaimable? no
> 
> You have mentioned that you are comounting with cpuset. If this happens
> to be a NUMA machine have you made the access to all nodes available?

And now that I am looking at the oom message more closely I can see
Nov 30 02:53:56 server01 kernel: [  818.232812] apache2 invoked oom-killer: 
gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Nov 30 02:53:56 server01 kernel: [  818.233029] apache2 cpuset=uid 
mems_allowed=0
Nov 30 02:53:56 server01 kernel: [  818.233159] Pid: 9247, comm: apache2 Not 
tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [  818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [  818.233470]  [] 
dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.233600]  [] ? 
find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [  818.233721]  [] 
oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [  818.233842]  [] 
out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [  818.233963]  [] ? 
pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [  818.234082]  [] 
pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [  818.234204]  [] 
mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [  818.235886]  [] 
do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [  818.236006]  [] ? 
vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [  818.236124]  [] ? 
do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [  818.236244]  [] ? 
gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.236367]  [] 
page_fault+0x1f/0x30

Which is interesting from two perspectives. Only the first node (Node 0)
is allowed, which would suggest that the cpuset controller is not
configured to allow all nodes. It is still surprising that Node 0 wouldn't
have any memory (I would expect ZONE_DMA32 to be sitting there).

Anyway, the more interesting thing is the gfp_mask: a GFP_NOWAIT allocation
from the page fault? Huh, this shouldn't happen - ever.
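
For readers decoding the gfp_mask=0x0 in the report above, here is a minimal
sketch. It assumes the flag bit values of the 3.2-era include/linux/gfp.h;
the SKETCH_ names and the helper functions are illustrative only and are not
kernel API, so please verify the values against your own tree. Under that
assumption GFP_NOWAIT has no bits set at all, which is why a mask of 0x0
reads as GFP_NOWAIT, while GFP_KERNEL carries the wait bit.

```c
#include <stdbool.h>

/* Assumed 3.2-era bit values (not taken from this thread; check gfp.h). */
#define SKETCH_GFP_WAIT    0x10u   /* allocation may sleep */
#define SKETCH_GFP_HIGH    0x20u   /* may dip into emergency reserves */
#define SKETCH_GFP_IO      0x40u   /* may start physical I/O */
#define SKETCH_GFP_FS      0x80u   /* may call into the filesystem */
#define SKETCH_GFP_NORETRY 0x1000u /* reused by the patch as GFP_MEMCG_NO_OOM */

/* Composite masks built the same way the kernel headers build them. */
#define SKETCH_GFP_ATOMIC  (SKETCH_GFP_HIGH)
#define SKETCH_GFP_NOWAIT  (SKETCH_GFP_ATOMIC & ~SKETCH_GFP_HIGH)
#define SKETCH_GFP_KERNEL  (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)

/* True if an allocation with this mask is allowed to sleep and reclaim. */
static inline bool sketch_gfp_may_sleep(unsigned int mask)
{
	return (mask & SKETCH_GFP_WAIT) != 0;
}

/* True if the mask is exactly the (empty) GFP_NOWAIT mask. */
static inline bool sketch_gfp_is_nowait(unsigned int mask)
{
	return mask == SKETCH_GFP_NOWAIT;
}
```

With these assumed values, sketch_gfp_is_nowait(0x0) is true and
sketch_gfp_may_sleep(0x0) is false, whereas a page fault would normally be
expected to allocate with a sleeping mask such as SKETCH_GFP_KERNEL.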

> Also what does /proc/sys/vm/zone_reclaim_mode says?
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure. You
> >didn't see apache going down previously because it was probably the one
> >which was stuck and could be killed.
> >Anyway you need to setup your system more carefully.
> 
> 
> There is also evidence that the system has enough memory! :) Just
> take the 'rss' column from the process list in the OOM message and sum
> it - you will get 2489911. It's probably in KB so it's about 2.4 GB.
> The system has 14 GB of RAM so this also matches the data on my graph -
> 2.4 is about 17% of 14.

Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
is hardly touched:
Nov 30 02:53:56 server01 kernel: [  818.241291] DMA32 free:2523636kB min:2672kB 
low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB 
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The DMA32 zone usually fills up the first 4G unless your HW remaps the
rest of the memory above 4G or you have a NUMA machine and the rest of
the memory is on another node. Could you post your memory map printed during
the boot? (e820: BIOS-provided physical RAM map: and following lines)

There is also ZONE_NORMAL which is also not used much
Nov 30 02:53:56 server01 kernel: [  818.242163] Normal free:6924716kB 
min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB 
active_file:1803964kB inactive_file:1072628kB unevictable:3924kB 
isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB 
dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB 
slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB 
pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? no
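
To make "hardly touched" and "not used much" concrete, here is a small
sketch that compares a zone's free memory with its high watermark and with
its present size, using the kB figures from the two log lines above. The
function names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the zone has more free memory than its high watermark (both in kB). */
static inline bool zone_above_high(unsigned long long free_kb,
				   unsigned long long high_kb)
{
	return free_kb > high_kb;
}

/* Whole-number percentage of the zone that is free (rounded down). */
static inline unsigned int zone_free_percent(unsigned long long free_kb,
					     unsigned long long present_kb)
{
	assert(present_kb > 0);
	return (unsigned int)(free_kb * 100 / present_kb);
}
```

Fed with the DMA32 values (free 2523636, high 4008, present 2542248) this
gives 99% free, and with the Normal values (free 6924716, high 18768,
present 11893760) it gives 58% free; both zones are far above their high
watermarks.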

You have mentioned that you are comounting with cpuset. If this happens
to be a NUMA machine have you made the access to all nodes available?
Also what does /proc/sys/vm/zone_reclaim_mode says?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.


There is also evidence that the system has enough memory! :) Just take the
'rss' column from the process list in the OOM message and sum it - you will
get 2489911. It's probably in KB so it's about 2.4 GB. The system has 14 GB
of RAM so this also matches the data on my graph - 2.4 is about 17% of 14.
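
The unit of the summed 'rss' column matters for this estimate. Below is a
small sketch of both readings. It assumes a 4 kB page size, and it treats
"rss is counted in pages rather than kB" only as a possibility that should
be checked against the kernel source for the running version. The function
names are illustrative.

```c
#include <assert.h>

/* Convert a page count to kB for a given page size in kB (assumed 4 here). */
static inline unsigned long long rss_pages_to_kb(unsigned long long pages,
						 unsigned int page_kb)
{
	assert(page_kb > 0);
	return pages * page_kb;
}

/* Convert kB to whole MiB (rounded down). */
static inline unsigned long long kb_to_mib(unsigned long long kb)
{
	return kb / 1024;
}
```

If 2489911 is already in kB, kb_to_mib(2489911) gives 2431 MiB, which is the
roughly 2.4 GB quoted above. If it is a page count instead,
kb_to_mib(rss_pages_to_kb(2489911, 4)) gives 9726 MiB, i.e. about 9.5 GiB,
and shared pages would be counted once per process in such a sum.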

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread azurIt
>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.


No, it wasn't, I'm 1000% sure (I was on SSH). Here is the memory usage graph
from that system at that time:
http://www.watchdog.sk/lkml/memory.png

The blank part is rebooting into new kernel. MySQL server was killed several 
times, then i rebooted into previous kernel and problem was gone (not a single 
MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30.


>
>> Maybe I should mention that the MySQL server has its own cgroup (called
>> 'mysql') but with no limits on any resources.
>
>Where is that group in the hierarchy?



In root.


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-30 Thread Michal Hocko
On Fri 30-11-12 03:29:18, azurIt wrote:
> >Here we go with the patch for 3.2.34. Could you test with this one,
> >please?
> 
> 
> Michal, unfortunately i had to boot to another kernel because the one
> with this patch keeps killing my MySQL server :( it was, probably,
> doing it on OOM in any cgroup - looks like OOM was not choosing
> processes only from cgroup which is out of memory. Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing a global OOM:
Nov 30 02:53:56 server01 kernel: [  818.233159] Pid: 9247, comm: apache2 Not 
tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [  818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [  818.233470]  [] 
dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.233600]  [] ? 
find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [  818.233721]  [] 
oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [  818.233842]  [] 
out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [  818.233963]  [] ? 
pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [  818.234082]  [] 
pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [  818.234204]  [] 
mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [  818.235886]  [] 
do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [  818.236006]  [] ? 
vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [  818.236124]  [] ? 
do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [  818.236244]  [] ? 
gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [  818.236367]  [] 
page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [  818.356297] Out of memory: Kill process 
2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [  818.356493] Killed process 2188 (mysqld) 
total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have memcg oom killer:
Nov 30 02:53:56 server01 kernel: [  818.375717] Task in /1037/uid killed as a 
result of limit of /1037
Nov 30 02:53:56 server01 kernel: [  818.375886] memory: usage 102400kB, limit 
102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [  818.376008] memory+swap: usage 102400kB, 
limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting kicked in as
well, because I cannot associate all the OOM messages with a specific OOM
event.

Anyway your system is under both global and local memory pressure. You
didn't see apache going down previously because it was probably the one
which was stuck and could be killed.
Anyway you need to setup your system more carefully.

> Maybe I should mention that the MySQL server has its own cgroup (called
> 'mysql') but with no limits on any resources.

Where is that group in the hierarchy?
> 
> azurIt

-- 
Michal Hocko
SUSE Labs


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-29 Thread azurIt
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


Michal, unfortunately i had to boot to another kernel because the one with this 
patch keeps killing my MySQL server :( it was, probably, doing it on OOM in any 
cgroup - looks like OOM was not choosing processes only from cgroup which is 
out of memory. Here is the log from syslog: 
http://www.watchdog.sk/lkml/oom_mysqld

Maybe I should mention that the MySQL server has its own cgroup (called
'mysql') but with no limits on any resources.

azurIt


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-29 Thread azurIt
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?


I installed a kernel with this patch and will report back if the problem
occurs again, or in a few weeks if everything is ok. Thank you!

azurIt


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread azurIt
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

Michal, regarding your conversation with Johannes Weiner, should I try this
patch or not?

azur


Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked

2012-11-26 Thread Michal Hocko
Here we go with the patch for 3.2.34. Could you test with this one,
please?
---
From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents another task from
terminating because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0  # takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0   # takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. No OOM from this path, except for fixing the bug, also makes some
sense as we really do not want to cause an OOM because of a page cache
usage.
As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable to the OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because no user
accounted allocation uses this flag except for THP, which has memcg oom
disabled already.

Reported-by: azurIt 
Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32  __GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM   __GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct 
mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page 
*page,
return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+   struct mm_struct *mm, gfp_t gfp_mask)
+{
+   return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct 
address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));
 
-   error = mem_cgroup_cache_charge(page, current->mm,
+   /*
+* Cannot trigger OOM even if gfp_mask would allow that normally
+