Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 08-02-13 17:29:18, Michal Hocko wrote:
[...]
> OK, I have checked the allocator slow path and you are right even
> GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
> OOM killed task blocked on down_write(mmap_sem) while the page fault
> handler holding mmap_sem for reading and allocating a new page without
> any progress.

And now that I think about it some more it sounds like it shouldn't be
possible because allocator would fail because it would see TIF_MEMDIE
(OOM killer kills all threads that share the same mm).
But maybe there are other locks that are dangerous, but I think that the
risk is pretty low.
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Thu 07-02-13 20:27:00, Greg Thelen wrote:
> On Tue, Feb 05 2013, Michal Hocko wrote:
>
> > On Tue 05-02-13 10:09:57, Greg Thelen wrote:
> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >>
> >> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
> >> >> On Tue, Feb 05 2013, Michal Hocko wrote:
> >> >>
> >> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
> >> >> > [...]
> >> >> >> Just to be sure - am i supposed to apply this two patches?
> >> >> >> http://watchdog.sk/lkml/patches/
> >> >> >
> >> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >> >> > mentioned in a follow up email. Here is the full patch:
> >> >> > ---
> >> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
> >> >> > From: Michal Hocko
> >> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
> >> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> >> >> >
> >> >> > memcg oom killer might deadlock if the process which falls down to
> >> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
> >> >> > terminate because it is blocked on the very same lock.
> >> >> > This can happen when a write system call needs to allocate a page but
> >> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
> >> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
> >> >> > have been reclaimed already) and the process selected by memcg OOM
> >> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
> >> >> >
> >> >> > Process A
> >> >> > [] do_truncate+0x58/0xa0		# takes i_mutex
> >> >> > [] do_last+0x250/0xa30
> >> >> > [] path_openat+0xd7/0x440
> >> >> > [] do_filp_open+0x49/0xa0
> >> >> > [] do_sys_open+0x106/0x240
> >> >> > [] sys_open+0x20/0x30
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >> >
> >> >> > Process B
> >> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
> >> >> > [] T.1146+0x5ab/0x5c0
> >> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
> >> >> > [] add_to_page_cache_locked+0x4c/0x140
> >> >> > [] add_to_page_cache_lru+0x22/0x50
> >> >> > [] grab_cache_page_write_begin+0x8b/0xe0
> >> >> > [] ext3_write_begin+0x88/0x270
> >> >> > [] generic_file_buffered_write+0x116/0x290
> >> >> > [] __generic_file_aio_write+0x27c/0x480
> >> >> > [] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
> >> >> > [] do_sync_write+0xea/0x130
> >> >> > [] vfs_write+0xf3/0x1f0
> >> >> > [] sys_write+0x51/0x90
> >> >> > [] system_call_fastpath+0x18/0x1d
> >> >> > [] 0x
> >> >>
> >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
> >> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me
> >> >> think that this deadlock is also possible in the page allocator even
> >> >> before getting to add_to_page_cache_lru. no?
> >> >
> >> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
> >> > and it shouldn't be called from the pageout path so __page_cache_alloc
> >> > should be safe.
> >>
> >> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex.
> >> My concern is that __page_cache_alloc() will invoke the oom killer and
> >> select a victim which wants i_mutex. This victim will deadlock because
> >> the oom killer caller already holds i_mutex.
> >
> > That would be true for the memcg oom because that one is blocking but
> > the global oom just puts the allocator into sleep for a while and then
> > the allocator should back off eventually (unless this is NOFAIL
> > allocation). I would need to look closer whether this is really the case
> > - I haven't seen that allocator code path for a while...
>
> I think the page allocator can loop forever waiting for an oom victim to
> terminate even without NOFAIL. Especially if the oom victim wants a
> resource exclusively held by the allocating thread (e.g. i_mutex). It
> looks like the same deadlock you describe is also possible (though more
> rare) without memcg.

OK, I have checked the allocator slow path and you are right even
GFP_KERNEL will not fail. This can lead to similar deadlocks - e.g.
OOM killed task blocked on down_write(mmap_sem) while the page fault
handler holding mmap_sem for reading and allocating a new page without
any progress.
Luckily there are memory reserves where the allocator falls back
eventually so the allocation should be able to get some memory and
release the lock. There is still a theoretical chance this would block
though. This sounds like a corner case though so I wouldn't care about
it very much.

> If the looping thread is an eligible oom victim (i.e. not oom disabled,
> not a kernel thread, etc) then the page allocator can return NULL so
> long as NOFAIL is not used. So any allocator which is able to call the
> oom killer and is not oom disabled (kernel thread, etc) is already
> exposed to the possibility of page allocator failure. So if the page
> allocator
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 08-02-13 10:40:13, KAMEZAWA Hiroyuki wrote:
> (2013/02/07 20:01), Kamezawa Hiroyuki wrote:
[...]
> >Hmm. do we need to increase the "limit" virtually at memcg oom until
> >the oom-killed process dies ?
>
> Here is my naive idea...

and the next step would be
http://en.wikipedia.org/wiki/Credit_default_swap :P

But seriously now. The idea is not bad at all. This implementation
would need some tweaks to work though (e.g. you would need to wake oom
sleepers when you get a loan - because those are ones which can block
the resource). We should also give the borrowed charges only to those
who would oom to prevent from stealing.
I think that it should be mem_cgroup_out_of_memory who establishes the
loan and it can have a look at how much memory the killed task frees -
e.g. some portion of get_mm_rss() or a more precise but much more
expensive traversing via private vmas and check whether they charged
memory from the target memcg hierarchy (this is a slow path anyway).
But who knows maybe a fixed 2MB would work out as well.

Thanks!

> ==
> From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki
> Date: Fri, 8 Feb 2013 10:43:52 +0900
> Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.
>
> When an OOM happens, a task is killed and resources will be freed.
>
> A problem here is that a task, which is oom-killed, may wait for
> some other resource in which memory resource is required. Some thread
> waits for free memory may holds some mutex and oom-killed process
> wait for the mutex.
>
> To avoid this, relaxing charged memory by giving virtual resource
> can be a help. The system can get back it at uncharge().
> This is a sample native implementation.
>
> Signed-off-by: KAMEZAWA Hiroyuki
> ---
>  mm/memcontrol.c | 79 ++-
>  1 file changed, 73 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 25ac5f4..4dea49a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -301,6 +301,9 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
> +	/* extra resource at emergency situation */
> +	unsigned long	loan;
> +	spinlock_t	loan_lock;
>  	/* protect arrays of thresholds */
>  	struct mutex thresholds_lock;
> @@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  	mem_cgroup_iter_break(root_memcg, victim);
>  	return total;
>  }
> +/*
> + * When a memcg is in OOM situation, this lack of resource may cause deadlock
> + * because of complicated lock dependency(i_mutex...). To avoid that, we
> + * need extra resource or avoid charging.
> + *
> + * A memcg can request resource in an emergency state. We call it as loan.
> + * A memcg will return a loan when it does uncharge resource. We disallow
> + * double-loan and moving task to other groups until the loan is fully
> + * returned.
> + *
> + * Note: the problem here is that we cannot know what amount resouce should
> + * be necessary to exiting an emergency state.
> + */
> +#define LOAN_MAX		(2 * 1024 * 1024)
> +
> +static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
> +{
> +	u64 usage;
> +	unsigned long amount;
> +
> +	amount = LOAN_MAX;
> +
> +	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
> +	if (amount > usage /2 )
> +		amount = usage / 2;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		spin_unlock(&memcg->loan_lock);
> +		return;
> +	}
> +	memcg->loan = amount;
> +	res_counter_uncharge(&memcg->res, amount);
> +	if (do_swap_account)
> +		res_counter_uncharge(&memcg->memsw, amount);
> +	spin_unlock(&memcg->loan_lock);
> +}
> +
> +/* return amount of free resource which can be uncharged */
> +static unsigned long
> +mem_cgroup_may_return_loan(struct mem_cgroup *memcg, unsigned long val)
> +{
> +	unsigned long tmp;
> +	/* we don't care small race here */
> +	if (unlikely(!memcg->loan))
> +		return val;
> +	spin_lock(&memcg->loan_lock);
> +	if (memcg->loan) {
> +		tmp = min(memcg->loan, val);
> +		memcg->loan -= tmp;
> +		val -= tmp;
> +	}
> +	spin_unlock(&memcg->loan_lock);
> +	return val;
> +}
> +
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
> @@ -2182,6 +2240,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
>  	if (need_to_kill) {
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, mask, order);
> +		mem_cgroup_make_loan(memcg);
>  	} else {
>  		schedule();
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> @@ -2748,6 +2807,8 @@ static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
>  	if
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue, Feb 05 2013, Michal Hocko wrote:

> On Tue 05-02-13 10:09:57, Greg Thelen wrote:
>> On Tue, Feb 05 2013, Michal Hocko wrote:
>>
>> > On Tue 05-02-13 08:48:23, Greg Thelen wrote:
>> >> On Tue, Feb 05 2013, Michal Hocko wrote:
>> >>
>> >> > On Tue 05-02-13 15:49:47, azurIt wrote:
>> >> > [...]
>> >> >> Just to be sure - am i supposed to apply this two patches?
>> >> >> http://watchdog.sk/lkml/patches/
>> >> >
>> >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>> >> > mentioned in a follow up email. Here is the full patch:
>> >> > ---
>> >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001
>> >> > From: Michal Hocko
>> >> > Date: Mon, 26 Nov 2012 11:47:57 +0100
>> >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
>> >> >
>> >> > memcg oom killer might deadlock if the process which falls down to
>> >> > mem_cgroup_handle_oom holds a lock which prevents other task to
>> >> > terminate because it is blocked on the very same lock.
>> >> > This can happen when a write system call needs to allocate a page but
>> >> > the allocation hits the memcg hard limit and there is nothing to reclaim
>> >> > (e.g. there is no swap or swap limit is hit as well and all cache pages
>> >> > have been reclaimed already) and the process selected by memcg OOM
>> >> > killer is blocked on i_mutex on the same inode (e.g. truncate it).
>> >> >
>> >> > Process A
>> >> > [] do_truncate+0x58/0xa0		# takes i_mutex
>> >> > [] do_last+0x250/0xa30
>> >> > [] path_openat+0xd7/0x440
>> >> > [] do_filp_open+0x49/0xa0
>> >> > [] do_sys_open+0x106/0x240
>> >> > [] sys_open+0x20/0x30
>> >> > [] system_call_fastpath+0x18/0x1d
>> >> > [] 0x
>> >> >
>> >> > Process B
>> >> > [] mem_cgroup_handle_oom+0x241/0x3b0
>> >> > [] T.1146+0x5ab/0x5c0
>> >> > [] mem_cgroup_cache_charge+0xbe/0xe0
>> >> > [] add_to_page_cache_locked+0x4c/0x140
>> >> > [] add_to_page_cache_lru+0x22/0x50
>> >> > [] grab_cache_page_write_begin+0x8b/0xe0
>> >> > [] ext3_write_begin+0x88/0x270
>> >> > [] generic_file_buffered_write+0x116/0x290
>> >> > [] __generic_file_aio_write+0x27c/0x480
>> >> > [] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
>> >> > [] do_sync_write+0xea/0x130
>> >> > [] vfs_write+0xf3/0x1f0
>> >> > [] sys_write+0x51/0x90
>> >> > [] system_call_fastpath+0x18/0x1d
>> >> > [] 0x
>> >>
>> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into
>> >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me
>> >> think that this deadlock is also possible in the page allocator even
>> >> before getting to add_to_page_cache_lru. no?
>> >
>> > I am not that familiar with VFS but i_mutex is a high level lock AFAIR
>> > and it shouldn't be called from the pageout path so __page_cache_alloc
>> > should be safe.
>>
>> I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex.
>> My concern is that __page_cache_alloc() will invoke the oom killer and
>> select a victim which wants i_mutex. This victim will deadlock because
>> the oom killer caller already holds i_mutex.
>
> That would be true for the memcg oom because that one is blocking but
> the global oom just puts the allocator into sleep for a while and then
> the allocator should back off eventually (unless this is NOFAIL
> allocation). I would need to look closer whether this is really the case
> - I haven't seen that allocator code path for a while...

I think the page allocator can loop forever waiting for an oom victim to
terminate even without NOFAIL. Especially if the oom victim wants a
resource exclusively held by the allocating thread (e.g. i_mutex). It
looks like the same deadlock you describe is also possible (though more
rare) without memcg.

If the looping thread is an eligible oom victim (i.e. not oom disabled,
not a kernel thread, etc) then the page allocator can return NULL so
long as NOFAIL is not used. So any allocator which is able to call the
oom killer and is not oom disabled (kernel thread, etc) is already
exposed to the possibility of page allocator failure. So if the page
allocator could detect the deadlock, then it could safely return NULL.
Maybe after looping N times without forward progress the page allocator
should consider failing unless NOFAIL is given.

Switching back to the memcg oom situation, can we similarly return NULL
if memcg oom kill has been tried a reasonable number of times? Simply
failing the memcg charge with ENOMEM seems easier to support than
exceeding limit (Kame's loan patch).
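The "give up after N rounds without forward progress" idea from the
message above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: struct
toy_memcg, toy_oom_kill(), toy_try_charge() and MAX_OOM_RETRIES are all
hypothetical names, and the thread itself only says "N times" without
picking a value. A victim that cannot exit because it waits on a lock
held by the charging task is modelled as a kill that frees nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retry bound; the thread only says "N times". */
#define MAX_OOM_RETRIES 5

/* Toy model of a memcg: usage and hard limit in bytes. */
struct toy_memcg {
	size_t usage;
	size_t limit;
	/* Bytes each simulated OOM kill frees; 0 models a stuck victim. */
	size_t freed_per_kill;
};

/* Simulated OOM kill: returns true if any memory was freed. */
static bool toy_oom_kill(struct toy_memcg *cg)
{
	size_t freed = cg->freed_per_kill < cg->usage ?
		       cg->freed_per_kill : cg->usage;

	cg->usage -= freed;
	return freed > 0;
}

/*
 * Try to charge nr_bytes against the limit. Without nofail, the charge
 * fails with -ENOMEM after MAX_OOM_RETRIES consecutive kills that made
 * no progress. With nofail it keeps looping, which is exactly the
 * deadlock-prone behaviour discussed in the thread.
 */
static int toy_try_charge(struct toy_memcg *cg, size_t nr_bytes, bool nofail)
{
	int stalled = 0;

	assert(cg->usage <= cg->limit);
	for (;;) {
		if (nr_bytes <= cg->limit - cg->usage) {
			cg->usage += nr_bytes;
			return 0;
		}
		if (toy_oom_kill(cg))
			stalled = 0;	/* forward progress, reset counter */
		else if (!nofail && ++stalled >= MAX_OOM_RETRIES)
			return -ENOMEM;	/* let the caller unwind its locks */
	}
}
```

Returning -ENOMEM lets the caller drop whatever lock the victim is
waiting for (i_mutex in the traces above), which is the point of the
proposal. Whether every caller copes with that error is a separate
question that the model does not answer.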
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
(2013/02/07 21:31), Michal Hocko wrote:
>On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
>>(2013/02/06 23:01), Michal Hocko wrote:
>>>On Wed 06-02-13 02:17:21, azurIt wrote:
>>>>>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>>>>>mentioned in a follow up email. Here is the full patch:
>>>>
>>>>
>>>>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>>>>http://www.watchdog.sk/lkml/oom_mysqld6
>>>
>>>[...]
>>>WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>>>Hardware name: S5000VSA
>>>gfp_mask:4304 nr_pages:1 oom:0 ret:2
>>>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>>>Call Trace:
>>>  [] warn_slowpath_common+0x7a/0xb0
>>>  [] warn_slowpath_fmt+0x46/0x50
>>>  [] ? mem_cgroup_margin+0x73/0xa0
>>>  [] T.1149+0x2d9/0x610
>>>  [] ? blk_finish_plug+0x18/0x50
>>>  [] mem_cgroup_cache_charge+0xc4/0xf0
>>>  [] add_to_page_cache_locked+0x4f/0x140
>>>  [] add_to_page_cache_lru+0x22/0x50
>>>  [] filemap_fault+0x252/0x4f0
>>>  [] __do_fault+0x78/0x5a0
>>>  [] handle_pte_fault+0x84/0x940
>>>  [] ? vma_prio_tree_insert+0x30/0x50
>>>  [] ? vma_link+0x88/0xe0
>>>  [] handle_mm_fault+0x138/0x260
>>>  [] do_page_fault+0x13d/0x460
>>>  [] ? do_mmap_pgoff+0x3dc/0x430
>>>  [] page_fault+0x1f/0x30
>>>---[ end trace 8817670349022007 ]---
>>>apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>>>apache2 cpuset=uid mems_allowed=0
>>>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>>>Call Trace:
>>>  [] dump_header+0x7e/0x1e0
>>>  [] ? find_lock_task_mm+0x2f/0x70
>>>  [] oom_kill_process+0x85/0x2a0
>>>  [] out_of_memory+0xe5/0x200
>>>  [] pagefault_out_of_memory+0xbd/0x110
>>>  [] mm_fault_error+0xb6/0x1a0
>>>  [] do_page_fault+0x3ee/0x460
>>>  [] ? do_mmap_pgoff+0x3dc/0x430
>>>  [] page_fault+0x1f/0x30
>>>
>>>The first trace comes from the debugging WARN and it clearly points to
>>>a file fault path. __do_fault pre-charges a page in case we need to
>>>do CoW (copy-on-write) for the returned page. This one falls back to
>>>memcg OOM and never returns ENOMEM as I have mentioned earlier.
>>>However, the fs fault handler (filemap_fault here) can fallback to
>>>page_cache_read if the readahead (do_sync_mmap_readahead) fails
>>>to get page to the page cache. And we can see this happening in
>>>the first trace. page_cache_read then calls add_to_page_cache_lru
>>>and eventually gets to add_to_page_cache_locked which calls
>>>mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>>>happen. This ENOMEM gets to the fault handler and kaboom.
>>>
>>
>>Hmm. do we need to increase the "limit" virtually at memcg oom until
>>the oom-killed process dies ? It may be doable by increasing stock->cache
>>of each cpu. I think kernel can offer extra virtual charge up to
>>oom-killed process's memory usage.
>
>If we can guarantee that the overflow charges do not exceed the memory
>usage of the killed process then this would work. The question is, how
>do we find out how much we can overflow. immigrate_on_move will play
>some role as well as the amount of the shared memory. I am afraid this
>would get too complex. Nevertheless the idea is nice.

Yes, that's the problem. If we don't do in correct way, resource usage
underflow can happen. I guess we can count it per task_struct at
charging page-faulted anon pages.

_Or_ in other consideration, for example, we do charge 1MB per thread
regardless of its memory usage. And use it as a security at OOM-killing.
Implementation will be easy but explanation may be difficult..

Thanks,
-Kame
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
(2013/02/07 20:01), Kamezawa Hiroyuki wrote:
>(2013/02/06 23:01), Michal Hocko wrote:
>>On Wed 06-02-13 02:17:21, azurIt wrote:
>>>>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>>>>mentioned in a follow up email. Here is the full patch:
>>>
>>>
>>>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>>>http://www.watchdog.sk/lkml/oom_mysqld6
>>
>>[...]
>>WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>>Hardware name: S5000VSA
>>gfp_mask:4304 nr_pages:1 oom:0 ret:2
>>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>>Call Trace:
>>  [] warn_slowpath_common+0x7a/0xb0
>>  [] warn_slowpath_fmt+0x46/0x50
>>  [] ? mem_cgroup_margin+0x73/0xa0
>>  [] T.1149+0x2d9/0x610
>>  [] ? blk_finish_plug+0x18/0x50
>>  [] mem_cgroup_cache_charge+0xc4/0xf0
>>  [] add_to_page_cache_locked+0x4f/0x140
>>  [] add_to_page_cache_lru+0x22/0x50
>>  [] filemap_fault+0x252/0x4f0
>>  [] __do_fault+0x78/0x5a0
>>  [] handle_pte_fault+0x84/0x940
>>  [] ? vma_prio_tree_insert+0x30/0x50
>>  [] ? vma_link+0x88/0xe0
>>  [] handle_mm_fault+0x138/0x260
>>  [] do_page_fault+0x13d/0x460
>>  [] ? do_mmap_pgoff+0x3dc/0x430
>>  [] page_fault+0x1f/0x30
>>---[ end trace 8817670349022007 ]---
>>apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>>apache2 cpuset=uid mems_allowed=0
>>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>>Call Trace:
>>  [] dump_header+0x7e/0x1e0
>>  [] ? find_lock_task_mm+0x2f/0x70
>>  [] oom_kill_process+0x85/0x2a0
>>  [] out_of_memory+0xe5/0x200
>>  [] pagefault_out_of_memory+0xbd/0x110
>>  [] mm_fault_error+0xb6/0x1a0
>>  [] do_page_fault+0x3ee/0x460
>>  [] ? do_mmap_pgoff+0x3dc/0x430
>>  [] page_fault+0x1f/0x30
>>
>>The first trace comes from the debugging WARN and it clearly points to
>>a file fault path. __do_fault pre-charges a page in case we need to
>>do CoW (copy-on-write) for the returned page. This one falls back to
>>memcg OOM and never returns ENOMEM as I have mentioned earlier.
>>However, the fs fault handler (filemap_fault here) can fallback to
>>page_cache_read if the readahead (do_sync_mmap_readahead) fails
>>to get page to the page cache. And we can see this happening in
>>the first trace. page_cache_read then calls add_to_page_cache_lru
>>and eventually gets to add_to_page_cache_locked which calls
>>mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>>happen. This ENOMEM gets to the fault handler and kaboom.
>>
>
>Hmm. do we need to increase the "limit" virtually at memcg oom until
>the oom-killed process dies ?

Here is my naive idea...
==
From 1a46318cf89e7df94bd4844f29105b61dacf335b Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Fri, 8 Feb 2013 10:43:52 +0900
Subject: [PATCH] [Don't Apply][PATCH] memcg relax resource at OOM situation.

When an OOM happens, a task is killed and resources will be freed.

A problem here is that a task, which is oom-killed, may wait for
some other resource in which memory resource is required. Some thread
waits for free memory may holds some mutex and oom-killed process
wait for the mutex.

To avoid this, relaxing charged memory by giving virtual resource
can be a help. The system can get back it at uncharge().
This is a sample native implementation.

Signed-off-by: KAMEZAWA Hiroyuki
---
 mm/memcontrol.c | 79 ++-
 1 file changed, 73 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..4dea49a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -301,6 +301,9 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
+	/* extra resource at emergency situation */
+	unsigned long	loan;
+	spinlock_t	loan_lock;
 	/* protect arrays of thresholds */
 	struct mutex thresholds_lock;
@@ -2034,6 +2037,61 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	mem_cgroup_iter_break(root_memcg, victim);
 	return total;
 }
+/*
+ * When a memcg is in OOM situation, this lack of resource may cause deadlock
+ * because of complicated lock dependency(i_mutex...). To avoid that, we
+ * need extra resource or avoid charging.
+ *
+ * A memcg can request resource in an emergency state. We call it as loan.
+ * A memcg will return a loan when it does uncharge resource. We disallow
+ * double-loan and moving task to other groups until the loan is fully
+ * returned.
+ *
+ * Note: the problem here is that we cannot know what amount resouce should
+ * be necessary to exiting an emergency state.
+ */
+#define LOAN_MAX		(2 * 1024 * 1024)
+
+static void mem_cgroup_make_loan(struct mem_cgroup *memcg)
+{
+	u64 usage;
+	unsigned long amount;
+
+	amount = LOAN_MAX;
+
+	usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	if (amount > usage /2 )
+		amount = usage / 2;
+	spin_lock(&memcg->loan_lock);
+	if (memcg->loan) {
+		spin_unlock(&memcg->loan_lock);
+		return;
+	}
+
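The accounting behind the loan patch above can be followed more easily
in a single-threaded userspace model. The sketch below is an
illustration only: struct loan_memcg, toy_make_loan() and toy_uncharge()
are hypothetical stand-ins for the patch's mem_cgroup_make_loan() and
mem_cgroup_may_return_loan(), the spinlock and the memsw counter are
left out, and the quoted diff is truncated here, so the model does not
claim to match the parts that are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Same cap as the LOAN_MAX in the quoted patch: 2MB. */
#define LOAN_MAX ((size_t)2 * 1024 * 1024)

/* Toy memcg: bytes accounted against the limit plus outstanding loan. */
struct loan_memcg {
	size_t usage;
	size_t loan;	/* 0 when no loan is outstanding */
};

/*
 * Grant a loan of at most LOAN_MAX or half of the current usage by
 * lowering the accounted usage. A second loan is refused until the
 * first is repaid. Returns the amount granted.
 */
static size_t toy_make_loan(struct loan_memcg *cg)
{
	size_t amount = LOAN_MAX;

	if (cg->loan)
		return 0;		/* no double loan */
	if (amount > cg->usage / 2)
		amount = cg->usage / 2;
	cg->loan = amount;
	cg->usage -= amount;
	return amount;
}

/*
 * Uncharge val bytes. The loan is repaid first, and only the remainder
 * really lowers usage. Returns the bytes removed from usage.
 */
static size_t toy_uncharge(struct loan_memcg *cg, size_t val)
{
	size_t repay = val < cg->loan ? val : cg->loan;

	assert(val - repay <= cg->usage);
	cg->loan -= repay;
	val -= repay;
	cg->usage -= val;
	return val;
}
```

With 10MB charged, a loan lowers the accounted usage to 8MB. A later
3MB uncharge first repays the 2MB loan and then drops usage by the
remaining 1MB, to 7MB. The open problem noted in the patch comment stays
open in the model too: nothing tells us whether 2MB is enough to get out
of the emergency state.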
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Thu 07-02-13 20:01:45, KAMEZAWA Hiroyuki wrote:
> (2013/02/06 23:01), Michal Hocko wrote:
> >On Wed 06-02-13 02:17:21, azurIt wrote:
> >>>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
> >>>mentioned in a follow up email. Here is the full patch:
> >>
> >>
> >>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
> >>http://www.watchdog.sk/lkml/oom_mysqld6
> >
> >[...]
> >WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
> >Hardware name: S5000VSA
> >gfp_mask:4304 nr_pages:1 oom:0 ret:2
> >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
> >Call Trace:
> >  [] warn_slowpath_common+0x7a/0xb0
> >  [] warn_slowpath_fmt+0x46/0x50
> >  [] ? mem_cgroup_margin+0x73/0xa0
> >  [] T.1149+0x2d9/0x610
> >  [] ? blk_finish_plug+0x18/0x50
> >  [] mem_cgroup_cache_charge+0xc4/0xf0
> >  [] add_to_page_cache_locked+0x4f/0x140
> >  [] add_to_page_cache_lru+0x22/0x50
> >  [] filemap_fault+0x252/0x4f0
> >  [] __do_fault+0x78/0x5a0
> >  [] handle_pte_fault+0x84/0x940
> >  [] ? vma_prio_tree_insert+0x30/0x50
> >  [] ? vma_link+0x88/0xe0
> >  [] handle_mm_fault+0x138/0x260
> >  [] do_page_fault+0x13d/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >---[ end trace 8817670349022007 ]---
> >apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
> >apache2 cpuset=uid mems_allowed=0
> >Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
> >Call Trace:
> >  [] dump_header+0x7e/0x1e0
> >  [] ? find_lock_task_mm+0x2f/0x70
> >  [] oom_kill_process+0x85/0x2a0
> >  [] out_of_memory+0xe5/0x200
> >  [] pagefault_out_of_memory+0xbd/0x110
> >  [] mm_fault_error+0xb6/0x1a0
> >  [] do_page_fault+0x3ee/0x460
> >  [] ? do_mmap_pgoff+0x3dc/0x430
> >  [] page_fault+0x1f/0x30
> >
> >The first trace comes from the debugging WARN and it clearly points to
> >a file fault path. __do_fault pre-charges a page in case we need to
> >do CoW (copy-on-write) for the returned page. This one falls back to
> >memcg OOM and never returns ENOMEM as I have mentioned earlier.
> >However, the fs fault handler (filemap_fault here) can fallback to
> >page_cache_read if the readahead (do_sync_mmap_readahead) fails
> >to get page to the page cache. And we can see this happening in
> >the first trace. page_cache_read then calls add_to_page_cache_lru
> >and eventually gets to add_to_page_cache_locked which calls
> >mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
> >happen. This ENOMEM gets to the fault handler and kaboom.
> >
>
> Hmm. do we need to increase the "limit" virtually at memcg oom until
> the oom-killed process dies ? It may be doable by increasing stock->cache
> of each cpu. I think kernel can offer extra virtual charge up to
> oom-killed process's memory usage.

If we can guarantee that the overflow charges do not exceed the memory
usage of the killed process then this would work. The question is, how
do we find out how much we can overflow. immigrate_on_move will play
some role as well as the amount of the shared memory. I am afraid this
would get too complex. Nevertheless the idea is nice.
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
(2013/02/06 23:01), Michal Hocko wrote:
>On Wed 06-02-13 02:17:21, azurIt wrote:
>>>5-memcg-fix-1.patch is not complete. It doesn't contain the follow-up I
>>>mentioned in a follow up email. Here is the full patch:
>>
>>
>>Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]:
>>http://www.watchdog.sk/lkml/oom_mysqld6
>
>[...]
>WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610()
>Hardware name: S5000VSA
>gfp_mask:4304 nr_pages:1 oom:0 ret:2
>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>Call Trace:
>  [] warn_slowpath_common+0x7a/0xb0
>  [] warn_slowpath_fmt+0x46/0x50
>  [] ? mem_cgroup_margin+0x73/0xa0
>  [] T.1149+0x2d9/0x610
>  [] ? blk_finish_plug+0x18/0x50
>  [] mem_cgroup_cache_charge+0xc4/0xf0
>  [] add_to_page_cache_locked+0x4f/0x140
>  [] add_to_page_cache_lru+0x22/0x50
>  [] filemap_fault+0x252/0x4f0
>  [] __do_fault+0x78/0x5a0
>  [] handle_pte_fault+0x84/0x940
>  [] ? vma_prio_tree_insert+0x30/0x50
>  [] ? vma_link+0x88/0xe0
>  [] handle_mm_fault+0x138/0x260
>  [] do_page_fault+0x13d/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
>---[ end trace 8817670349022007 ]---
>apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
>apache2 cpuset=uid mems_allowed=0
>Pid: 3545, comm: apache2 Tainted: G W 3.2.37-grsec #1
>Call Trace:
>  [] dump_header+0x7e/0x1e0
>  [] ? find_lock_task_mm+0x2f/0x70
>  [] oom_kill_process+0x85/0x2a0
>  [] out_of_memory+0xe5/0x200
>  [] pagefault_out_of_memory+0xbd/0x110
>  [] mm_fault_error+0xb6/0x1a0
>  [] do_page_fault+0x3ee/0x460
>  [] ? do_mmap_pgoff+0x3dc/0x430
>  [] page_fault+0x1f/0x30
>
>The first trace comes from the debugging WARN and it clearly points to
>a file fault path. __do_fault pre-charges a page in case we need to
>do CoW (copy-on-write) for the returned page. This one falls back to
>memcg OOM and never returns ENOMEM as I have mentioned earlier.
>However, the fs fault handler (filemap_fault here) can fallback to
>page_cache_read if the readahead (do_sync_mmap_readahead) fails
>to get page to the page cache. And we can see this happening in
>the first trace. page_cache_read then calls add_to_page_cache_lru
>and eventually gets to add_to_page_cache_locked which calls
>mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should
>happen. This ENOMEM gets to the fault handler and kaboom.
>

Hmm. do we need to increase the "limit" virtually at memcg oom until
the oom-killed process dies ? It may be doable by increasing stock->cache
of each cpu. I think kernel can offer extra virtual charge up to
oom-killed process's memory usage.

Thanks,
-Kame
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Wed 06-02-13 15:01:19, Michal Hocko wrote: > On Wed 06-02-13 02:17:21, azurIt wrote: > > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > >mentioned in a follow up email. Here is the full patch: > > > > > > Here is the log where OOM, again, killed MySQL server [search for > > "(mysqld)"]: > > http://www.watchdog.sk/lkml/oom_mysqld6 > > [...] > WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() > Hardware name: S5000VSA > gfp_mask:4304 nr_pages:1 oom:0 ret:2 > Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1 > Call Trace: > [] warn_slowpath_common+0x7a/0xb0 > [] warn_slowpath_fmt+0x46/0x50 > [] ? mem_cgroup_margin+0x73/0xa0 > [] T.1149+0x2d9/0x610 > [] ? blk_finish_plug+0x18/0x50 > [] mem_cgroup_cache_charge+0xc4/0xf0 > [] add_to_page_cache_locked+0x4f/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] filemap_fault+0x252/0x4f0 > [] __do_fault+0x78/0x5a0 > [] handle_pte_fault+0x84/0x940 > [] ? vma_prio_tree_insert+0x30/0x50 > [] ? vma_link+0x88/0xe0 > [] handle_mm_fault+0x138/0x260 > [] do_page_fault+0x13d/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > ---[ end trace 8817670349022007 ]--- > apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 > apache2 cpuset=uid mems_allowed=0 > Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1 > Call Trace: > [] dump_header+0x7e/0x1e0 > [] ? find_lock_task_mm+0x2f/0x70 > [] oom_kill_process+0x85/0x2a0 > [] out_of_memory+0xe5/0x200 > [] pagefault_out_of_memory+0xbd/0x110 > [] mm_fault_error+0xb6/0x1a0 > [] do_page_fault+0x3ee/0x460 > [] ? do_mmap_pgoff+0x3dc/0x430 > [] page_fault+0x1f/0x30 > > The first trace comes from the debugging WARN and it clearly points to > a file fault path. __do_fault pre-charges a page in case we need to > do CoW (copy-on-write) for the returned page. This one falls back to > memcg OOM and never returns ENOMEM as I have mentioned earlier. 
> However, the fs fault handler (filemap_fault here) can fallback to > page_cache_read if the readahead (do_sync_mmap_readahead) fails > to get page to the page cache. And we can see this happening in > the first trace. page_cache_read then calls add_to_page_cache_lru > and eventually gets to add_to_page_cache_locked which calls > mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should > happen. This ENOMEM gets to the fault handler and kaboom. > > So the fix is really much more complex than I thought. Although > add_to_page_cache_locked sounded like a good place it turned out to be > not in fact. > > We need something more clever apparently. One way would be not misusing > __GFP_NORETRY for GFP_MEMCG_NO_OOM and give it a real flag. We have 32 > bits for those flags in gfp_t so there should be some room there. > Or we could do this per task flag, same we do for NO_IO in the current > -mm tree. > The latter one seems easier wrt. gfp_mask passing horror - e.g. > __generic_file_aio_write doesn't pass flags and it can be called from > unlocked contexts as well. Ouch, the PF_ flag space seems to be drained already: task_struct::flags is just an unsigned int, so there is only one bit left. I am not sure this is the best use for it. This will be a real pain! -- Michal Hocko SUSE Labs
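The per-task flag idea discussed above can be modeled in user space. This is only a sketch under stated assumptions: `struct task_model`, `PF_MODEL_MEMCG_NO_OOM`, `memcg_no_oom_save()`, `memcg_no_oom_restore()` and `task_may_memcg_oom()` are hypothetical names invented for illustration and do not exist in the kernel. The save/restore shape is meant to show how a caller could mark a region as "no memcg OOM" without threading a gfp_mask through every function, and how nested regions would stay correct.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of task_struct. */
struct task_model {
	unsigned int flags;
};

/* Hypothetical flag bit; the thread notes real PF_ bits are nearly used up. */
#define PF_MODEL_MEMCG_NO_OOM 0x1u

/* Enter a "no memcg OOM" region and return the previous state of the bit. */
static unsigned int memcg_no_oom_save(struct task_model *t)
{
	unsigned int old = t->flags & PF_MODEL_MEMCG_NO_OOM;

	t->flags |= PF_MODEL_MEMCG_NO_OOM;
	return old;
}

/* Leave the region: put the bit back to whatever it was before the save. */
static void memcg_no_oom_restore(struct task_model *t, unsigned int old)
{
	t->flags = (t->flags & ~PF_MODEL_MEMCG_NO_OOM) | old;
}

/* What a charge path would consult instead of a gfp flag. */
static bool task_may_memcg_oom(const struct task_model *t)
{
	return !(t->flags & PF_MODEL_MEMCG_NO_OOM);
}
```

Returning the old bit from the save helper is what makes nesting safe: an inner restore leaves the flag set if an outer region is still active.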
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Wed 06-02-13 02:17:21, azurIt wrote: > >5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >mentioned in a follow up email. Here is the full patch: > > > Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: > http://www.watchdog.sk/lkml/oom_mysqld6 [...] WARNING: at mm/memcontrol.c:2409 T.1149+0x2d9/0x610() Hardware name: S5000VSA gfp_mask:4304 nr_pages:1 oom:0 ret:2 Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1 Call Trace: [] warn_slowpath_common+0x7a/0xb0 [] warn_slowpath_fmt+0x46/0x50 [] ? mem_cgroup_margin+0x73/0xa0 [] T.1149+0x2d9/0x610 [] ? blk_finish_plug+0x18/0x50 [] mem_cgroup_cache_charge+0xc4/0xf0 [] add_to_page_cache_locked+0x4f/0x140 [] add_to_page_cache_lru+0x22/0x50 [] filemap_fault+0x252/0x4f0 [] __do_fault+0x78/0x5a0 [] handle_pte_fault+0x84/0x940 [] ? vma_prio_tree_insert+0x30/0x50 [] ? vma_link+0x88/0xe0 [] handle_mm_fault+0x138/0x260 [] do_page_fault+0x13d/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 ---[ end trace 8817670349022007 ]--- apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0 apache2 cpuset=uid mems_allowed=0 Pid: 3545, comm: apache2 Tainted: GW3.2.37-grsec #1 Call Trace: [] dump_header+0x7e/0x1e0 [] ? find_lock_task_mm+0x2f/0x70 [] oom_kill_process+0x85/0x2a0 [] out_of_memory+0xe5/0x200 [] pagefault_out_of_memory+0xbd/0x110 [] mm_fault_error+0xb6/0x1a0 [] do_page_fault+0x3ee/0x460 [] ? do_mmap_pgoff+0x3dc/0x430 [] page_fault+0x1f/0x30 The first trace comes from the debugging WARN and it clearly points to a file fault path. __do_fault pre-charges a page in case we need to do CoW (copy-on-write) for the returned page. This one falls back to memcg OOM and never returns ENOMEM as I have mentioned earlier. However, the fs fault handler (filemap_fault here) can fallback to page_cache_read if the readahead (do_sync_mmap_readahead) fails to get page to the page cache. And we can see this happening in the first trace. 
page_cache_read then calls add_to_page_cache_lru and eventually gets to add_to_page_cache_locked which calls mem_cgroup_cache_charge_no_oom so we will get ENOMEM if oom should happen. This ENOMEM gets to the fault handler and kaboom. So the fix is really much more complex than I thought. Although add_to_page_cache_locked sounded like a good place, it turned out not to be one. We need something more clever, apparently. One way would be not misusing __GFP_NORETRY for GFP_MEMCG_NO_OOM and giving it a real flag. We have 32 bits for those flags in gfp_t so there should be some room there. Or we could do this with a per-task flag, the same as we do for NO_IO in the current -mm tree. The latter one seems easier wrt. gfp_mask passing horror - e.g. __generic_file_aio_write doesn't pass flags and it can be called from unlocked contexts as well. I have to think about it some more. -- Michal Hocko SUSE Labs
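The debugging WARN in the trace prints gfp_mask:4304 in decimal, which is easier to read once split into flag names. The following is a minimal sketch, assuming the 3.2-era bit values from include/linux/gfp.h (0x10 for __GFP_WAIT, 0x20 for __GFP_HIGH, 0x40 for __GFP_IO, 0x80 for __GFP_FS, 0x1000 for __GFP_NORETRY); check them against the tree actually in use. The `decode_gfp()` helper and its table are hypothetical and cover only these few bits.

```c
#include <stddef.h>
#include <stdio.h>

struct gfp_name {
	unsigned int bit;
	const char *name;
};

/* Assumed 3.2-era values; only the bits relevant to this thread. */
static const struct gfp_name gfp_names[] = {
	{ 0x10u,   "__GFP_WAIT" },
	{ 0x20u,   "__GFP_HIGH" },
	{ 0x40u,   "__GFP_IO" },
	{ 0x80u,   "__GFP_FS" },
	{ 0x1000u, "__GFP_NORETRY" },
};

/*
 * Write the names of the known flags set in mask into buf, separated by '|'.
 * Returns the bits that were not named (unknown bits, or bits that did not
 * fit into buf).
 */
static unsigned int decode_gfp(unsigned int mask, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return mask;
	buf[0] = '\0';

	for (i = 0; i < sizeof(gfp_names) / sizeof(gfp_names[0]); i++) {
		int n;

		if (!(mask & gfp_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", gfp_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
		mask &= ~gfp_names[i].bit;
	}
	return mask;
}
```

Under those assumed values, 4304 is 0x10d0, i.e. __GFP_WAIT, __GFP_IO and __GFP_FS (together GFP_KERNEL) plus __GFP_NORETRY. That matches the patch's GFP_MEMCG_NO_OOM alias and the oom:0 printed in the same WARN line.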
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >mentioned in a follow up email. Here is the full patch: Here is the log where OOM, again, killed MySQL server [search for "(mysqld)"]: http://www.watchdog.sk/lkml/oom_mysqld6 azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue 05-02-13 10:09:57, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 08:48:23, Greg Thelen wrote: > >> On Tue, Feb 05 2013, Michal Hocko wrote: > >> > >> > On Tue 05-02-13 15:49:47, azurIt wrote: > >> > [...] > >> >> Just to be sure - am i supposed to apply this two patches? > >> >> http://watchdog.sk/lkml/patches/ > >> > > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > >> > mentioned in a follow up email. Here is the full patch: > >> > --- > >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > >> > From: Michal Hocko > >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 > >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > >> > > >> > memcg oom killer might deadlock if the process which falls down to > >> > mem_cgroup_handle_oom holds a lock which prevents other task to > >> > terminate because it is blocked on the very same lock. > >> > This can happen when a write system call needs to allocate a page but > >> > the allocation hits the memcg hard limit and there is nothing to reclaim > >> > (e.g. there is no swap or swap limit is hit as well and all cache pages > >> > have been reclaimed already) and the process selected by memcg OOM > >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> >> > > >> > Process A > >> > [] do_truncate+0x58/0xa0 # takes i_mutex > >> > [] do_last+0x250/0xa30 > >> > [] path_openat+0xd7/0x440 > >> > [] do_filp_open+0x49/0xa0 > >> > [] do_sys_open+0x106/0x240 > >> > [] sys_open+0x20/0x30 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0x > >> > > >> > Process B > >> > [] mem_cgroup_handle_oom+0x241/0x3b0 > >> > [] T.1146+0x5ab/0x5c0 > >> > [] mem_cgroup_cache_charge+0xbe/0xe0 > >> > [] add_to_page_cache_locked+0x4c/0x140 > >> > [] add_to_page_cache_lru+0x22/0x50 > >> > [] grab_cache_page_write_begin+0x8b/0xe0 > >> > [] ext3_write_begin+0x88/0x270 > >> > [] generic_file_buffered_write+0x116/0x290 > >> > [] __generic_file_aio_write+0x27c/0x480 > >> > [] generic_file_aio_write+0x76/0xf0 # takes > >> > ->i_mutex > >> > [] do_sync_write+0xea/0x130 > >> > [] vfs_write+0xf3/0x1f0 > >> > [] sys_write+0x51/0x90 > >> > [] system_call_fastpath+0x18/0x1d > >> > [] 0x > >> > >> It looks like grab_cache_page_write_begin() passes __GFP_FS into > >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > >> think that this deadlock is also possible in the page allocator even > >> before getting to add_to_page_cache_lru. no? > > > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > > and it shouldn't be called from the pageout path so __page_cache_alloc > > should be safe. > > I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. > My concern is that __page_cache_alloc() will invoke the oom killer and > select a victim which wants i_mutex. This victim will deadlock because > the oom killer caller already holds i_mutex. That would be true for the memcg oom because that one is blocking but the global oom just puts the allocator into sleep for a while and then the allocator should back off eventually (unless this is NOFAIL allocation). I would need to look closer whether this is really the case - I haven't seen that allocator code path for a while... 
> The wild accusation I am making is that anyone who invokes the oom > killer and waits on the victim to die is essentially grabbing all of > the locks that any of the oom killer victims may grab (e.g. i_mutex). True. > To avoid deadlock the oom killer can only be called while holding > no locks that the oom victim demands. I think some locks are grabbed > in a way that allows the lock request to fail if the task has a fatal > signal pending, so they are safe. But any lock acquisitions that > cannot fail (e.g. mutex_lock) will deadlock with the oom killing > process. So the oom killing process cannot hold any such locks which > the victim will attempt to grab. Hopefully I'm missing something. Agreed. -- Michal Hocko SUSE Labs
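The distinction drawn above, between lock requests that may fail when a fatal signal is pending and acquisitions that cannot fail, can be illustrated with a small single-threaded model. This is a sketch only: `struct task_model`, `struct lock_model`, `LOCK_WOULD_SLEEP` and both helpers are hypothetical, and no real sleeping or signalling happens. The killable variant is loosely modeled on the kernel's mutex_lock_killable(), which returns -EINTR when interrupted by a fatal signal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task: only tracks whether SIGKILL-like state is pending. */
struct task_model {
	bool fatal_signal_pending;
};

/* Hypothetical lock: NULL owner means unlocked. */
struct lock_model {
	const struct task_model *owner;
};

/* Model result meaning "the caller would block until the owner unlocks". */
enum { LOCK_WOULD_SLEEP = 1 };

/* Acquisition that cannot fail: a contended caller just keeps waiting. */
static int lock_uninterruptible(struct lock_model *l, const struct task_model *t)
{
	if (l->owner)
		return LOCK_WOULD_SLEEP;
	l->owner = t;
	return 0;
}

/* Killable acquisition: a contended caller with a fatal signal gives up. */
static int lock_killable(struct lock_model *l, const struct task_model *t)
{
	if (!l->owner) {
		l->owner = t;
		return 0;
	}
	if (t->fatal_signal_pending)
		return -EINTR;
	return LOCK_WOULD_SLEEP;
}
```

In this model an OOM victim contending for a lock held by the task that invoked the OOM killer stays stuck with the first helper, while the second lets it bail out and exit, which is the property the quoted paragraph relies on.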
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 08:48:23, Greg Thelen wrote: >> On Tue, Feb 05 2013, Michal Hocko wrote: >> >> > On Tue 05-02-13 15:49:47, azurIt wrote: >> > [...] >> >> Just to be sure - am i supposed to apply this two patches? >> >> http://watchdog.sk/lkml/patches/ >> > >> > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >> > mentioned in a follow up email. Here is the full patch: >> > --- >> > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 >> > From: Michal Hocko >> > Date: Mon, 26 Nov 2012 11:47:57 +0100 >> > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked >> > >> > memcg oom killer might deadlock if the process which falls down to >> > mem_cgroup_handle_oom holds a lock which prevents other task to >> > terminate because it is blocked on the very same lock. >> > This can happen when a write system call needs to allocate a page but >> > the allocation hits the memcg hard limit and there is nothing to reclaim >> > (e.g. there is no swap or swap limit is hit as well and all cache pages >> > have been reclaimed already) and the process selected by memcg OOM >> > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
>> > >> > Process A >> > [] do_truncate+0x58/0xa0 # takes i_mutex >> > [] do_last+0x250/0xa30 >> > [] path_openat+0xd7/0x440 >> > [] do_filp_open+0x49/0xa0 >> > [] do_sys_open+0x106/0x240 >> > [] sys_open+0x20/0x30 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0x >> > >> > Process B >> > [] mem_cgroup_handle_oom+0x241/0x3b0 >> > [] T.1146+0x5ab/0x5c0 >> > [] mem_cgroup_cache_charge+0xbe/0xe0 >> > [] add_to_page_cache_locked+0x4c/0x140 >> > [] add_to_page_cache_lru+0x22/0x50 >> > [] grab_cache_page_write_begin+0x8b/0xe0 >> > [] ext3_write_begin+0x88/0x270 >> > [] generic_file_buffered_write+0x116/0x290 >> > [] __generic_file_aio_write+0x27c/0x480 >> > [] generic_file_aio_write+0x76/0xf0 # takes >> > ->i_mutex >> > [] do_sync_write+0xea/0x130 >> > [] vfs_write+0xf3/0x1f0 >> > [] sys_write+0x51/0x90 >> > [] system_call_fastpath+0x18/0x1d >> > [] 0x >> >> It looks like grab_cache_page_write_begin() passes __GFP_FS into >> __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me >> think that this deadlock is also possible in the page allocator even >> before getting to add_to_page_cache_lru. no? > > I am not that familiar with VFS but i_mutex is a high level lock AFAIR > and it shouldn't be called from the pageout path so __page_cache_alloc > should be safe. I wasn't clear, sorry. My concern is not that pageout() grabs i_mutex. My concern is that __page_cache_alloc() will invoke the oom killer and select a victim which wants i_mutex. This victim will deadlock because the oom killer caller already holds i_mutex. The wild accusation I am making is that anyone who invokes the oom killer and waits on the victim to die is essentially grabbing all of the locks that any of the oom killer victims may grab (e.g. i_mutex). To avoid deadlock the oom killer can only be called is while holding no locks that the oom victim demands. I think some locks are grabbed in a way that allows the lock request to fail if the task has a fatal signal pending, so they are safe. 
But any lock acquisitions that cannot fail (e.g. mutex_lock) will deadlock with the oom killing process. So the oom killing process cannot hold any such locks which the victim will attempt to grab. Hopefully I'm missing something.
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue 05-02-13 08:48:23, Greg Thelen wrote: > On Tue, Feb 05 2013, Michal Hocko wrote: > > > On Tue 05-02-13 15:49:47, azurIt wrote: > > [...] > >> Just to be sure - am i supposed to apply this two patches? > >> http://watchdog.sk/lkml/patches/ > > > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > > mentioned in a follow up email. Here is the full patch: > > --- > > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Mon, 26 Nov 2012 11:47:57 +0100 > > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > > > memcg oom killer might deadlock if the process which falls down to > > mem_cgroup_handle_oom holds a lock which prevents other task to > > terminate because it is blocked on the very same lock. > > This can happen when a write system call needs to allocate a page but > > the allocation hits the memcg hard limit and there is nothing to reclaim > > (e.g. there is no swap or swap limit is hit as well and all cache pages > > have been reclaimed already) and the process selected by memcg OOM > > killer is blocked on i_mutex on the same inode (e.g. truncate it). 
> > > > Process A > > [] do_truncate+0x58/0xa0 # takes i_mutex > > [] do_last+0x250/0xa30 > > [] path_openat+0xd7/0x440 > > [] do_filp_open+0x49/0xa0 > > [] do_sys_open+0x106/0x240 > > [] sys_open+0x20/0x30 > > [] system_call_fastpath+0x18/0x1d > > [] 0x > > > > Process B > > [] mem_cgroup_handle_oom+0x241/0x3b0 > > [] T.1146+0x5ab/0x5c0 > > [] mem_cgroup_cache_charge+0xbe/0xe0 > > [] add_to_page_cache_locked+0x4c/0x140 > > [] add_to_page_cache_lru+0x22/0x50 > > [] grab_cache_page_write_begin+0x8b/0xe0 > > [] ext3_write_begin+0x88/0x270 > > [] generic_file_buffered_write+0x116/0x290 > > [] __generic_file_aio_write+0x27c/0x480 > > [] generic_file_aio_write+0x76/0xf0 # takes > > ->i_mutex > > [] do_sync_write+0xea/0x130 > > [] vfs_write+0xf3/0x1f0 > > [] sys_write+0x51/0x90 > > [] system_call_fastpath+0x18/0x1d > > [] 0x > > It looks like grab_cache_page_write_begin() passes __GFP_FS into > __page_cache_alloc() and mem_cgroup_cache_charge(). Which makes me > think that this deadlock is also possible in the page allocator even > before getting to add_to_page_cache_lru. no? I am not that familiar with VFS but i_mutex is a high level lock AFAIR and it shouldn't be called from the pageout path so __page_cache_alloc should be safe. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue, Feb 05 2013, Michal Hocko wrote: > On Tue 05-02-13 15:49:47, azurIt wrote: > [...] >> Just to be sure - am i supposed to apply this two patches? >> http://watchdog.sk/lkml/patches/ > > 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I > mentioned in a follow up email. Here is the full patch: > --- > From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Mon, 26 Nov 2012 11:47:57 +0100 > Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked > > memcg oom killer might deadlock if the process which falls down to > mem_cgroup_handle_oom holds a lock which prevents other task to > terminate because it is blocked on the very same lock. > This can happen when a write system call needs to allocate a page but > the allocation hits the memcg hard limit and there is nothing to reclaim > (e.g. there is no swap or swap limit is hit as well and all cache pages > have been reclaimed already) and the process selected by memcg OOM > killer is blocked on i_mutex on the same inode (e.g. truncate it). > > Process A > [] do_truncate+0x58/0xa0# takes i_mutex > [] do_last+0x250/0xa30 > [] path_openat+0xd7/0x440 > [] do_filp_open+0x49/0xa0 > [] do_sys_open+0x106/0x240 > [] sys_open+0x20/0x30 > [] system_call_fastpath+0x18/0x1d > [] 0x > > Process B > [] mem_cgroup_handle_oom+0x241/0x3b0 > [] T.1146+0x5ab/0x5c0 > [] mem_cgroup_cache_charge+0xbe/0xe0 > [] add_to_page_cache_locked+0x4c/0x140 > [] add_to_page_cache_lru+0x22/0x50 > [] grab_cache_page_write_begin+0x8b/0xe0 > [] ext3_write_begin+0x88/0x270 > [] generic_file_buffered_write+0x116/0x290 > [] __generic_file_aio_write+0x27c/0x480 > [] generic_file_aio_write+0x76/0xf0 # takes > ->i_mutex > [] do_sync_write+0xea/0x130 > [] vfs_write+0xf3/0x1f0 > [] sys_write+0x51/0x90 > [] system_call_fastpath+0x18/0x1d > [] 0x It looks like grab_cache_page_write_begin() passes __GFP_FS into __page_cache_alloc() and mem_cgroup_cache_charge(). 
Which makes me think that this deadlock is also possible in the page allocator even before getting to add_to_page_cache_lru. no? Can callers holding fs resources (e.g. i_mutex) pass __GFP_FS into the page allocator? If __GFP_FS was avoided, then I think memcg user page charging would need a !__GFP_FS check to avoid invoking oom killer, but at least then we'd avoid both deadlocks and cover both page allocation and memcg page charging in similar fashion. Example from memcg_charge_kmem: may_oom = (gfp & __GFP_FS) && !(gfp & __GFP_NORETRY);
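The may_oom predicate quoted from memcg_charge_kmem can be tried out in user space. A minimal sketch, assuming the 3.2-era flag values (0x10 for __GFP_WAIT, 0x40 for __GFP_IO, 0x80 for __GFP_FS, 0x1000 for __GFP_NORETRY); the `MODEL_` macros and the `charge_may_oom()` helper are hypothetical names used only to illustrate which masks would be allowed to reach the OOM killer.

```c
#include <stdbool.h>

/* Assumed 3.2-era values from include/linux/gfp.h; verify against your tree. */
#define MODEL_GFP_WAIT    0x10u
#define MODEL_GFP_IO      0x40u
#define MODEL_GFP_FS      0x80u
#define MODEL_GFP_NORETRY 0x1000u

#define MODEL_GFP_KERNEL  (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    (MODEL_GFP_WAIT | MODEL_GFP_IO)

/* Same test as the quoted line: OOM only with __GFP_FS and without NORETRY. */
static bool charge_may_oom(unsigned int gfp)
{
	return (gfp & MODEL_GFP_FS) && !(gfp & MODEL_GFP_NORETRY);
}
```

With this predicate a GFP_KERNEL charge may trigger the OOM killer, while a GFP_NOFS charge or one carrying __GFP_NORETRY may not. A caller holding i_mutex would therefore have to use a mask without __GFP_FS to stay out of the OOM path, which is the trade-off the message raises.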
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I >mentioned in a follow up email. Oh, it wasn't complete? I used it in my last test. Sorry, I'm a little confused by all those patches. I will try it tonight and report back.
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue 05-02-13 15:49:47, azurIt wrote: [...] > I have another old problem which is maybe also related to this. I > wasn't connecting it with this before but now i'm not sure. Two of our > servers, which are affected by this cgroup problem, are also randomly > freezing completely (few times per month). These are the symptoms: > - servers are answering to ping > - it is possible to connect via SSH but connection is frozen after > sending the password > - it is possible to login via console but it is frozen after typing > the login > These symptoms are very similar to HDD problems or HDD overload (but > there is no overload for sure). The only way to fix it is, probably, > hard rebooting the server (didn't find any other way). What do you > think? Can this be related? This is hard to tell without further information. > Maybe HDDs are locked in the similar way the cgroups are - we already > found out that cgroup freezing is related also to HDD activity. Maybe > there is a little chance that the whole HDD subsystem ends in > deadlock? "HDD subsystem", whatever that means, cannot be blocked by memcg being stuck. Access to some files might be an issue because locks on them could be held, but I do not see any other relation. I would start by checking the HW and by removing elements that could contribute, i.e. try to nail it down to the minimum set which reproduces the issue. I cannot help you much with that, I am afraid. -- Michal Hocko SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue 05-02-13 15:49:47, azurIt wrote: [...] > Just to be sure - am i supposed to apply this two patches? > http://watchdog.sk/lkml/patches/ 5-memcg-fix-1.patch is not complete. It doesn't contain the folloup I mentioned in a follow up email. Here is the full patch: --- >From f2bf8437d5b9bb38a95a432bf39f32c584955171 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 26 Nov 2012 11:47:57 +0100 Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked memcg oom killer might deadlock if the process which falls down to mem_cgroup_handle_oom holds a lock which prevents other task to terminate because it is blocked on the very same lock. This can happen when a write system call needs to allocate a page but the allocation hits the memcg hard limit and there is nothing to reclaim (e.g. there is no swap or swap limit is hit as well and all cache pages have been reclaimed already) and the process selected by memcg OOM killer is blocked on i_mutex on the same inode (e.g. truncate it). Process A [] do_truncate+0x58/0xa0 # takes i_mutex [] do_last+0x250/0xa30 [] path_openat+0xd7/0x440 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x106/0x240 [] sys_open+0x20/0x30 [] system_call_fastpath+0x18/0x1d [] 0x Process B [] mem_cgroup_handle_oom+0x241/0x3b0 [] T.1146+0x5ab/0x5c0 [] mem_cgroup_cache_charge+0xbe/0xe0 [] add_to_page_cache_locked+0x4c/0x140 [] add_to_page_cache_lru+0x22/0x50 [] grab_cache_page_write_begin+0x8b/0xe0 [] ext3_write_begin+0x88/0x270 [] generic_file_buffered_write+0x116/0x290 [] __generic_file_aio_write+0x27c/0x480 [] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex [] do_sync_write+0xea/0x130 [] vfs_write+0xf3/0x1f0 [] sys_write+0x51/0x90 [] system_call_fastpath+0x18/0x1d [] 0x This is not a hard deadlock though because administrator can still intervene and increase the limit on the group which helps the writer to finish the allocation and release the lock. 
This patch heals the problem by forbidding OOM from page cache charges (namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which then tells mem_cgroup_charge_common that OOM is not allowed for the charge. No OOM from this path, except for fixing the bug, also makes some sense as we really do not want to cause an OOM because of a page cache usage. As a possibly visible result add_to_page_cache_lru might fail more often with ENOMEM but this is to be expected if the limit is set and it is preferable to the OOM killer IMO. __GFP_NORETRY is abused for this memcg specific flag because no user accounted allocation uses this flag except for THP which has memcg oom disabled already. Reported-by: azurIt Signed-off-by: Michal Hocko --- include/linux/gfp.h|3 +++ include/linux/memcontrol.h | 13 + mm/filemap.c |8 +++- mm/memcontrol.c| 10 ++ 4 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 3a76faf..806fb54 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -146,6 +146,9 @@ struct vm_area_struct; /* 4GB DMA on some platforms */ #define GFP_DMA32 __GFP_DMA32 +/* memcg oom killer is not allowed */ +#define GFP_MEMCG_NO_OOM __GFP_NORETRY + /* Convert GFP flags to their corresponding migrate type */ static inline int allocflags_to_migratetype(gfp_t gfp_flags) { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 81572af..bf0e575 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr); extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); + +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM); +} + extern void mem_cgroup_add_lru_list(struct 
page *page, enum lru_list lru); extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru); extern void mem_cgroup_rotate_reclaimable_page(struct page *page); @@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page, return 0; } +static inline int mem_cgroup_cache_charge_no_oom(struct page *page, + struct mm_struct *mm, gfp_t gfp_mask) +{ + return 0; +} + static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr) { diff --git a/mm/filemap.c b/mm/filemap.c index 556858c..ef182a9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLoc
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Sorry, to get back to this that late but I was busy as hell since the >beginning of the year. Thank you for your time! >Has the issue repeated since then? Yes, it's happening all the time, but meanwhile I wrote a script which monitors the problem and kills the frozen processes when it occurs. But I don't like it much, it's not a solution for me :( I also noticed that the problem always affects the whole server, though not as much as the frozen cgroup. Depending on the number of frozen processes, sometimes it has almost no impact on the rest of the server and sometimes the whole server lags a lot. I have another old problem which is maybe also related to this. I wasn't connecting it with this before but now I'm not sure. Two of our servers, which are affected by this cgroup problem, are also randomly freezing completely (a few times per month). These are the symptoms: - servers are answering to ping - it is possible to connect via SSH but the connection freezes after sending the password - it is possible to login via console but it freezes after typing the login. These symptoms are very similar to HDD problems or HDD overload (but there is no overload for sure). The only way to fix it is, probably, hard rebooting the server (I didn't find any other way). What do you think? Can this be related? Maybe HDDs are locked in a similar way to the cgroups - we already found out that cgroup freezing is also related to HDD activity. Maybe there is a small chance that the whole HDD subsystem ends in deadlock? >You said you didn't apply other than the above mentioned patch. Could >you apply also debugging part of the patches I have sent? >In case you don't have it handy then it should be this one: Just to be sure - am I supposed to apply these two patches? 
http://watchdog.sk/lkml/patches/ azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 25-01-13 17:31:30, Michal Hocko wrote: > On Fri 25-01-13 16:07:23, azurIt wrote: > > Any news? Thnx! > > Sorry, but I didn't get to this one yet. Sorry, to get back to this that late but I was busy as hell since the beginning of the year. Has the issue repeated since then? You said you didn't apply other than the above mentioned patch. Could you apply also debugging part of the patches I have sent? In case you don't have it handy then it should be this one: --- >From 1623420d964e7e8bc88e2a6239563052df891bf7 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 3 Dec 2012 16:16:01 +0100 Subject: [PATCH] more debugging --- mm/huge_memory.c |6 +++--- mm/memcontrol.c |1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 470cbb4..01a11f1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag) } #endif -int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, unsigned int flags) { @@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm) return pgtable; } -static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, +static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd, @@ -883,7 +883,7 @@ out_free_pages: goto out; } -int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, +noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pmd_t *pmd, pmd_t orig_pmd) { int ret = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c8425b1..1986c65 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2397,6 +2397,7 @@ done: return 0; nomem: *ptr = NULL; + __WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, 
nr_pages, oom, ret); return -ENOMEM; bypass: *ptr = NULL; -- 1.7.10.4 -- Michal Hocko SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 25-01-13 16:07:23, azurIt wrote:
> Any news? Thnx!

Sorry, but I didn't get to this one yet.

[...]
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Any news? Thnx!

azur

__
> From: "Michal Hocko"
> To: azurIt
> Date: 30.12.2012 12:08
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>
> CC: linux-kernel@vger.kernel.org, linux...@kvack.org, "cgroups mailinglist", "KAMEZAWA Hiroyuki", "Johannes Weiner"
> [...]
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Sun 30-12-12 02:09:47, azurIt wrote:
> >which suggests that the patch is incomplete and that I am blind :/
> >mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
> >and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
> >follow-up patch on top of the one you already have (which should catch
> >all the remaining cases).
> >Sorry about that...
>
>
> This was, again, killing my MySQL server (search for "(mysqld)"):
> http://www.watchdog.sk/lkml/oom_mysqld5

grep "Kill process" oom_mysqld5
Dec 30 01:53:34 server01 kernel: [ 367.061801] Memory cgroup out of memory: Kill process 5512 (apache2) score 716 or sacrifice child
Dec 30 01:53:35 server01 kernel: [ 367.338024] Memory cgroup out of memory: Kill process 5517 (apache2) score 718 or sacrifice child
Dec 30 01:53:35 server01 kernel: [ 367.747888] Memory cgroup out of memory: Kill process 5513 (apache2) score 721 or sacrifice child
Dec 30 01:53:36 server01 kernel: [ 368.159860] Memory cgroup out of memory: Kill process 5516 (apache2) score 726 or sacrifice child
Dec 30 01:53:36 server01 kernel: [ 368.665606] Memory cgroup out of memory: Kill process 5520 (apache2) score 733 or sacrifice child
Dec 30 01:53:36 server01 kernel: [ 368.765652] Out of memory: Kill process 1778 (mysqld) score 39 or sacrifice child
Dec 30 01:53:36 server01 kernel: [ 369.101753] Memory cgroup out of memory: Kill process 5519 (apache2) score 754 or sacrifice child
Dec 30 01:53:37 server01 kernel: [ 369.464262] Memory cgroup out of memory: Kill process 5583 (apache2) score 762 or sacrifice child
Dec 30 01:53:37 server01 kernel: [ 369.465017] Out of memory: Kill process 5506 (apache2) score 18 or sacrifice child
Dec 30 01:53:37 server01 kernel: [ 369.574932] Memory cgroup out of memory: Kill process 5523 (apache2) score 759 or sacrifice child

So your mysqld has been killed by the global OOM, not memcg. But why, when
you seem to be perfectly fine regarding memory? I guess the following
backtrace is relevant:
Dec 30 01:53:36 server01 kernel: [ 368.569720] DMA: 0*4kB 1*8kB 0*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15912kB
Dec 30 01:53:36 server01 kernel: [ 368.570447] DMA32: 9*4kB 10*8kB 8*16kB 6*32kB 5*64kB 6*128kB 4*256kB 2*512kB 3*1024kB 3*2048kB 613*4096kB = 2523636kB
Dec 30 01:53:36 server01 kernel: [ 368.571175] Normal: 5*4kB 2060*8kB 4122*16kB 2550*32kB 2667*64kB 722*128kB 197*256kB 68*512kB 15*1024kB 4*2048kB 1855*4096kB = 8134036kB
Dec 30 01:53:36 server01 kernel: [ 368.571906] 308964 total pagecache pages
Dec 30 01:53:36 server01 kernel: [ 368.572023] 0 pages in swap cache
Dec 30 01:53:36 server01 kernel: [ 368.572140] Swap cache stats: add 0, delete 0, find 0/0
Dec 30 01:53:36 server01 kernel: [ 368.572260] Free swap = 0kB
Dec 30 01:53:36 server01 kernel: [ 368.572375] Total swap = 0kB
Dec 30 01:53:36 server01 kernel: [ 368.597836] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 30 01:53:36 server01 kernel: [ 368.598034] apache2 cpuset=uid mems_allowed=0
Dec 30 01:53:36 server01 kernel: [ 368.598152] Pid: 5385, comm: apache2 Not tainted 3.2.35-grsec #1
Dec 30 01:53:36 server01 kernel: [ 368.598273] Call Trace:
Dec 30 01:53:36 server01 kernel: [ 368.598396] [] dump_header+0x7e/0x1e0
Dec 30 01:53:36 server01 kernel: [ 368.598516] [] ? find_lock_task_mm+0x2f/0x70
Dec 30 01:53:36 server01 kernel: [ 368.598638] [] oom_kill_process+0x85/0x2a0
Dec 30 01:53:36 server01 kernel: [ 368.598759] [] out_of_memory+0xe5/0x200
Dec 30 01:53:36 server01 kernel: [ 368.598880] [] pagefault_out_of_memory+0xbd/0x110
Dec 30 01:53:36 server01 kernel: [ 368.599006] [] mm_fault_error+0xb6/0x1a0
Dec 30 01:53:36 server01 kernel: [ 368.599127] [] do_page_fault+0x3ee/0x460
Dec 30 01:53:36 server01 kernel: [ 368.599250] [] ? mntput+0x1f/0x30
Dec 30 01:53:36 server01 kernel: [ 368.599371] [] ? fput+0x156/0x200
Dec 30 01:53:36 server01 kernel: [ 368.599496] [] page_fault+0x1f/0x30

This would suggest that an unexpected ENOMEM leaked during the page fault
path. I do not see which one that could be because you said THP
(CONFIG_TRANSPARENT_HUGEPAGE) is disabled (and the other patch I have
mentioned in the thread should fix that issue - btw. the patch is
already scheduled for the stable tree).
__do_fault, do_anonymous_page and do_wp_page call
mem_cgroup_newpage_charge with GFP_KERNEL which means that we do memcg
OOM and never return ENOMEM. do_swap_page calls
mem_cgroup_try_charge_swapin with GFP_KERNEL as well.

I might have missed something but I will not get to look closer before
2nd January.
--
Michal Hocko
SUSE Labs
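The "perfectly fine regarding memory" remark can be checked by adding up the per-order free block counts in the three zone lines of the log above. The following is only a small illustrative sketch: the array names, `NR_ORDERS`, `BASE_BLOCK_KB` and `zone_free_kb()` are made up for this note and are not kernel interfaces; the counts are copied from the log.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 11       /* orders 0..10, i.e. 4kB .. 4096kB blocks */
#define BASE_BLOCK_KB 4UL  /* size of an order-0 block in the log */

/* Free block counts per order, copied from the DMA, DMA32 and Normal lines. */
static const unsigned long dma_free[NR_ORDERS] =
	{0, 1, 0, 1, 2, 1, 1, 0, 1, 1, 3};
static const unsigned long dma32_free[NR_ORDERS] =
	{9, 10, 8, 6, 5, 6, 4, 2, 3, 3, 613};
static const unsigned long normal_free[NR_ORDERS] =
	{5, 2060, 4122, 2550, 2667, 722, 197, 68, 15, 4, 1855};

/* Sum count * block size over all orders; the block size doubles per order. */
static unsigned long zone_free_kb(const unsigned long counts[NR_ORDERS])
{
	unsigned long total = 0;
	size_t order;

	assert(counts != NULL);
	for (order = 0; order < NR_ORDERS; order++)
		total += counts[order] * (BASE_BLOCK_KB << order);
	return total;
}
```

Calling `zone_free_kb()` on each array reproduces the totals printed at the end of the corresponding log line, and the three zones together come to 10673584 kB free. That is consistent with the observation that the global OOM killer fired even though the machine was not short of memory.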
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>which suggests that the patch is incomplete and that I am blind :/
>mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
>and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
>follow-up patch on top of the one you already have (which should catch
>all the remaining cases).
>Sorry about that...


This was, again, killing my MySQL server (search for "(mysqld)"):
http://www.watchdog.sk/lkml/oom_mysqld5
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 24-12-12 14:38:50, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometime early next year when
> >all the things settle a bit.
>
>
> Btw, i noticed one more thing when problem is happening (=when any
> cgroup is stuck), i forgot to mention it before, sorry :( . It's
> related to HDDs, something is slowing them down in a strange way. All
> services are working normally and i really cannot notice any slowness,
> the only thing which i noticed is affected is our backup software (
> www.Bacula.org ). When problem occurs at night, so it's happening when
> backup is running, backup is extremely slow and usually doesn't finish
> until i kill processes inside affected cgroup (=until i resolve the
> problem). Backup software is NOT doing big HDD bandwidth BUT it's
> doing quite huge number of disk operations (it needs to stat every
> file and directory). I believe that only speed of disk operations is
> affected and is very slow.

I would bet that this is caused by the blocked processes in the memcg oom
handler which hold i_mutex while the backup process wants to access the
same inode with an operation which requires the lock.
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 24-12-12 14:25:26, azurIt wrote:
> >OK, good to hear and fingers crossed. I will try to get back to the
> >original problem and a better solution sometime early next year when
> >all the things settle a bit.
>
>
> Michal, problem, unfortunately, happened again :( twice. When it
> happened first time (two days ago) i didn't want to believe it so i
> recompiled the kernel and booted it again to be sure i really used your
> patch. Today it happened again, here is report:
> http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Hmm, 1356352982/1507/stack says
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1147+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4f/0x140
[] add_to_page_cache_lru+0x22/0x50
[] find_or_create_page+0x73/0xb0
[] __getblk+0xea/0x2c0
[] ext3_getblk+0xeb/0x240
[] ext3_bread+0x19/0x90
[] ext3_dx_find_entry+0x83/0x1e0
[] ext3_find_entry+0x2e4/0x480
[] ext3_lookup+0x4d/0x120
[] d_alloc_and_lookup+0x45/0x90
[] do_lookup+0x278/0x390
[] path_lookupat+0xae/0x7e0
[] do_path_lookup+0x35/0xe0
[] user_path_at_empty+0x59/0xb0
[] user_path_at+0x11/0x20
[] vfs_fstatat+0x47/0x80
[] vfs_lstat+0x1e/0x20
[] sys_newlstat+0x24/0x50
[] system_call_fastpath+0x18/0x1d
[] 0x

which suggests that the patch is incomplete and that I am blind :/
mem_cgroup_cache_charge calls __mem_cgroup_try_charge for the page cache
and that one doesn't check GFP_MEMCG_NO_OOM. So you need the following
follow-up patch on top of the one you already have (which should catch
all the remaining cases).
Sorry about that...
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89997ac..559a54d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,6 +2779,7 @@ __mem_cgroup_commit_charge_lrucare(struct page *page, struct mem_cgroup *memcg,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg = NULL;
 	int ret;
 
@@ -2791,7 +2792,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		mm = &init_mm;
 
 	if (page_is_file_cache(page)) {
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, true);
+		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, &memcg, oom);
 		if (ret || !memcg)
 			return ret;
 
@@ -2827,6 +2828,7 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
+	bool oom = !(mask & GFP_MEMCG_NO_OOM);
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -2849,13 +2851,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!memcg)
 		goto charge_cur_mm;
 	*ptr = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, oom);
 	css_put(&memcg->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, oom);
 }
 
 static void
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometime early next year when
>all the things settle a bit.


Btw, i noticed one more thing when problem is happening (=when any
cgroup is stuck), i forgot to mention it before, sorry :( . It's related
to HDDs, something is slowing them down in a strange way. All services
are working normally and i really cannot notice any slowness, the only
thing which i noticed is affected is our backup software (
www.Bacula.org ). When problem occurs at night, so it's happening when
backup is running, backup is extremely slow and usually doesn't finish
until i kill processes inside affected cgroup (=until i resolve the
problem). Backup software is NOT doing big HDD bandwidth BUT it's doing
quite huge number of disk operations (it needs to stat every file and
directory). I believe that only speed of disk operations is affected and
is very slow.

Merry Christmas!
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>OK, good to hear and fingers crossed. I will try to get back to the
>original problem and a better solution sometime early next year when
>all the things settle a bit.


Michal, problem, unfortunately, happened again :( twice. When it
happened first time (two days ago) i didn't want to believe it so i
recompiled the kernel and booted it again to be sure i really used your
patch. Today it happened again, here is report:
http://watchdog.sk/lkml/memcg-bug-3.tar.gz

Here is patch which i used (kernel 3.2.35, i didn't use any other from
your patches):
http://watchdog.sk/lkml/5-memcg-fix.patch

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Tue 18-12-12 15:22:23, azurIt wrote:
> >It should mitigate the problem. The real fix shouldn't be that specific
> >(as per discussion in other thread). The chance this will get upstream
> >is not big and that means that it will not get to the stable tree
> >either.
>
>
> OOM is no longer killing processes outside target cgroups, so
> everything looks fine so far. Will report back when i have more
> info. Thanks!

OK, good to hear and fingers crossed. I will try to get back to the
original problem and a better solution sometime early next year when
all the things settle a bit.
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>It should mitigate the problem. The real fix shouldn't be that specific
>(as per discussion in other thread). The chance this will get upstream
>is not big and that means that it will not get to the stable tree
>either.


OOM is no longer killing processes outside target cgroups, so everything
looks fine so far. Will report back when i have more info. Thanks!

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 17-12-12 19:23:01, azurIt wrote:
> >[Ohh, I am really an idiot. I screwed the first patch]
> >- bool oom = true;
> >+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
> >
> >Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
> > No idea how I could have missed that. I am really sorry about that.
>
>
> :D no problem :) so, now it should really work as expected and
> completely fix my original problem?

It should mitigate the problem. The real fix shouldn't be that specific
(as per discussion in other thread). The chance this will get upstream
is not big and that means that it will not get to the stable tree
either.

> is it safe to apply it on 3.2.35?

I didn't check what are the differences but I do not think there is
anything to conflict with it.

> Thank you very much!

HTH
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>[Ohh, I am really an idiot. I screwed the first patch]
>- bool oom = true;
>+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
>
>Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
> No idea how I could have missed that. I am really sorry about that.


:D no problem :) so, now it should really work as expected and
completely fix my original problem? is it safe to apply it on 3.2.35?
Thank you very much!

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 17-12-12 02:34:30, azurIt wrote:
> >I would try to limit changes to minimum. So the original kernel you were
> >using + the first patch to prevent OOM from the write path + 2 debugging
> >patches.
>
>
> It didn't take down the whole system this time (but i was
> prepared to record a video of console ;) ), here it is:
> http://www.watchdog.sk/lkml/oom_mysqld4

[...]
[ 1248.059429] ------------[ cut here ]------------
[ 1248.059586] WARNING: at mm/memcontrol.c:2400 T.1146+0x2d9/0x610()
[ 1248.059723] Hardware name: S5000VSA
[ 1248.059855] gfp_mask:208 nr_pages:1 oom:0 ret:2

This is a GFP_KERNEL allocation, which is expected. It is also a simple
page, which is not that expected because we shouldn't return ENOMEM on
those unless this was a GFP_ATOMIC allocation (which it wasn't) or the
caller told us not to trigger OOM, which is the case only for THP pages
(see mem_cgroup_charge_common). So the big question is how have we ended
up with oom=false here...

[Ohh, I am really an idiot. I screwed the first patch]
- bool oom = true;
+ bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);

Which obviously doesn't work. It should read !(gfp_mask & GFP_MEMCG_NO_OOM).
No idea how I could have missed that. I am really sorry about that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c04676d..1f35a74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2704,7 +2704,7 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	bool oom = !(gfp_mask | GFP_MEMCG_NO_OOM);
+	bool oom = !(gfp_mask & GFP_MEMCG_NO_OOM);
 	int ret;
 
 	if (PageTransHuge(page)) {
--
Michal Hocko
SUSE Labs
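Why `!(gfp_mask | GFP_MEMCG_NO_OOM)` gave `oom:0` for a plain GFP_KERNEL request can be shown in a few lines of C. A minimal sketch, assuming made-up constants: `DEMO_GFP_KERNEL` is 0xd0 because the warning prints `gfp_mask:208` and the message identifies it as GFP_KERNEL, while `DEMO_GFP_NO_OOM` is a hypothetical bit, since the real value of `GFP_MEMCG_NO_OOM` is not shown in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_GFP_KERNEL 0xd0u    /* 208 decimal, as in "gfp_mask:208" */
#define DEMO_GFP_NO_OOM 0x10000u /* hypothetical bit, for illustration only */

/*
 * Buggy form: OR-ing in a non-zero flag always yields a non-zero value,
 * so the negation is false for every possible mask.
 */
static bool oom_allowed_buggy(unsigned int gfp_mask)
{
	return !(gfp_mask | DEMO_GFP_NO_OOM);
}

/* Fixed form: AND tests whether the flag bit is actually set. */
static bool oom_allowed_fixed(unsigned int gfp_mask)
{
	return !(gfp_mask & DEMO_GFP_NO_OOM);
}

/* Self-check for a mask that does not carry the no-OOM bit. */
static void demo_check(void)
{
	assert(!oom_allowed_buggy(DEMO_GFP_KERNEL)); /* wrongly forbids OOM */
	assert(oom_allowed_fixed(DEMO_GFP_KERNEL));  /* correctly allows OOM */
}
```

The buggy variant returns false even for a mask of zero, which is why every charge went down the "do not OOM, return ENOMEM" path regardless of what the caller requested.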
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.


It didn't take down the whole system this time (but i was prepared to
record a video of console ;) ), here it is:
http://www.watchdog.sk/lkml/oom_mysqld4
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>I would try to limit changes to minimum. So the original kernel you were
>using + the first patch to prevent OOM from the write path + 2 debugging
>patches.

ok.


>But was it at least related to the debugging from the patch or it was
>rather a totally unrelated thing?

I wasn't reading it much but i think it looks like the traces i was
sending you before.
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 10-12-12 11:18:17, azurIt wrote:
> >Hmm, this is _really_ surprising. The latest patch didn't add any new
> >logging actually. It just enhanced messages which were already printed
> >out previously + changed few functions to be not inlined so they show up
> >in the traces. So the only explanation is that the workload has changed
> >or the patches got misapplied.
>
>
> This time i installed 3.2.35, maybe some changes between .34 and .35
> did this? Should i try .34?

I would try to limit changes to minimum. So the original kernel you were
using + the first patch to prevent OOM from the write path + 2 debugging
patches.

> >> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From
> >> 141.105.120.152: bruteforce prevention initiated for the next 30 minutes
> >> or until service restarted, stalling each fork 30 seconds. Please
> >> investigate the crash report for
> >> /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258
> >> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142]
> >> uid/euid:0/0 gid/egid:0/0
> >
> >This explains why you have seen your machine hung. I am not familiar
> >with grsec but stalling each fork 30s sounds really bad.
>
>
> Btw, i never ever saw such a message from grsecurity yet. Will write to grsec
> mailing list about explanation.
>
>
> >Anyway this will not help me much. Do you happen to still have any of
> >those logged traces from the last run?
>
>
> Unfortunately not, it didn't log anything and tons of messages were
> printed only to console (i was logged via IP-KVM). It looked that
> printing is infinite, i rebooted it after few minutes.

But was it at least related to the debugging from the patch or it was
rather a totally unrelated thing?

> >Apart from that. If my current understanding is correct then this is
> >related to transparent huge pages (and leaking charge to the page fault
> >handler). Do you see the same problem if you disable THP before you
> >start your workload? (echo never >
> >/sys/kernel/mm/transparent_hugepage/enabled)
>
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

Weee. Then it cannot be related to THP at all. Which makes this an even
bigger mystery. We really need to find out who is leaking that charge.
--
Michal Hocko
SUSE Labs
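The `cat` check above can also be done programmatically, which makes the "file is missing" case explicit. This is only a sketch: `read_first_line()` and `thp_knob_present()` are hypothetical helper names invented here, and a missing file is read as "the THP control is not available on this kernel", which is how the reply interprets it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a sysfs-style file into buf.
 * Returns 0 on success or a negative errno value (for example -ENOENT
 * when the file does not exist).
 */
static int read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
	return 0;
}

/* 1 if the control file exists, 0 if it is absent, negative errno otherwise. */
static int thp_knob_present(const char *path, char *buf, size_t len)
{
	int ret = read_first_line(path, buf, len);

	if (ret == -ENOENT)
		return 0;
	return ret < 0 ? ret : 1;
}
```

A caller would pass `/sys/kernel/mm/transparent_hugepage/enabled` as `path`; a return value of 0 corresponds to the "No such file or directory" result quoted above, and on a return of 1 the current setting is left in `buf`.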
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Hmm, this is _really_ surprising. The latest patch didn't add any new
>logging actually. It just enhanced messages which were already printed
>out previously + changed few functions to be not inlined so they show up
>in the traces. So the only explanation is that the workload has changed
>or the patches got misapplied.


This time i installed 3.2.35, maybe some changes between .34 and .35 did
this? Should i try .34?


>> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152:
>> bruteforce prevention initiated for the next 30 minutes or until service
>> restarted, stalling each fork 30 seconds. Please investigate the crash
>> report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258
>> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142]
>> uid/euid:0/0 gid/egid:0/0
>
>This explains why you have seen your machine hung. I am not familiar
>with grsec but stalling each fork 30s sounds really bad.


Btw, i never ever saw such a message from grsecurity yet. Will write to
grsec mailing list about explanation.


>Anyway this will not help me much. Do you happen to still have any of
>those logged traces from the last run?


Unfortunately not, it didn't log anything and tons of messages were
printed only to console (i was logged via IP-KVM). It looked that
printing is infinite, i rebooted it after few minutes.


>Apart from that. If my current understanding is correct then this is
>related to transparent huge pages (and leaking charge to the page fault
>handler). Do you see the same problem if you disable THP before you
>start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory

# ls -la /sys/kernel/mm
total 0
drwx-- 3 root root 0 Dec 10 11:11 .
drwx-- 5 root root 0 Dec 10 02:06 ..
drwx-- 2 root root 0 Dec 10 11:11 cleancache
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Mon 10-12-12 02:20:38, azurIt wrote:
[...]
> Michal,

Hi,

> this was printing so many debug messages to console that the whole
> server hung

Hmm, this is _really_ surprising. The latest patch didn't add any new
logging actually. It just enhanced messages which were already printed
out previously + changed few functions to be not inlined so they show up
in the traces. So the only explanation is that the workload has changed
or the patches got misapplied.

> and i had to hard reset it after several minutes :( Sorry
> but i cannot test such things in production. There's no problem with
> one soft reset which takes 4 minutes but this hard reset creates about
> 20 minutes outage (mainly cos of disk quotas checking).

Understood.

> Last logged message:
>
> Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152:
> bruteforce prevention initiated for the next 30 minutes or until service
> restarted, stalling each fork 30 seconds. Please investigate the crash
> report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258
> gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142]
> uid/euid:0/0 gid/egid:0/0

This explains why you have seen your machine hung. I am not familiar
with grsec but stalling each fork 30s sounds really bad.

Anyway this will not help me much. Do you happen to still have any of
those logged traces from the last run?

Apart from that. If my current understanding is correct then this is
related to transparent huge pages (and leaking charge to the page fault
handler). Do you see the same problem if you disable THP before you
start your workload? (echo never > /sys/kernel/mm/transparent_hugepage/enabled)
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).

Michal,

this was printing so many debug messages to console that the whole server hangs and i had to hard reset it after several minutes :( Sorry but i cannot test such things in production. There's no problem with one soft reset which takes 4 minutes but this hard reset creates about 20 minutes outage (mainly cos of disk quotas checking).

Last logged message:

Dec 10 02:03:29 server01 kernel: [ 220.366486] grsec: From 141.105.120.152: bruteforce prevention initiated for the next 30 minutes or until service restarted, stalling each fork 30 seconds. Please investigate the crash report for /usr/lib/apache2/mpm-itk/apache2[apache2:3586] uid/euid:1258/1258 gid/egid:100/100, parent /usr/lib/apache2/mpm-itk/apache2[apache2:2142] uid/euid:0/0 gid/egid:0/0
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Thu 06-12-12 11:12:49, azurIt wrote:
> >Dohh. The very same stack mem_cgroup_newpage_charge called from the page
> >fault. The heavy inlining is not particularly helping here... So there
> >must be some other THP charge leaking out.
> >[/me is diving into the code again]
> >
> >* do_huge_pmd_anonymous_page falls back to handle_pte_fault
> >* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
> >  charge the huge page
> >* do_huge_pmd_wp_page splits the huge page and retries with fallback to
> >  handle_pte_fault
> >* collapse_huge_page is not called in the page fault path
> >* do_wp_page, do_anonymous_page and __do_fault operate on a single page
> >  so the memcg charging cannot return ENOMEM
> >
> >There are no other callers AFAICS so I am getting clueless. Maybe more
> >debugging will tell us something (the inlining has been reduced for thp
> >paths which can reduce performance in thp page fault heavy workloads but
> >this will give us better traces - I hope).
>
>
> Should i apply all patches together? (fix for this bug, more log
> messages, backported fix from 3.5 and this new one)

Yes please
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Dohh. The very same stack mem_cgroup_newpage_charge called from the page
>fault. The heavy inlining is not particularly helping here... So there
>must be some other THP charge leaking out.
>[/me is diving into the code again]
>
>* do_huge_pmd_anonymous_page falls back to handle_pte_fault
>* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
>  charge the huge page
>* do_huge_pmd_wp_page splits the huge page and retries with fallback to
>  handle_pte_fault
>* collapse_huge_page is not called in the page fault path
>* do_wp_page, do_anonymous_page and __do_fault operate on a single page
>  so the memcg charging cannot return ENOMEM
>
>There are no other callers AFAICS so I am getting clueless. Maybe more
>debugging will tell us something (the inlining has been reduced for thp
>paths which can reduce performance in thp page fault heavy workloads but
>this will give us better traces - I hope).

Should i apply all patches together? (fix for this bug, more log messages, backported fix from 3.5 and this new one)

>Anyway do you see the same problem if transparent huge pages are
>disabled?
>(echo never > /sys/kernel/mm/transparent_hugepage/enabled)

# cat /sys/kernel/mm/transparent_hugepage/enabled
cat: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
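The check in the message above can also be scripted. The sketch below is only an illustration: it assumes the usual sysfs format in which the active mode is shown in square brackets (for example `always madvise [never]`), and it treats a missing file as "THP not available on this kernel", which is an assumption about the likely cause rather than something the thread confirms. The function names are made up for this example.

```python
import os

# Path used in the thread; everything else here is illustrative.
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the bracketed (active) mode, e.g. 'never' for 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED):
    """Return the active THP mode, or None if the sysfs file does not exist."""
    if not os.path.exists(path):
        # Assumption: no sysfs entry means the kernel offers no THP knob at all.
        return None
    with open(path) as f:
        return parse_thp_mode(f.read())


if __name__ == "__main__":
    mode = current_thp_mode()
    print("THP not available" if mode is None else f"THP mode: {mode}")
```

If `current_thp_mode()` returns `None`, as it would on the server in this thread, there is nothing to disable and the `echo never` step does not apply.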
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Thu 06-12-12 01:29:24, azurIt wrote:
> >OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
> >This can only happen if this was an atomic allocation request
> >(!__GFP_WAIT) or if oom is not allowed which is the case only for
> >transparent huge page allocation.
> >The first case can be excluded (in the clean 3.2 stable kernel) because
> >all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
> >should be OK because the page fault should fall back to a regular page if
> >THP allocation/charge fails.
> >[/me goes to double check]
> >Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
> >VM_FAULT_OOM without any fallback. We should call do_huge_pmd_wp_page_fallback
> >instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
> >hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
> >patch applies to 3.2 without any further modifications. I didn't have
> >time to test it but if it helps you we should push this to the stable
> >tree.
>
>
> This, unfortunately, didn't fix the problem :(
> http://www.watchdog.sk/lkml/oom_mysqld3

Dohh. The very same stack mem_cgroup_newpage_charge called from the page fault. The heavy inlining is not particularly helping here... So there must be some other THP charge leaking out.
[/me is diving into the code again]

* do_huge_pmd_anonymous_page falls back to handle_pte_fault
* do_huge_pmd_wp_page_fallback falls back to simple pages so it doesn't
  charge the huge page
* do_huge_pmd_wp_page splits the huge page and retries with fallback to
  handle_pte_fault
* collapse_huge_page is not called in the page fault path
* do_wp_page, do_anonymous_page and __do_fault operate on a single page
  so the memcg charging cannot return ENOMEM

There are no other callers AFAICS so I am getting clueless. Maybe more debugging will tell us something (the inlining has been reduced for thp paths which can reduce performance in thp page fault heavy workloads but this will give us better traces - I hope).

Anyway do you see the same problem if transparent huge pages are disabled?
(echo never > /sys/kernel/mm/transparent_hugepage/enabled)
---
From 93a30140b50d8474a047b91c698f4880149635db Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 6 Dec 2012 10:40:17 +0100
Subject: [PATCH] more debugging

---
 mm/huge_memory.c |    6 +++---
 mm/memcontrol.c  |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 470cbb4..01a11f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -671,7 +671,7 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
@@ -790,7 +790,7 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
 	return pgtable;
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+static noinline int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
 					pmd_t *pmd, pmd_t orig_pmd,
@@ -883,7 +883,7 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+noinline int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
 	int ret = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e5b56b..1986c65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,7 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
-	__WARN();
+	__WARN_printf("gfp_mask:%u nr_pages:%u oom:%d ret:%d\n", gfp_mask, nr_pages, oom, ret);
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
-- 
1.7.10.4

--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge.
>This can only happen if this was an atomic allocation request
>(!__GFP_WAIT) or if oom is not allowed which is the case only for
>transparent huge page allocation.
>The first case can be excluded (in the clean 3.2 stable kernel) because
>all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one
>should be OK because the page fault should fall back to a regular page if
>THP allocation/charge fails.
>[/me goes to double check]
>Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with
>VM_FAULT_OOM without any fallback. We should call do_huge_pmd_wp_page_fallback
>instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split
>hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The
>patch applies to 3.2 without any further modifications. I didn't have
>time to test it but if it helps you we should push this to the stable
>tree.

This, unfortunately, didn't fix the problem :(
http://www.watchdog.sk/lkml/oom_mysqld3
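The rule quoted above (ENOMEM can only leak out of the charge for an atomic request or when memcg OOM is not allowed) can be written down as a small truth table. This is only a toy model of the reasoning in the thread, not kernel code; the function name, parameter names and example rows are invented for illustration.

```python
def charge_may_return_enomem(can_wait, oom_allowed):
    """Toy model of the rule stated in the thread, not the kernel's logic.

    can_wait:    False for an atomic request (!__GFP_WAIT)
    oom_allowed: False when the memcg OOM killer is disabled for the charge
    """
    return (not can_wait) or (not oom_allowed)


# (description, can_wait, oom_allowed); the rows are illustrative assumptions.
CASES = [
    ("regular page, GFP_KERNEL", True, True),
    ("transparent huge page charge", True, False),
    ("atomic request (!__GFP_WAIT)", False, True),
]

for name, can_wait, oom_allowed in CASES:
    possible = charge_may_return_enomem(can_wait, oom_allowed)
    print(f"{name:32} ENOMEM possible: {possible}")
```

Under this model only the huge page row applies to a clean 3.2 page fault, which is why the investigation in the thread concentrates on the THP paths.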
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Wed 05-12-12 02:36:44, azurIt wrote:
> >The following should print the traces when we hand over ENOMEM to the
> >caller. It should catch all charge paths (migration is not covered but
> >that one is not important here). If we don't see any traces from here
> >and there is still global OOM striking then there must be something else
> >to trigger this.
> >Could you test this with the patch which aims at fixing your deadlock,
> >please? I realise that this is a production environment but I do not see
> >anything relevant in the code.
>
>
> Michal,
>
> i think/hope this is what you wanted:
> http://www.watchdog.sk/lkml/oom_mysqld2

Dec 5 02:20:48 server01 kernel: [ 380.995947] WARNING: at mm/memcontrol.c:2400 T.1146+0x2c1/0x5d0()
Dec 5 02:20:48 server01 kernel: [ 380.995950] Hardware name: S5000VSA
Dec 5 02:20:48 server01 kernel: [ 380.995952] Pid: 5351, comm: apache2 Not tainted 3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.995954] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.995960] [] warn_slowpath_common+0x7a/0xb0
Dec 5 02:20:48 server01 kernel: [ 380.995963] [] warn_slowpath_null+0x1a/0x20
Dec 5 02:20:48 server01 kernel: [ 380.995965] [] T.1146+0x2c1/0x5d0
Dec 5 02:20:48 server01 kernel: [ 380.995967] [] mem_cgroup_charge_common+0x53/0x90
Dec 5 02:20:48 server01 kernel: [ 380.995970] [] mem_cgroup_newpage_charge+0x45/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995974] [] handle_pte_fault+0x609/0x940
Dec 5 02:20:48 server01 kernel: [ 380.995978] [] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.995981] [] handle_mm_fault+0x138/0x260
Dec 5 02:20:48 server01 kernel: [ 380.995983] [] do_page_fault+0x13d/0x460
Dec 5 02:20:48 server01 kernel: [ 380.995986] [] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.995988] [] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.995992] [] page_fault+0x1f/0x30
Dec 5 02:20:48 server01 kernel: [ 380.995994] ---[ end trace 25bbb3e634c25b7f ]---
Dec 5 02:20:48 server01 kernel: [ 380.996373] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Dec 5 02:20:48 server01 kernel: [ 380.996377] apache2 cpuset=uid mems_allowed=0
Dec 5 02:20:48 server01 kernel: [ 380.996379] Pid: 5351, comm: apache2 Tainted: G        W    3.2.34-grsec #1
Dec 5 02:20:48 server01 kernel: [ 380.996380] Call Trace:
Dec 5 02:20:48 server01 kernel: [ 380.996384] [] dump_header+0x7e/0x1e0
Dec 5 02:20:48 server01 kernel: [ 380.996387] [] ? find_lock_task_mm+0x2f/0x70
Dec 5 02:20:48 server01 kernel: [ 380.996389] [] oom_kill_process+0x85/0x2a0
Dec 5 02:20:48 server01 kernel: [ 380.996392] [] out_of_memory+0xe5/0x200
Dec 5 02:20:48 server01 kernel: [ 380.996394] [] ? pte_alloc_one+0x3f/0x50
Dec 5 02:20:48 server01 kernel: [ 380.996397] [] pagefault_out_of_memory+0xbd/0x110
Dec 5 02:20:48 server01 kernel: [ 380.996399] [] mm_fault_error+0xb6/0x1a0
Dec 5 02:20:48 server01 kernel: [ 380.996401] [] do_page_fault+0x3ee/0x460
Dec 5 02:20:48 server01 kernel: [ 380.996403] [] ? do_mmap_pgoff+0x3dc/0x430
Dec 5 02:20:48 server01 kernel: [ 380.996405] [] ? remove_vma+0x5d/0x80
Dec 5 02:20:48 server01 kernel: [ 380.996408] [] page_fault+0x1f/0x30

OK, so the ENOMEM seems to be leaking from mem_cgroup_newpage_charge. This can only happen if this was an atomic allocation request (!__GFP_WAIT) or if oom is not allowed which is the case only for transparent huge page allocation.
The first case can be excluded (in the clean 3.2 stable kernel) because all callers of mem_cgroup_newpage_charge use GFP_KERNEL. The latter one should be OK because the page fault should fall back to a regular page if THP allocation/charge fails.
[/me goes to double check]
Hmm do_huge_pmd_wp_page seems to charge a huge page and fails with VM_FAULT_OOM without any fallback. We should call do_huge_pmd_wp_page_fallback instead. This has been fixed in 3.5-rc1 by 1f1d06c3 (thp, memcg: split hugepage for memcg oom on cow) but it hasn't been backported to 3.2. The patch applies to 3.2 without any further modifications. I didn't have time to test it but if it helps you we should push this to the stable tree.
---
From 765f5e0121c4410faa19c088e9ada75976bde178 Mon Sep 17 00:00:00 2001
From: David Rientjes
Date: Tue, 29 May 2012 15:06:23 -0700
Subject: [PATCH] thp, memcg: split hugepage for memcg oom on cow

On COW, a new hugepage is allocated and charged to the memcg. If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill.

Instead, it's possible to fallback to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg which has a higher likelihood to succeed. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process.

Signed-off-by: David Rientjes
Cc:
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>The following should print the traces when we hand over ENOMEM to the
>caller. It should catch all charge paths (migration is not covered but
>that one is not important here). If we don't see any traces from here
>and there is still global OOM striking then there must be something else
>to trigger this.
>Could you test this with the patch which aims at fixing your deadlock,
>please? I realise that this is a production environment but I do not see
>anything relevant in the code.

Michal,

i think/hope this is what you wanted:
http://www.watchdog.sk/lkml/oom_mysqld2
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 17:19:23, Michal Hocko wrote:
[...]
> The important question is why you see VM_FAULT_OOM and whether memcg
> charging failure can trigger that. I do not see how this could happen
> right now because __GFP_NORETRY is not used for user pages (except for
> THP which disables memcg OOM already), file backed page faults (aka
> __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM.
> This is a real head scratcher.

The following should print the traces when we hand over ENOMEM to the caller. It should catch all charge paths (migration is not covered but that one is not important here). If we don't see any traces from here and there is still global OOM striking then there must be something else to trigger this.
Could you test this with the patch which aims at fixing your deadlock, please? I realise that this is a production environment but I do not see anything relevant in the code.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c8425b1..9e5b56b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2397,6 +2397,7 @@ done:
 	return 0;
 nomem:
 	*ptr = NULL;
+	__WARN();
 	return -ENOMEM;
 bypass:
 	*ptr = NULL;
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>The only strange thing I noticed is that some groups have 0 limit. Is
>this intentional?
>grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
>      3 memory.limit_in_bytes:0

These are users who are not allowed to run anything.

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 17:26:51, azurIt wrote:
> >Could you also post your complete containers configuration, maybe there
> >is something strange in there (basically grep . -r YOUR_CGROUP_MNT
> >except for tasks files which are of no use right now).
>
>
> Here it is:
> http://www.watchdog.sk/lkml/cgroups.gz

The only strange thing I noticed is that some groups have 0 limit. Is this intentional?
grep memory.limit_in_bytes cgroups | grep -v uid | sed 's@.*/@@' | sort | uniq -c
      3 memory.limit_in_bytes:0
    254 memory.limit_in_bytes:104857600
    107 memory.limit_in_bytes:157286400
     68 memory.limit_in_bytes:209715200
     10 memory.limit_in_bytes:262144000
     28 memory.limit_in_bytes:314572800
      1 memory.limit_in_bytes:346030080
      1 memory.limit_in_bytes:524288000
      2 memory.limit_in_bytes:9223372036854775807
--
Michal Hocko
SUSE Labs
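The same tally can be produced with a short script that also translates the byte values into something readable. This is a sketch under assumptions: it expects lines of the form `path/memory.limit_in_bytes:VALUE`, as `grep -r` would print them, it mirrors the `grep -v uid` filter from the pipeline, and the group paths in the sample are hypothetical. The value 9223372036854775807 is treated as "unlimited" because it is the largest signed 64-bit integer.

```python
from collections import Counter

UNLIMITED = 2**63 - 1  # 9223372036854775807


def describe(limit):
    """Render a byte limit as whole MiB, or 'unlimited' for the maximum value."""
    if limit == UNLIMITED:
        return "unlimited"
    return f"{limit // (1024 * 1024)} MiB"


def limit_histogram(lines):
    """Count memory.limit_in_bytes values, skipping lines that mention 'uid'."""
    counts = Counter()
    for line in lines:
        path, _, value = line.strip().rpartition(":")
        if not path.endswith("memory.limit_in_bytes") or "uid" in line:
            continue
        counts[int(value)] += 1
    return counts


# Hypothetical sample; the real input is the 'cgroups' dump from the thread.
SAMPLE = [
    "cgroups/a/memory.limit_in_bytes:104857600",
    "cgroups/b/memory.limit_in_bytes:104857600",
    "cgroups/c/memory.limit_in_bytes:0",
    "cgroups/mysql/memory.limit_in_bytes:9223372036854775807",
    "cgroups/uid/memory.limit_in_bytes:524288000",
    "cgroups/a/memory.usage_in_bytes:4096",
]

for limit, count in sorted(limit_histogram(SAMPLE).items()):
    print(f"{count:7} {describe(limit)}")
```

Read this way, the limits listed in the message are 0, 100, 150, 200, 250, 300, 330 and 500 MiB, plus two groups with no limit.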
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Could you also post your complete containers configuration, maybe there
>is something strange in there (basically grep . -r YOUR_CGROUP_MNT
>except for tasks files which are of no use right now).

Here it is:
http://www.watchdog.sk/lkml/cgroups.gz
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 16:59:37, azurIt wrote:
> >> Here is the full boot log:
> >> www.watchdog.sk/lkml/kern.log
> >
> >The log is not complete. Could you paste the complete dmesg output? Or
> >even better, do you have logs from the previous run?
>
>
> What is missing there? All kernel messages are logging into
> /var/log/kern.log (it's the same as dmesg), dmesg itself was already
> overwritten by other messages. I think it's all that the kernel printed.

Early boot messages are missing - so exactly the BIOS memory map I was asking for. As NUMA has been excluded it is probably not that relevant anymore.

The important question is why you see VM_FAULT_OOM and whether memcg charging failure can trigger that. I do not see how this could happen right now because __GFP_NORETRY is not used for user pages (except for THP which disables memcg OOM already), file backed page faults (aka __do_fault) use mem_cgroup_newpage_charge which doesn't disable OOM. This is a real head scratcher.

Could you also post your complete containers configuration, maybe there is something strange in there (basically grep . -r YOUR_CGROUP_MNT except for tasks files which are of no use right now).
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>> Here is the full boot log:
>> www.watchdog.sk/lkml/kern.log
>
>The log is not complete. Could you paste the complete dmesg output? Or
>even better, do you have logs from the previous run?

What is missing there? All kernel messages are logging into /var/log/kern.log (it's the same as dmesg), dmesg itself was already overwritten by other messages. I think it's all that the kernel printed.
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 16:08:11, azurIt wrote:
> >The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
> >of the memory above 4G or you have a numa machine and the rest of the
> >memory is at other node. Could you post your memory map printed during
> >the boot? (e820: BIOS-provided physical RAM map: and following lines)
>
>
> Here is the full boot log:
> www.watchdog.sk/lkml/kern.log

The log is not complete. Could you paste the complete dmesg output? Or even better, do you have logs from the previous run?

> >You have mentioned that you are comounting with cpuset. If this happens
> >to be a NUMA machine have you made the access to all nodes available?
> >Also what does /proc/sys/vm/zone_reclaim_mode say?
>
>
> Don't really know what NUMA means and which nodes are you talking
> about, sorry :(

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

> # cat /proc/sys/vm/zone_reclaim_mode
> cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory

OK, so NUMA is not enabled.
--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 16:03:47, Michal Hocko wrote:
[...]
> Anyway, the more interesting thing is gfp_mask is GFP_NOWAIT allocation
> from the page fault? Huh this shouldn't happen - ever.

OK, it starts making sense now. The message came from pagefault_out_of_memory which doesn't have the gfp mask nor the required node information any longer. This suggests that VM_FAULT_OOM has been returned by the fault handler. So this hasn't been triggered by the page fault allocator.
I am wondering whether this could be caused by the patch but the effect of that one should be limited to the write path (unlike the later version for the -mm tree which hooks into shmem as well).

Will have to think about it some more.
--
Michal Hocko
SUSE Labs
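To spot such reports in a long kernel log, the header line can be parsed and flagged when the mask is zero. The sketch below is illustrative only: the regular expression covers just the fields shown in this thread, and treating `gfp_mask=0x0` as "raised via pagefault_out_of_memory" is an inference taken from the message above, not a general rule. The names are invented for this example.

```python
import re

# Matches the header line as it appears in this thread's logs.
OOM_HEADER = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+), order=(?P<order>-?\d+)"
)


def parse_oom_header(line):
    """Extract comm, gfp_mask and order from an 'invoked oom-killer' line."""
    m = OOM_HEADER.search(line)
    if m is None:
        return None
    gfp_mask = int(m.group("gfp"), 16)
    return {
        "comm": m.group("comm"),
        "gfp_mask": gfp_mask,
        "order": int(m.group("order")),
        # Assumption from the thread: a zero mask means no allocation context.
        "no_alloc_context": gfp_mask == 0,
    }


LINE = ("Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked "
        "oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0")

print(parse_oom_header(LINE))
```

A report with `no_alloc_context` set is a hint to look at what the fault handler returned rather than at the page allocator.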
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
>of the memory above 4G or you have a numa machine and the rest of the
>memory is at other node. Could you post your memory map printed during
>the boot? (e820: BIOS-provided physical RAM map: and following lines)

Here is the full boot log:
www.watchdog.sk/lkml/kern.log

>You have mentioned that you are comounting with cpuset. If this happens
>to be a NUMA machine have you made the access to all nodes available?
>Also what does /proc/sys/vm/zone_reclaim_mode say?

Don't really know what NUMA means and which nodes are you talking about, sorry :(

# cat /proc/sys/vm/zone_reclaim_mode
cat: /proc/sys/vm/zone_reclaim_mode: No such file or directory

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 15:44:31, Michal Hocko wrote:
> On Fri 30-11-12 14:44:27, azurIt wrote:
> > >Anyway your system is under both global and local memory pressure. You
> > >didn't see apache going down previously because it was probably the one
> > >which was stuck and could be killed.
> > >Anyway you need to setup your system more carefully.
> >
> >
> > There is, also, an evidence that system has enough of memory! :) Just
> > take column 'rss' from process list in OOM message and sum it - you
> > will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> > 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> > 14.
>
> Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone
> is hardly touched:
> Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB
> min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB
> mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? no
>
> The DMA32 zone usually fills up the first 4G unless your HW remaps the rest
> of the memory above 4G or you have a numa machine and the rest of the
> memory is at other node. Could you post your memory map printed during
> the boot? (e820: BIOS-provided physical RAM map: and following lines)
>
> There is also ZONE_NORMAL which is also not used much
> Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB
> min:12512kB low:15640kB high:18768kB active_anon:1463128kB
> inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB
> unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB
> mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB
> slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB
> pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
>
> You have mentioned that you are comounting with cpuset. If this happens
> to be a NUMA machine have you made the access to all nodes available?

And now that I am looking at the oom message more closely I can see
Nov 30 02:53:56 server01 kernel: [ 818.232812] apache2 invoked oom-killer: gfp_mask=0x0, order=0, oom_adj=0, oom_score_adj=0
Nov 30 02:53:56 server01 kernel: [ 818.233029] apache2 cpuset=uid mems_allowed=0
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30

Which is interesting from 2 perspectives. Only the first node (Node-0) is allowed which would suggest that the cpuset controller is not configured for all nodes. It is still surprising that Node 0 wouldn't have any memory (I would expect ZONE_DMA32 to be sitting there).

Anyway, the more interesting thing is the gfp_mask: a GFP_NOWAIT allocation from the page fault? Huh this shouldn't happen - ever.

> Also what does /proc/sys/vm/zone_reclaim_mode say?
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 14:44:27, azurIt wrote:
> >Anyway your system is under both global and local memory pressure. You
> >didn't see apache going down previously because it was probably the one
> >which was stuck and could be killed.
> >Anyway you need to setup your system more carefully.
>
>
> There is, also, an evidence that system has enough of memory! :) Just
> take column 'rss' from process list in OOM message and sum it - you
> will get 2489911. It's probably in KB so it's about 2.4 GB. System has
> 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of
> 14.

Hmm, that corresponds to the ZONE_DMA32 size pretty nicely but that zone is hardly touched:
Nov 30 02:53:56 server01 kernel: [ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB high:4008kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2542248kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The DMA32 zone usually fills up the first 4G unless your HW remaps the rest of the memory above 4G or you have a numa machine and the rest of the memory is at other node. Could you post your memory map printed during the boot? (e820: BIOS-provided physical RAM map: and following lines)

There is also ZONE_NORMAL which is also not used much
Nov 30 02:53:56 server01 kernel: [ 818.242163] Normal free:6924716kB min:12512kB low:15640kB high:18768kB active_anon:1463128kB inactive_anon:2072kB active_file:1803964kB inactive_file:1072628kB unevictable:3924kB isolated(anon):0kB isolated(file):0kB present:11893760kB mlocked:3924kB dirty:1000kB writeback:776kB mapped:35656kB shmem:3828kB slab_reclaimable:202560kB slab_unreclaimable:50696kB kernel_stack:2944kB pagetables:158616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

You have mentioned that you are comounting with cpuset. If this happens to be a NUMA machine have you made the access to all nodes available?

Also what does /proc/sys/vm/zone_reclaim_mode say?
--
Michal Hocko
SUSE Labs
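The "hardly touched" reading of the two zone lines can be checked numerically. The following sketch pulls the kB counters out of a zone line and compares `free` against `present` and the `low` watermark. It is an illustration only: the sample lines are abridged from the log above, and comparing `free` with `low` is a simplified stand-in for the kernel's actual watermark checks.

```python
import re


def parse_zone_line(line):
    """Return (zone name, {counter: kB}) for one zone line of an OOM report."""
    name = re.search(r"\]\s+(\w+)\s+free:", line).group(1)
    fields = {key: int(val) for key, val in re.findall(r"([\w()]+):(\d+)kB", line)}
    return name, fields


def zone_summary(fields):
    """Fraction of the zone that is free, and whether free is under 'low'."""
    return {
        "free_fraction": fields["free"] / fields["present"],
        "below_low": fields["free"] < fields["low"],
    }


# Abridged from the zone lines quoted in the message above.
DMA32 = ("[ 818.241291] DMA32 free:2523636kB min:2672kB low:3340kB "
         "high:4008kB present:2542248kB")
NORMAL = ("[ 818.242163] Normal free:6924716kB min:12512kB low:15640kB "
          "high:18768kB present:11893760kB")

for line in (DMA32, NORMAL):
    name, fields = parse_zone_line(line)
    s = zone_summary(fields)
    print(f"{name:6} free {s['free_fraction']:.1%} below low watermark: {s['below_low']}")
```

Both zones are far above their watermarks in this report, which supports the point that the OOM kill was not caused by a global shortage of free pages.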
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.

There is, also, an evidence that system has enough of memory! :) Just take column 'rss' from process list in OOM message and sum it - you will get 2489911. It's probably in KB so it's about 2.4 GB. System has 14 GB of RAM so this also match data on my graph - 2.4 is about 17% of 14.

azur
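The arithmetic in the message above can be checked with a few lines of C. This is a sketch only, and the helper names are invented for illustration. It follows the message's own guess that the rss column is in KB. The second helper shows what the same sum would mean if the column were counted in 4 kB pages instead, which is an assumption on my side; the thread does not settle the unit.

```c
#include <assert.h>

/* Convert a kB count to GiB (1 GiB = 1024 * 1024 kB). */
double kb_to_gib(unsigned long kb)
{
	return (double)kb / (1024.0 * 1024.0);
}

/*
 * Convert a page count to GiB for a given page size in kB.
 * Only relevant under the ASSUMPTION that the rss column is in pages.
 */
double pages_to_gib(unsigned long pages, unsigned long page_kb)
{
	assert(page_kb > 0);
	return (double)pages * (double)page_kb / (1024.0 * 1024.0);
}

/* Share of `part` in `whole`, as a percentage. */
double share_pct(double part, double whole)
{
	assert(whole > 0.0);
	return 100.0 * part / whole;
}
```

With the KB reading, kb_to_gib(2489911UL) is a little under 2.4 and share_pct of that against 14.0 is about 17, matching the figures in the message. With the hypothetical 4 kB page reading, pages_to_gib(2489911UL, 4UL) comes out near 9.5, so the conclusion depends on which unit the column really uses.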
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Anyway your system is under both global and local memory pressure. You
>didn't see apache going down previously because it was probably the one
>which was stuck and could be killed.
>Anyway you need to setup your system more carefully.

No, it wasn't, i'm 1000% sure (i was on SSH). Here is the memory usage graph from that system on that time:
http://www.watchdog.sk/lkml/memory.png

The blank part is rebooting into new kernel. MySQL server was killed several times, then i rebooted into previous kernel and problem was gone (not a single MySQL kill). You can see two MySQL kills there on 03:54 and 03:04:30.

>
>> Maybe i should mention that MySQL server has it's own cgroup (called
>> 'mysql') but with no limits to any resources.
>
>Where is that group in the hierarchy?

In root.
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
On Fri 30-11-12 03:29:18, azurIt wrote:
> >Here we go with the patch for 3.2.34. Could you test with this one,
> >please?
>
>
> Michal, unfortunately i had to boot to another kernel because the one
> with this patch keeps killing my MySQL server :( it was, probably,
> doing it on OOM in any cgroup - looks like OOM was not choosing
> processes only from cgroup which is out of memory. Here is the log
> from syslog: http://www.watchdog.sk/lkml/oom_mysqld

You are also seeing global OOM:
Nov 30 02:53:56 server01 kernel: [ 818.233159] Pid: 9247, comm: apache2 Not tainted 3.2.34-grsec #1
Nov 30 02:53:56 server01 kernel: [ 818.233289] Call Trace:
Nov 30 02:53:56 server01 kernel: [ 818.233470] [] dump_header+0x7e/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.233600] [] ? find_lock_task_mm+0x2f/0x70
Nov 30 02:53:56 server01 kernel: [ 818.233721] [] oom_kill_process+0x85/0x2a0
Nov 30 02:53:56 server01 kernel: [ 818.233842] [] out_of_memory+0xe5/0x200
Nov 30 02:53:56 server01 kernel: [ 818.233963] [] ? pte_alloc_one+0x3f/0x50
Nov 30 02:53:56 server01 kernel: [ 818.234082] [] pagefault_out_of_memory+0xbd/0x110
Nov 30 02:53:56 server01 kernel: [ 818.234204] [] mm_fault_error+0xb6/0x1a0
Nov 30 02:53:56 server01 kernel: [ 818.235886] [] do_page_fault+0x3ee/0x460
Nov 30 02:53:56 server01 kernel: [ 818.236006] [] ? vma_merge+0x1f7/0x2c0
Nov 30 02:53:56 server01 kernel: [ 818.236124] [] ? do_brk+0x267/0x400
Nov 30 02:53:56 server01 kernel: [ 818.236244] [] ? gr_learn_resource+0x42/0x1e0
Nov 30 02:53:56 server01 kernel: [ 818.236367] [] page_fault+0x1f/0x30
[...]
Nov 30 02:53:56 server01 kernel: [ 818.356297] Out of memory: Kill process 2188 (mysqld) score 60 or sacrifice child
Nov 30 02:53:56 server01 kernel: [ 818.356493] Killed process 2188 (mysqld) total-vm:3330016kB, anon-rss:864176kB, file-rss:8072kB

Then you also have the memcg oom killer:
Nov 30 02:53:56 server01 kernel: [ 818.375717] Task in /1037/uid killed as a result of limit of /1037
Nov 30 02:53:56 server01 kernel: [ 818.375886] memory: usage 102400kB, limit 102400kB, failcnt 736
Nov 30 02:53:56 server01 kernel: [ 818.376008] memory+swap: usage 102400kB, limit 102400kB, failcnt 0

The messages are intermixed and I guess rate limiting jumped in as
well, because I cannot associate all the oom messages to a specific OOM
event.

Anyway your system is under both global and local memory pressure. You
didn't see apache going down previously because it was probably the one
which was stuck and could be killed.
Anyway you need to setup your system more carefully.

> Maybe i should mention that MySQL server has it's own cgroup (called
> 'mysql') but with no limits to any resources.

Where is that group in the hierarchy?

>
> azurIt
--
Michal Hocko
SUSE Labs
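Since the global and memcg messages are intermixed in the log, one way to start separating them is to tag each syslog line by the marker text it contains. The sketch below is an illustration only: the enum and function names are made up, and the two marker strings are taken verbatim from the excerpts in the message above, so other kernel versions may word them differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical labels for illustration; not kernel definitions. */
enum oom_kind { OOM_NONE, OOM_GLOBAL, OOM_MEMCG };

/*
 * Classify one syslog line using the marker strings seen in the log
 * excerpts quoted in this thread. The match is case-sensitive and
 * purely textual; lines without either marker are OOM_NONE.
 */
enum oom_kind classify_oom_line(const char *line)
{
	assert(line != NULL);
	if (strstr(line, "Out of memory: Kill process"))
		return OOM_GLOBAL;
	if (strstr(line, "killed as a result of limit of"))
		return OOM_MEMCG;
	return OOM_NONE;
}
```

Feeding it the "Out of memory: Kill process 2188 (mysqld)" line returns OOM_GLOBAL and the "Task in /1037/uid killed as a result of limit of /1037" line returns OOM_MEMCG. It does not solve the association problem described above (which follow-up lines belong to which event), it only marks where each kind of event starts.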
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

Michal, unfortunately i had to boot to another kernel because the one with this patch keeps killing my MySQL server :( it was, probably, doing it on OOM in any cgroup - looks like OOM was not choosing processes only from cgroup which is out of memory. Here is the log from syslog:
http://www.watchdog.sk/lkml/oom_mysqld

Maybe i should mention that MySQL server has it's own cgroup (called 'mysql') but with no limits to any resources.

azurIt
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

I installed kernel with this patch, will report back if problem occurs again OR in few weeks if everything will be ok. Thank you!

azurIt
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
>Here we go with the patch for 3.2.34. Could you test with this one,
>please?

Michal, regarding to your conversation with Johannes Weiner, should i try this patch or not?

azur
Re: [PATCH for 3.2.34] memcg: do not trigger OOM from add_to_page_cache_locked
Here we go with the patch for 3.2.34. Could you test with this one,
please?
---
From 0d2d915c16f93918051b7ab8039d30b5a922049c Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Mon, 26 Nov 2012 11:47:57 +0100
Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked

memcg oom killer might deadlock if the process which falls down to
mem_cgroup_handle_oom holds a lock which prevents other task to
terminate because it is blocked on the very same lock.
This can happen when a write system call needs to allocate a page but
the allocation hits the memcg hard limit and there is nothing to reclaim
(e.g. there is no swap or swap limit is hit as well and all cache pages
have been reclaimed already) and the process selected by memcg OOM
killer is blocked on i_mutex on the same inode (e.g. truncate it).

Process A
[] do_truncate+0x58/0xa0		# takes i_mutex
[] do_last+0x250/0xa30
[] path_openat+0xd7/0x440
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x106/0x240
[] sys_open+0x20/0x30
[] system_call_fastpath+0x18/0x1d
[] 0x

Process B
[] mem_cgroup_handle_oom+0x241/0x3b0
[] T.1146+0x5ab/0x5c0
[] mem_cgroup_cache_charge+0xbe/0xe0
[] add_to_page_cache_locked+0x4c/0x140
[] add_to_page_cache_lru+0x22/0x50
[] grab_cache_page_write_begin+0x8b/0xe0
[] ext3_write_begin+0x88/0x270
[] generic_file_buffered_write+0x116/0x290
[] __generic_file_aio_write+0x27c/0x480
[] generic_file_aio_write+0x76/0xf0	# takes ->i_mutex
[] do_sync_write+0xea/0x130
[] vfs_write+0xf3/0x1f0
[] sys_write+0x51/0x90
[] system_call_fastpath+0x18/0x1d
[] 0x

This is not a hard deadlock though because administrator can still
intervene and increase the limit on the group which helps the writer to
finish the allocation and release the lock.

This patch heals the problem by forbidding OOM from page cache charges
(namely add_to_page_cache_locked). mem_cgroup_cache_charge_no_oom helper
function is defined which adds GFP_MEMCG_NO_OOM to the gfp mask which
then tells mem_cgroup_charge_common that OOM is not allowed for the
charge. No OOM from this path, except for fixing the bug, also make some
sense as we really do not want to cause an OOM because of a page cache
usage.
As a possibly visible result add_to_page_cache_lru might fail more often
with ENOMEM but this is to be expected if the limit is set and it is
preferable than OOM killer IMO.

__GFP_NORETRY is abused for this memcg specific flag because no user
accounted allocation use this flag except for THP which have memcg oom
disabled already.

Reported-by: azurIt
Signed-off-by: Michal Hocko
---
 include/linux/gfp.h        |    3 +++
 include/linux/memcontrol.h |   13 +++++++++++++
 mm/filemap.c               |    8 +++++++-
 mm/memcontrol.c            |    2 +-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..806fb54 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -146,6 +146,9 @@ struct vm_area_struct;
 /* 4GB DMA on some platforms */
 #define GFP_DMA32	__GFP_DMA32
 
+/* memcg oom killer is not allowed */
+#define GFP_MEMCG_NO_OOM	__GFP_NORETRY
+
 /* Convert GFP flags to their corresponding migrate type */
 static inline int allocflags_to_migratetype(gfp_t gfp_flags)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81572af..bf0e575 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -63,6 +63,13 @@ extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
+
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return mem_cgroup_cache_charge(page, mm, gfp_mask | GFP_MEMCG_NO_OOM);
+}
+
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
@@ -178,6 +185,12 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 	return 0;
 }
 
+static inline int mem_cgroup_cache_charge_no_oom(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 556858c..ef182a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -449,7 +449,13 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
+	/*
+	 * Cannot trigger OOM even if gfp_mask would allow that normally
+