On Tue 04-06-13 14:48:52, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote:
[...]
> > Now that I am looking at this again I've realized that this
> > is not correct. The task which triggers memcg OOM will not
> > have memcg_oom.memcg set so it would trigger a global OOM in
> > pagefault_out_of_memory. Either we should return CHARGE_RETRY (and
> > propagate it via mem_cgroup_do_charge) for need_to_kill or set up
> > current->memcg_oom also for need_to_kill.
> 
> You are absolutely right!  I fixed it with a separate bit, like so:
> 
> +       struct memcg_oom_info {
> +               unsigned int in_userfault:1;
> +               unsigned int in_memcg_oom:1;
> +               int wakeups;
> +               struct mem_cgroup *wait_on_memcg;
> +       } memcg_oom;
> 
> in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg
> OOM and that it should always return true to not invoke the global
> killer.  But wait_on_memcg might or might not be set, depending on
> whether the task needs to sleep.
> 
> So the task invoking the memcg OOM killer won't sleep but will retry
> the fault instead (everybody else might be sleeping and the first OOM
> killer invocation might not have freed anything).

OK, I was about to say that you should just return CHARGE_RETRY, but
that wouldn't solve the locking issue: the task which called
mem_cgroup_out_of_memory would still hold locks that might prevent the
victim from terminating. Maybe worth a note in the changelog?
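
Just to double-check my understanding, the no-sleep path for the task
that invoked the killer then boils down to something like this
(condensed from your patch for illustration; wait_for_memcg_oom_wakeup
is a made-up stand-in for the prepare_to_wait()/schedule() block):

	bool mem_cgroup_oom_synchronize(void)
	{
		struct mem_cgroup *memcg;

		if (!current->memcg_oom.in_memcg_oom)
			return false;	/* global OOM, not ours */

		memcg = current->memcg_oom.wait_on_memcg;
		if (!memcg)
			goto out;	/* we ran the killer: just retry the fault */

		/* everybody else sleeps until an uncharge/limit-raise wakeup */
		wait_for_memcg_oom_wakeup(memcg);

		current->memcg_oom.wait_on_memcg = NULL;
	out:
		current->memcg_oom.in_memcg_oom = 0;
		return true;	/* never fall through to the global killer */
	}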

[...]
> > > @@ -3760,22 +3761,14 @@ unlock:
> > >  /*
> > >   * By the time we get here, we already hold the mm semaphore
> > >   */
> > > -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > -         unsigned long address, unsigned int flags)
> > > +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > +                      unsigned long address, unsigned int flags)
> > 
> > Is this reusable? Who would call this helper, or is it just for code
> > readability? I would probably prefer a smaller patch, but I do not
> > have a strong opinion on this.
> 
> I just figured I'd move all the task-specific setup into the outer
> function and leave this inner one with the actual page table stuff.
> 
> The function is not called from anywhere else, but I don't see a
> problem with splitting it into logical parts and making the individual
> parts more readable.  The alternative would have been to make every
> return in handle_mm_fault() a { ret = -EFOO; goto out }, and that
> would have been harder to follow.

OK
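
For comparison, the rejected variant would have looked roughly like
this (hypothetical sketch, not from any posted patch):

	int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
			    unsigned long address, unsigned int flags)
	{
		int ret;

		mem_cgroup_set_userfault(current);

		if (unlikely(is_vm_hugetlb_page(vma))) {
			ret = hugetlb_fault(mm, vma, address, flags);
			goto out;	/* every early return needs this rewrite */
		}

		/*
		 * ... page table walk, with each "return X;" turned
		 * into "ret = X; goto out;" ...
		 */

	out:
		mem_cgroup_clear_userfault(current);
		return ret;
	}

The wrapper split is indeed easier to follow.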

[...]
> From: Johannes Weiner <han...@cmpxchg.org>
> Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
> 
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation.  Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
> 
> Do two things:
> 
> 1. When OOMing in a system call (buffered IO and friends), invoke the
>    OOM killer but just return -ENOMEM, never sleep.  Userspace should
>    be able to handle this.
> 
> 2. When OOMing in a page fault and somebody else is handling the
>    situation, do not sleep directly in the charging code.  Instead,
>    remember the OOMing memcg in the task struct and then fully unwind
>    the page fault stack first before going to sleep.
> 
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer.  Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>

OK, this one looks good. The bitfields are not the nicest piece of art,
but that is just a minor thing.
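
(If that struct ever grows, a bitfield-free variant would be trivial,
e.g. just a sketch:

	struct memcg_oom_info {
		bool in_userfault;
		bool in_memcg_oom;
		int wakeups;
		struct mem_cgroup *wait_on_memcg;
	} memcg_oom;

but as said, no need to change it now.)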

With the remaining archs moved to pagefault_out_of_memory, you can add
Reviewed-by: Michal Hocko <mho...@suse.cz>

Thanks!
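
For reference, the arch-side pattern this relies on is roughly the
following (a sketch of what x86 already does, not new code):

	fault = handle_mm_fault(mm, vma, address, flags);
	if (unlikely(fault & VM_FAULT_OOM)) {
		up_read(&mm->mmap_sem);
		/*
		 * The fault stack is fully unwound here, so
		 * pagefault_out_of_memory() can safely call
		 * mem_cgroup_oom_synchronize() and possibly sleep.
		 */
		pagefault_out_of_memory();
		return;
	}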

> ---
>  include/linux/memcontrol.h |  22 +++++++
>  include/linux/mm.h         |   1 +
>  include/linux/sched.h      |   6 ++
>  mm/ksm.c                   |   2 +-
>  mm/memcontrol.c            | 146 ++++++++++++++++++++++++++++-----------------
>  mm/memory.c                |  40 +++++++++----
>  mm/oom_kill.c              |   7 ++-
>  7 files changed, 154 insertions(+), 70 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..b8ec9d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
>  void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
>  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>                                       struct task_struct *p);
> +static inline void mem_cgroup_set_userfault(struct task_struct *p)
> +{
> +     p->memcg_oom.in_userfault = 1;
> +}
> +static inline void mem_cgroup_clear_userfault(struct task_struct *p)
> +{
> +     p->memcg_oom.in_userfault = 0;
> +}
> +bool mem_cgroup_oom_synchronize(void);
>  extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>                                       struct page *newpage);
>  
> @@ -337,6 +346,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> +static inline void mem_cgroup_set_userfault(struct task_struct *p)
> +{
> +}
> +
> +static inline void mem_cgroup_clear_userfault(struct task_struct *p)
> +{
> +}
> +
> +static inline bool mem_cgroup_oom_synchronize(void)
> +{
> +     return false;
> +}
> +
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
>                                       bool *locked, unsigned long *flags)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e0c8528..93b29ed 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -167,6 +167,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_RETRY_NOWAIT      0x10    /* Don't drop mmap_sem and wait when retrying */
>  #define FAULT_FLAG_KILLABLE  0x20    /* The fault task is in SIGKILL killable region */
>  #define FAULT_FLAG_TRIED     0x40    /* second try */
> +#define FAULT_FLAG_KERNEL    0x80    /* kernel-triggered fault (get_user_pages etc.) */
>  
>  /*
>   * vm_fault is filled by the the pagefault handler and passed to the vma's
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 178a8d9..270420a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1405,6 +1405,12 @@ struct task_struct {
>               unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
>       } memcg_batch;
>       unsigned int memcg_kmem_skip_account;
> +     struct memcg_oom_info {
> +             unsigned int in_userfault:1;
> +             unsigned int in_memcg_oom:1;
> +             int wakeups;
> +             struct mem_cgroup *wait_on_memcg;
> +     } memcg_oom;
>  #endif
>  #ifdef CONFIG_HAVE_HW_BREAKPOINT
>       atomic_t ptrace_bp_refcnt;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index b6afe0c..9dff93b 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -372,7 +372,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
>                       break;
>               if (PageKsm(page))
>                       ret = handle_mm_fault(vma->vm_mm, vma, addr,
> -                                                     FAULT_FLAG_WRITE);
> +                                     FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
>               else
>                       ret = VM_FAULT_WRITE;
>               put_page(page);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index de22292..537b74b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -298,6 +298,7 @@ struct mem_cgroup {
>  
>       bool            oom_lock;
>       atomic_t        under_oom;
> +     atomic_t        oom_wakeups;
>  
>       atomic_t        refcnt;
>  
> @@ -2168,6 +2169,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
>  
>  static void memcg_wakeup_oom(struct mem_cgroup *memcg)
>  {
> +     atomic_inc(&memcg->oom_wakeups);
>       /* for filtering, pass "memcg" as argument. */
>       __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
> @@ -2179,56 +2181,103 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  }
>  
>  /*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> + * try to call OOM killer
>   */
> -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> -                               int order)
> +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -     struct oom_wait_info owait;
> -     bool locked, need_to_kill;
> -
> -     owait.memcg = memcg;
> -     owait.wait.flags = 0;
> -     owait.wait.func = memcg_oom_wake_function;
> -     owait.wait.private = current;
> -     INIT_LIST_HEAD(&owait.wait.task_list);
> -     need_to_kill = true;
> -     mem_cgroup_mark_under_oom(memcg);
> +     bool locked, need_to_kill = true;
>  
>       /* At first, try to OOM lock hierarchy under memcg.*/
>       spin_lock(&memcg_oom_lock);
>       locked = mem_cgroup_oom_lock(memcg);
> -     /*
> -      * Even if signal_pending(), we can't quit charge() loop without
> -      * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> -      * under OOM is always welcomed, use TASK_KILLABLE here.
> -      */
> -     prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
>       if (!locked || memcg->oom_kill_disable)
>               need_to_kill = false;
>       if (locked)
>               mem_cgroup_oom_notify(memcg);
>       spin_unlock(&memcg_oom_lock);
>  
> -     if (need_to_kill) {
> -             finish_wait(&memcg_oom_waitq, &owait.wait);
> -             mem_cgroup_out_of_memory(memcg, mask, order);
> -     } else {
> -             schedule();
> -             finish_wait(&memcg_oom_waitq, &owait.wait);
> +     /*
> +      * A system call can just return -ENOMEM, but if this is a
> +      * page fault and somebody else is handling the OOM already,
> +      * we need to sleep on the OOM waitqueue for this memcg until
> +      * the situation is resolved.  Which can take some time,
> +      * because it might be handled by a userspace task.
> +      *
> +      * However, this is the charge context, which means that we
> +      * may sit on a large call stack and hold various filesystem
> +      * locks, the mmap_sem etc., and we don't want the OOM handler
> +      * to deadlock on them while we sit here and wait.  Store the
> +      * current OOM context in the task_struct, then return
> +      * -ENOMEM.  At the end of the page fault handler, with the
> +      * stack unwound, pagefault_out_of_memory() will check back
> +      * with us by calling mem_cgroup_oom_synchronize(), possibly
> +      * putting the task to sleep.
> +      */
> +     if (current->memcg_oom.in_userfault) {
> +             current->memcg_oom.in_memcg_oom = 1;
> +             /*
> +              * Somebody else is handling the situation.  Make sure
> +              * no wakeups are missed between now and going to
> +              * sleep at the end of the page fault.
> +              */
> +             if (!need_to_kill) {
> +                     mem_cgroup_mark_under_oom(memcg);
> +                     current->memcg_oom.wakeups =
> +                             atomic_read(&memcg->oom_wakeups);
> +                     css_get(&memcg->css);
> +                     current->memcg_oom.wait_on_memcg = memcg;
> +             }
>       }
> -     spin_lock(&memcg_oom_lock);
> -     if (locked)
> +
> +     if (need_to_kill)
> +             mem_cgroup_out_of_memory(memcg, mask, order);
> +
> +     if (locked) {
> +             spin_lock(&memcg_oom_lock);
>               mem_cgroup_oom_unlock(memcg);
> -     memcg_wakeup_oom(memcg);
> -     spin_unlock(&memcg_oom_lock);
> +             spin_unlock(&memcg_oom_lock);
> +     }
> +}
>  
> -     mem_cgroup_unmark_under_oom(memcg);
> +bool mem_cgroup_oom_synchronize(void)
> +{
> +     struct oom_wait_info owait;
> +     struct mem_cgroup *memcg;
>  
> -     if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> +     /* OOM is global, do not handle */
> +     if (!current->memcg_oom.in_memcg_oom)
>               return false;
> -     /* Give chance to dying process */
> -     schedule_timeout_uninterruptible(1);
> +
> +     /*
> +      * We invoked the OOM killer but there is a chance that a kill
> +      * did not free up any charges.  Everybody else might already
> +      * be sleeping, so restart the fault and keep the rampage
> +      * going until some charges are released.
> +      */
> +     memcg = current->memcg_oom.wait_on_memcg;
> +     if (!memcg)
> +             goto out;
> +
> +     if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> +             goto out_put;
> +
> +     owait.memcg = memcg;
> +     owait.wait.flags = 0;
> +     owait.wait.func = memcg_oom_wake_function;
> +     owait.wait.private = current;
> +     INIT_LIST_HEAD(&owait.wait.task_list);
> +
> +     prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> +     /* Only sleep if we didn't miss any wakeups since OOM */
> +     if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
> +             schedule();
> +     finish_wait(&memcg_oom_waitq, &owait.wait);
> +out_put:
> +     mem_cgroup_unmark_under_oom(memcg);
> +     css_put(&memcg->css);
> +     current->memcg_oom.wait_on_memcg = NULL;
> +out:
> +     current->memcg_oom.in_memcg_oom = 0;
>       return true;
>  }
>  
> @@ -2541,12 +2590,11 @@ enum {
>       CHARGE_RETRY,           /* need to retry but retry is not bad */
>       CHARGE_NOMEM,           /* we can't do more. return -ENOMEM */
>       CHARGE_WOULDBLOCK,      /* GFP_WAIT wasn't set and no enough res. */
> -     CHARGE_OOM_DIE,         /* the current is killed because of OOM */
>  };
>  
>  static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>                               unsigned int nr_pages, unsigned int min_pages,
> -                             bool oom_check)
> +                             bool invoke_oom)
>  {
>       unsigned long csize = nr_pages * PAGE_SIZE;
>       struct mem_cgroup *mem_over_limit;
> @@ -2603,14 +2651,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>       if (mem_cgroup_wait_acct_move(mem_over_limit))
>               return CHARGE_RETRY;
>  
> -     /* If we don't need to call oom-killer at el, return immediately */
> -     if (!oom_check)
> -             return CHARGE_NOMEM;
> -     /* check OOM */
> -     if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
> -             return CHARGE_OOM_DIE;
> +     if (invoke_oom)
> +             mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
>  
> -     return CHARGE_RETRY;
> +     return CHARGE_NOMEM;
>  }
>  
>  /*
> @@ -2713,7 +2757,7 @@ again:
>       }
>  
>       do {
> -             bool oom_check;
> +             bool invoke_oom = oom && !nr_oom_retries;
>  
>               /* If killed, bypass charge */
>               if (fatal_signal_pending(current)) {
> @@ -2721,14 +2765,8 @@ again:
>                       goto bypass;
>               }
>  
> -             oom_check = false;
> -             if (oom && !nr_oom_retries) {
> -                     oom_check = true;
> -                     nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -             }
> -
> -             ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
> -                 oom_check);
> +             ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
> +                                        nr_pages, invoke_oom);
>               switch (ret) {
>               case CHARGE_OK:
>                       break;
> @@ -2741,16 +2779,12 @@ again:
>                       css_put(&memcg->css);
>                       goto nomem;
>               case CHARGE_NOMEM: /* OOM routine works */
> -                     if (!oom) {
> +                     if (!oom || invoke_oom) {
>                               css_put(&memcg->css);
>                               goto nomem;
>                       }
> -                     /* If oom, we never return -ENOMEM */
>                       nr_oom_retries--;
>                       break;
> -             case CHARGE_OOM_DIE: /* Killed by OOM Killer */
> -                     css_put(&memcg->css);
> -                     goto bypass;
>               }
>       } while (ret != CHARGE_OK);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dc1882..6d3d8c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                       while (!(page = follow_page_mask(vma, start,
>                                               foll_flags, &page_mask))) {
>                               int ret;
> -                             unsigned int fault_flags = 0;
> +                             unsigned int fault_flags = FAULT_FLAG_KERNEL;
>  
>                               /* For mlock, just skip the stack guard page. */
>                               if (foll_flags & FOLL_MLOCK) {
> @@ -1943,6 +1943,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>       if (!vma || address < vma->vm_start)
>               return -EFAULT;
>  
> +     fault_flags |= FAULT_FLAG_KERNEL;
>       ret = handle_mm_fault(mm, vma, address, fault_flags);
>       if (ret & VM_FAULT_ERROR) {
>               if (ret & VM_FAULT_OOM)
> @@ -3760,22 +3761,14 @@ unlock:
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   */
> -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> -             unsigned long address, unsigned int flags)
> +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> +                          unsigned long address, unsigned int flags)
>  {
>       pgd_t *pgd;
>       pud_t *pud;
>       pmd_t *pmd;
>       pte_t *pte;
>  
> -     __set_current_state(TASK_RUNNING);
> -
> -     count_vm_event(PGFAULT);
> -     mem_cgroup_count_vm_event(mm, PGFAULT);
> -
> -     /* do counter updates before entering really critical section. */
> -     check_sync_rss_stat(current);
> -
>       if (unlikely(is_vm_hugetlb_page(vma)))
>               return hugetlb_fault(mm, vma, address, flags);
>  
> @@ -3856,6 +3849,31 @@ retry:
>       return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
>  
> +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> +                 unsigned long address, unsigned int flags)
> +{
> +     int in_userfault = !(flags & FAULT_FLAG_KERNEL);
> +     int ret;
> +
> +     __set_current_state(TASK_RUNNING);
> +
> +     count_vm_event(PGFAULT);
> +     mem_cgroup_count_vm_event(mm, PGFAULT);
> +
> +     /* do counter updates before entering really critical section. */
> +     check_sync_rss_stat(current);
> +
> +     if (in_userfault)
> +             mem_cgroup_set_userfault(current);
> +
> +     ret = __handle_mm_fault(mm, vma, address, flags);
> +
> +     if (in_userfault)
> +             mem_cgroup_clear_userfault(current);
> +
> +     return ret;
> +}
> +
>  #ifndef __PAGETABLE_PUD_FOLDED
>  /*
>   * Allocate page upper directory.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 79e451a..0c9f836 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -678,9 +678,12 @@ out:
>   */
>  void pagefault_out_of_memory(void)
>  {
> -     struct zonelist *zonelist = node_zonelist(first_online_node,
> -                                               GFP_KERNEL);
> +     struct zonelist *zonelist;
>  
> +     if (mem_cgroup_oom_synchronize())
> +             return;
> +
> +     zonelist = node_zonelist(first_online_node, GFP_KERNEL);
>       if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
>               out_of_memory(NULL, 0, 0, NULL, false);
>               clear_zonelist_oom(zonelist, GFP_KERNEL);
> -- 
> 1.8.2.3
> 

-- 
Michal Hocko
SUSE Labs