2015-05-08 16:01 GMT+08:00 Yogesh Narayan Gaur <yn.g...@samsung.com>:
>
> ------- Original Message -------
> Sender : yalin wang <yalin.wang2...@gmail.com>
> Date : May 08, 2015 13:17 (GMT+05:30)
> Title : Re: [EDT] oom_killer: find bulkiest task based on pss value
>
> 2015-05-08 13:29 GMT+08:00 Yogesh Narayan Gaur :
>>>
>>> Hi Andrew,
>>>
>>> Presently in oom_kill.c we calculate the badness score of a victim task
>>> from the task's current RSS counter value.
>>> The RSS counter value of a task is '[Private (Dirty/Clean)] +
>>> [Shared (Dirty/Clean)]'.
>>> We have encountered a situation where the Private values are small but
>>> the Shared values are large, making the total RSS counter value large.
>>> On OOM the task with the highest RSS value gets killed, but because its
>>> Private values are small, the memory gained by killing it is less than
>>> expected.
>>>
>>> For example, take the use case below, in which 3 processes are running
>>> in the system.
>>> All of these processes mmap a file in the current directory and then
>>> copy data from this file to locally allocated pointers in a while(1)
>>> loop with some sleep. Two of the 3 processes have mmaped the file with
>>> MAP_SHARED and one has mapped it with MAP_PRIVATE.
>>> I run all 3 processes in the background and check the RSS/PSS values
>>> with a user-space utility (a utility over cat /proc/pid/smaps).
>>> Before OOM, below is the consumed-memory status for these 3 processes
>>> (all processes run with oom_score_adj = 0):
>>> ====================================================
>>> Comm : 1prg, Pid : 213 (values in kB)
>>>             Rss       Shared    Private   Pss
>>> Process :   375764    194596    181168    278460
>>> ====================================================
>>> Comm : 3prg, Pid : 217 (values in kB)
>>>             Rss       Shared    Private   Pss
>>> Process :   305760    32        305728    305738
>>> ====================================================
>>> Comm : 2prg, Pid : 218 (values in kB)
>>>             Rss       Shared    Private   Pss
>>> Process :   389980    194596    195384    292676
>>> ====================================================
>>>
>>> With the present code, the OOM killer first selects process [2prg : 218]
>>> to kill, as its RSS value is the highest. But killing this process
>>> frees only ~195 MB instead of the expected ~389 MB. Identifying the
>>> victim by RSS value is therefore inaccurate: killing the identified
>>> process does not release the expected amount of memory back to the
>>> system.
>>>
>>> We need to select the victim task based on PSS instead of RSS, where
>>> PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of
>>> sharing tasks]
>>> For the above use case, process [3prg : 217] has the largest PSS value,
>>> and by killing it we gain the maximum memory (~305 MB), compared with
>>> killing the process identified by RSS value.
>>>
>>> --
>>> Regards,
>>> Yogesh Gaur.
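>>>
>>> To see how the two metrics diverge for these processes, the arithmetic
>>> can be checked with a minimal stand-alone sketch (the nr_sharers value
>>> is an assumption, since the file pages are shared by the two MAP_SHARED
>>> mappers, so this only approximates the kernel's per-page PSS
>>> accounting):
>>>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>>         struct proc {
>>>                 const char *comm;
>>>                 long shared, priv, nr_sharers;
>>>         };
>>>         /* Shared/Private kB values taken from the table above;
>>>          * nr_sharers = 2 is assumed for the MAP_SHARED file pages. */
>>>         struct proc p[] = {
>>>                 { "1prg", 194596, 181168, 2 },
>>>                 { "3prg",     32, 305728, 2 },
>>>                 { "2prg", 194596, 195384, 2 },
>>>         };
>>>         int i;
>>>
>>>         for (i = 0; i < 3; i++) {
>>>                 long rss = p[i].priv + p[i].shared;
>>>                 long pss = p[i].priv + p[i].shared / p[i].nr_sharers;
>>>
>>>                 printf("%s: Rss = %ld kB, approx Pss = %ld kB\n",
>>>                        p[i].comm, rss, pss);
>>>         }
>>>         return 0;
>>> }
>>>
>>> This prints 2prg as the largest task by Rss (~389 MB) but 3prg as the
>>> largest by approximate Pss (~305 MB), matching the smaps numbers above.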
>
>> Great,
>>
>> in fact, I have also encountered this scenario.
>> I used USS (pages with map count == 1) to decide which process should
>> be killed, and it seems to give the same result as your use of PSS.
>> But PSS is better: it also considers shared pages, for the case where a
>> process has a large shared mapping but only a small Private mapping.
>>
>> BRs,
>> Yalin
>
> I have made a patch which identifies the bulkiest task on the basis of
> its PSS value. Please check the patch below.
> This patch corrects the way the victim task gets identified in an OOM
> condition.
>
> ==================
>
> From 1c3d7f552f696bdbc0126c8e23beabedbd80e423 Mon Sep 17 00:00:00 2001
> From: Yogesh Gaur <yn.g...@samsung.com>
> Date: Thu, 7 May 2015 01:52:13 +0530
> Subject: [PATCH] oom: find victim task based on pss
>
> This patch identifies the bulkiest task for the OOM killer to kill on
> the basis of its PSS value instead of the present RSS value.
> There can be a scenario where the task with the highest RSS counter
> consumes a lot of shared memory, and killing that task does not release
> the expected amount of memory to the system.
> PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of sharing tasks]
> RSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean)]
> Thus, use the PSS value instead of the RSS value, as the PSS value
> closely matches the actual memory usage of the task.
> This patch uses the smaps_pte_range() interface defined under
> CONFIG_PROC_PAGE_MONITOR.
> When CONFIG_PROC_PAGE_MONITOR is disabled, it simply falls back to the
> RSS counter value.
>
> Signed-off-by: Yogesh Gaur <yn.g...@samsung.com>
> Signed-off-by: Amit Arora <amit.ar...@samsung.com>
> Reviewed-by: Ajeet Yadav <ajee...@samsung.com>
> ---
>  fs/proc/task_mmu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h |  9 +++++++++
>  mm/oom_kill.c      |  9 +++++++--
>  3 files changed, 63 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d..dd962ff 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -964,6 +964,53 @@ struct pagemapread {
>  	bool v2;
>  };
>
> +/**
> + * get_mm_pss - determine the PSS count of pages used by a task
> + * @p: task struct of the task to calculate for
> + * @mm: mm struct of the task
> + *
> + * PSS = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of sharing tasks]
> + *
> + * This function needs to be called under task_lock for task 'p'.
> + */
> +long get_mm_pss(struct task_struct *p, struct mm_struct *mm)
> +{
> +	long pss = 0;
> +	struct vm_area_struct *vma = NULL;
> +	struct mem_size_stats mss;
> +	struct mm_walk smaps_walk = {
> +		.pmd_entry = smaps_pte_range,
> +		.private = &mss,
> +	};
> +
> +	if (mm == NULL)
> +		return 0;
> +
> +	/* task_lock held in oom_badness */
> +	smaps_walk.mm = mm;
> +
> +	if (!down_read_trylock(&mm->mmap_sem)) {
> +		pr_warn("Skipping task:%s\n", p->comm);
> +		return 0;
> +	}
> +
> +	vma = mm->mmap;
> +	if (!vma) {
> +		up_read(&mm->mmap_sem);
> +		return 0;
> +	}
> +
> +	while (vma) {
> +		memset(&mss, 0, sizeof(struct mem_size_stats));
> +		walk_page_vma(vma, &smaps_walk);
> +		/* mss.pss is (bytes << PSS_SHIFT); convert to pages */
> +		pss += (unsigned long)(mss.pss >> (PAGE_SHIFT + PSS_SHIFT));
> +
> +		/* Check next vma in the list */
> +		vma = vma->vm_next;
> +	}
> +	up_read(&mm->mmap_sem);
> +	return pss;
> +}
> +
>  #define PAGEMAP_WALK_SIZE	(PMD_SIZE)
>  #define PAGEMAP_WALK_MASK	(PMD_MASK)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..b6bb521 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1423,6 +1423,15 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>  	*maxrss = hiwater_rss;
>  }
>
> +#ifdef CONFIG_PROC_PAGE_MONITOR
> +long get_mm_pss(struct task_struct *p, struct mm_struct *mm);
> +#else
> +static inline long get_mm_pss(struct task_struct *p, struct mm_struct *mm)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #if defined(SPLIT_RSS_COUNTING)
>  void sync_mm_rss(struct mm_struct *mm);
>  #else
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 642f38c..537eb4c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -151,6 +151,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  {
>  	long points;
>  	long adj;
> +	long pss = 0;
>
>  	if (oom_unkillable_task(p, memcg, nodemask))
>  		return 0;
> @@ -167,9 +168,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>
>  	/*
>  	 * The baseline for the badness score is the proportion of RAM that each
> -	 * task's rss, pagetable and swap space use.
> +	 * task's pss, pagetable and swap space use.
>  	 */
> -	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
> +	pss = get_mm_pss(p, p->mm);
> +	if (pss == 0)	/* fall back to RSS as a pseudo-PSS */
> +		pss = get_mm_rss(p->mm);
> +
> +	points = pss + get_mm_counter(p->mm, MM_SWAPENTS) +
>  		atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
>  	task_unlock(p);
>
> --
> 1.7.1
>
>
> --
> BRs
> Yogesh Gaur.

The logic seems OK, but I feel the code footprint is much larger than
that of the original method, so it may have performance problems. I have
added the linux-mm mailing list so that mm experts can review it.
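As a quick cross-check of any candidate ranking, the "utility over
cat /proc/pid/smaps" mentioned earlier can be approximated by a minimal
(hypothetical) user-space program that just sums the Rss:/Pss: fields of
/proc/<pid>/smaps; it assumes only the standard smaps field format:

#include <stdio.h>
#include <string.h>

/* Sum every "Rss:" or "Pss:" line of /proc/<pid>/smaps; returns kB,
 * or -1 if the file cannot be opened. */
static long smaps_total(const char *pid, const char *field)
{
        char path[64], line[256];
        long total = 0, val;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                size_t len = strlen(field);

                if (strncmp(line, field, len) == 0 &&
                    sscanf(line + len, " %ld", &val) == 1)
                        total += val;
        }
        fclose(f);
        return total;
}

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }
        printf("Rss: %ld kB, Pss: %ld kB\n",
               smaps_total(argv[1], "Rss:"),
               smaps_total(argv[1], "Pss:"));
        return 0;
}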
BRs
Yalin.