On Thu, Jun 23, 2016 at 08:26:35PM +0900, Tetsuo Handa wrote: > > Since you think you saw OOM messages with the older kernels, I assume that > the OOM > killer was invoked on your 4.6.2 kernel. The OOM reaper in Linux 4.6 and > Linux 4.7 > will not help if the OOM killed process was between down_write(&mm->mmap_sem) > and > up_write(&mm->mmap_sem). > > I was not able to confirm whether the OOM killed process (I guess it was java) > was holding mm->mmap_sem for write, for /proc/sys/kernel/hung_task_warnings > dropped to 0 before traces of java threads are printed or console became > unusable due to the "delayed: kcryptd_crypt, ..." line. Anyway, I think that > kmallocwd will report it. > > > > It is sad that we haven't merged kmallocwd which will report > > > which memory allocations are stalling > > > ( > > > http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp > > > ). > > > > Would you like me to try it? It wouldn't prevent the hang, though, > > just print better debug ouptut to serial console, right? > > Or would it OOM kill some process? > > Yes, but for bisection purpose, please try commit 78ebc2f7146156f4 without > applying kmallocwd. If that commit helps avoiding flood of the allocation > failure warnings, we can consider backporting it. If that commit does not > help, I think you are reporting a new location which we should not use > memory reserves. > > kmallocwd will not OOM kill some process. kmallocwd will not prevent the hang. > kmallocwd just prints information of threads which are stalling inside memory > allocation request.
First I tried today's git, linux-4.7-rc4-187-g086e3eb, and the good news is that the oom killer seems to work very well and reliably killed the offending task (java). It happened a few times, the AOSP build broke and I restarted it until it completed. E.g.: [ 2083.604374] Purging GPU memory, 0 pages freed, 4508 pages still pinned. [ 2083.611000] 96 and 0 pages still available in the bound and unbound GPU page lists. [ 2083.618815] make invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0 [ 2083.629257] make cpuset=/ mems_allowed=0 ... [ 2084.688753] Out of memory: Kill process 10431 (java) score 378 or sacrifice child [ 2084.696593] Killed process 10431 (java) total-vm:5200964kB, anon-rss:2521764kB, file-rss:0kB, shmem-rss:0kB [ 2084.938058] oom_reaper: reaped process 10431 (java), now anon-rss:0kB, file-rss:8kB, shmem-rss:0kB Next I tried 4.6.2 with 78ebc2f7146156f4, then with kmallocwd (needed one manual fixup), then both patches. It still livelocked in all cases, the log spew looked a bit different with 78ebc2f7146156f4 applied but still continued endlessly. kmallocwd alone didn't trigger, with both patches applied kmallocwd triggered but: [ 363.815595] MemAlloc-Info: stalling=33 dying=0 exiting=42 victim=0 oom_count=0 [ 363.815601] MemAlloc: kworker/0:0(4) flags=0x4208860 switches=212 seq=1 gfp=0x26012c0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_NOTRACK) order=0 delay=17984 ** 1402 printk messages dropped ** [ 363.818816] [<ffffffff8116d519>] __do_page_cache_readahead+0x144/0x29d ** 501 printk messages dropped ** I'll zip up the logs and send them off-list. Thanks, Johannes