On Tue, Jun 21, 2016 at 08:47:51PM +0900, Tetsuo Handa wrote:
> Johannes Stezenbach wrote:
> >
> > a man's got to have a hobby, thus I'm running Android AOSP
> > builds on my home PC which has 4GB of RAM, 4GB swap.
> > Apparently it is not really adequate for the job but used to
> > work with a 4.4.10 kernel.  Now I upgraded to 4.6.2
> > and it crashes usually within 30mins during compilation.
>
> Such reproducer is welcomed.
> You might be hitting OOM livelock using innocent workload.
>
> > The crash is a hard hang, mouse doesn't move, no reaction
> > to keyboard, nothing in logs (systemd journal) after reboot.
>
> Yes, it seems to me that your system is OOM livelocked.

My crash log shows that X is hanging in i915_gem_object_get_pages_gtt,
and the network is dead because order-0 allocation failures cause a
series of "ath9k_htc: RX memory allocation error" messages, which is
what makes the issue so unpleasant.

The particular command which triggers it seems to be Jill from the
Android Java toolchain (http://tools.android.com/tech-docs/jackandjill),
which runs as "java -Xmx3500m -jar $(JILL_JAR)", i.e. it potentially
eats all my available RAM when linking the Android framework.

Meanwhile I found some RAM, and linux-4.6.2 runs stable with 8GB for
this workload.  The build time (for the partial AOSP rebuild that fairly
reliably triggered the hang) dropped from ~20min to ~17min (so it wasn't
thrashing too badly), and swap usage dropped from ~50% (of 4GB) to <5%.

> It is sad that we haven't merged kmallocwd which will report
> which memory allocations are stalling
> ( http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-ker...@i-love.sakura.ne.jp ).

Would you like me to try it?  It wouldn't prevent the hang, though,
just print better debug output to the serial console, right?
Or would it OOM kill some process?

> > Then I tried 4.5.7, it seems to be stable so far.
> >
> > I'm using dm-crypt + lvm + ext4 (swap also in lvm).
> >
> > Now I hooked up a laptop to the serial port and captured
> > some logs of the crash which seems to be repeating
> >
> > [ 2240.842567] swapper/3: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
> > or
> > [ 2241.167986] SLUB: Unable to allocate memory on node -1, gfp=0x2080020(GFP_ATOMIC)
> >
> > over and over.  Based on the backtraces in the log I decided
> > to hot-unplug USB devices, and twice the kernel came
> > back to live, but on the 3rd crash it was dead for good.
>
> The values
>
>   DMA free:12kB min:32kB
>   DMA32 free:2268kB min:6724kB
>   Normal free:84kB min:928kB
>
> suggest that memory reserves are spent for pointless purpose.
> Maybe your system is falling into situation which was mitigated by
> commit 78ebc2f7146156f4 ("mm,writeback: don't use memory reserves for
> wb_start_writeback"). Thus, applying that commit to your 4.6.2 kernel
> might help avoiding flood of these allocation failure messages.

I could try.  Could you let me know if booting with mem=4G is
equivalent, or do I need to use memmap= or physically remove
the RAM (which is not so easy since the CPU fan is in the way)?

> > Before I pressed the reset button I used SysRq-W.  At the bottom
> > is a "BUG: workqueue lockup", it could be the result of
> > the log spew on serial console taking so long but it looks
> > like some IO is never completing.
>
> But even after you apply that commit, I guess you will still see
> silent hang up because the page allocator would think there is still
> reclaimable memory. So, is it possible to also try current linux.git
> kernels? I'd like to know whether "OOM detection rework" (which went
> to 4.7) helps giving up reclaiming and invoking the OOM killer with
> your workload.
>
> Maybe __GFP_FS allocations start invoking the OOM killer. But maybe
> __GFP_FS allocations still remain stuck waiting for !__GFP_FS
> allocations whereas !__GFP_FS allocations gives up without invoking
> the OOM killer (i.e. effectively no "give up").

I could try that as well.  Same question about mem= though.

What is your opinion on why the older kernels (4.4, 4.5) work?
I think I've seen some OOM messages with the older kernels:
Jill was killed and I restarted the build to complete it.
A full bisect would take more than a day, and I don't think
I have the time for it.  Since I use dm-crypt + lvm, should
we add more Cc, or do you think it is an mm issue?

> > Below I'm pasting some log snippets, let me know if you like
> > it so much you want more of it ;-/  The total log is about 1.7MB.
>
> Yes, I'd like to browse it. Could you send it to me?

Did you get any additional insights from it?

Thanks,
Johannes
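
For reference, the watermark comparison quoted above (free below min in
every zone) can be checked mechanically.  The following is a minimal
Python sketch, assuming the abbreviated "ZONE free:NkB min:NkB" form
quoted in this mail rather than the full kernel log format; the helper
name zones_below_min is hypothetical and not part of any existing tool.

```python
import re

# Zone summary lines exactly as quoted in this thread (values in kB).
ZONE_LINES = """\
DMA free:12kB min:32kB
DMA32 free:2268kB min:6724kB
Normal free:84kB min:928kB
"""

# Assumption: one zone per line in the abbreviated form shown above.
# Full kernel logs carry more fields and would need a looser pattern.
ZONE_RE = re.compile(r"^\s*(\S+)\s+free:(\d+)kB\s+min:(\d+)kB", re.MULTILINE)


def zones_below_min(text):
    """Return (zone, free_kB, min_kB) for each zone whose free < min."""
    result = []
    for name, free, minimum in ZONE_RE.findall(text):
        free_kb, min_kb = int(free), int(minimum)
        if free_kb < min_kb:
            result.append((name, free_kb, min_kb))
    return result


if __name__ == "__main__":
    for zone, free_kb, min_kb in zones_below_min(ZONE_LINES):
        print(f"{zone}: free {free_kb}kB is below min {min_kb}kB")
```

With the three quoted values, every zone is reported, which matches the
observation that the reserves below the min watermark are being consumed.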