On Tue, Sep 30, 2014 at 1:16 AM, Michal Hocko <mho...@suse.cz> wrote:
> On Mon 29-09-14 20:43:46, Cong Wang wrote:
>> Hi, Johannes and Greg
>>
>>
>> Please consider to backport the following commit to stable kernels < 3.12.
>>
>> commit 3812c8c8f3953921ef18544110dafc3505c1ac62
>> Author: Johannes Weiner <han...@cmpxchg.org>
>> Date: Thu Sep 12 15:13:44 2013 -0700
>>
>> mm: memcg: do not trap chargers with full callstack on OOM
>>
>> It should solve some soft lockup I observed on different machines
>> recently. For me myself, I only care about 3.10. :-p
>
> Could you be more specific about the soft lockup you are seeing?
>

Sure, it is almost the same as the one in that changelog, which is why I
didn't provide it in my previous email. See the bottom of this email for
details.

Note that I am not entirely sure the deadlock was caused by the OOM killer
trying to kill the task sleeping on the inode mutex. It may also be that the
OOM killer failed to kill some frozen process, since I saw many processes
frozen. If that is the case, we will need my patch as well:
https://lkml.org/lkml/2014/9/4/646.

But anyway, that commit definitely fixes some real soft lockups, so it could
be a stable candidate even though it is a large one. I am willing to help if
needed.

Thanks!

---------------------->

[8073927.905238] INFO: task mesos-slave:10041 blocked for more than 120 seconds.
[8073927.905241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8073927.905243] mesos-slave D ffff88081bf46060 0 10041 10030 0x00000000
[8073927.905247] ffff8808208bddb8 0000000000000082 ffff8808545e2e40 ffff8808208bdfd8
[8073927.905252] ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80 ffff88081bf45c80
[8073927.905255] ffff880da4351f94 ffff880da4351f90 ffff880da4351f98 0000000000000000
[8073927.905258] Call Trace:
[8073927.905267] [<ffffffff814a40a6>] schedule+0x69/0x6b
[8073927.905270] [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
[8073927.905273] [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
[8073927.905278] [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
[8073927.905281] [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
[8073927.905284] [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
[8073927.905287] [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
[8073927.905289] [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
[8073927.905292] [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
[8073927.905295] [<ffffffff81058252>] ? task_work_run+0x82/0x94
[8073927.905300] [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
[8073927.905303] [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
[8073927.905307] [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f

sysrq-t output:

[8821221.981672] mesos-slave D ffff88081bf46060 0 10041 10030 0x00000000
[8821221.981674] ffff8808208bddb8 0000000000000082 ffff8808545e2e40 ffff8808208bdfd8
[8821221.981677] ffff8808208bdfd8 0000000000012a00 ffff88081bf45c80 ffff88081bf45c80
[8821221.981679] ffff880da4351f94 ffff880da4351f90 ffff880da4351f98 0000000000000000
[8821221.981682] Call Trace:
[8821221.981685] [<ffffffff814a40a6>] schedule+0x69/0x6b
[8821221.981687] [<ffffffff814a4484>] schedule_preempt_disabled+0xe/0x10
[8821221.981690] [<ffffffff814a3287>] __mutex_lock_common.isra.9+0x148/0x1d6
[8821221.981693] [<ffffffff811ec271>] ? security_inode_permission+0x1c/0x21
[8821221.981696] [<ffffffff814a3401>] __mutex_lock_slowpath+0x13/0x15
[8821221.981698] [<ffffffff814a3102>] mutex_lock+0x1f/0x2f
[8821221.981701] [<ffffffff81134f3c>] vfs_unlink+0x44/0xb7
[8821221.981703] [<ffffffff8113508b>] do_unlinkat+0xdc/0x17f
[8821221.981705] [<ffffffff814a426d>] ? _cond_resched+0xe/0x1e
[8821221.981707] [<ffffffff81058252>] ? task_work_run+0x82/0x94
[8821221.981711] [<ffffffff81002811>] ? do_notify_resume+0x57/0x65
[8821221.981714] [<ffffffff81135ca4>] SyS_unlink+0x16/0x18
[8821221.981716] [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f
[...]
[8821221.986069] python2.6 D ffff881054386060 0 41843 10049 0x00000004
[8821221.986071] ffff8809677f5930 0000000000000082 ffff880eedf0ae40 ffff8809677f5fd8
[8821221.986074] ffff8809677f5fd8 0000000000012a00 ffff881054385c80 000000030d5d1e86
[8821221.986077] ffff88084ea73000 ffff88084ea73000 ffff88041ef49720 ffff88084ea73000
[8821221.986080] Call Trace:
[8821221.986082] [<ffffffff814a40a6>] schedule+0x69/0x6b
[8821221.986085] [<ffffffff814a2e24>] schedule_timeout+0xf3/0x129
[8821221.986087] [<ffffffff810499ce>] ? __internal_add_timer+0xb6/0xb6
[8821221.986090] [<ffffffff814a2eb8>] schedule_timeout_uninterruptible+0x1e/0x20
[8821221.986092] [<ffffffff811232d3>] __mem_cgroup_try_charge+0x3ea/0x8ff
[8821221.986095] [<ffffffff81122d7a>] ? mem_cgroup_reclaim+0xb2/0xb2
[8821221.986097] [<ffffffff81123c2a>] mem_cgroup_charge_common+0x35/0x5d
[8821221.986100] [<ffffffff811250aa>] mem_cgroup_cache_charge+0x51/0x81
[8821221.986103] [<ffffffff810e237d>] add_to_page_cache_locked+0x3b/0x104
[8821221.986106] [<ffffffff810e245e>] add_to_page_cache_lru+0x18/0x39
[8821221.986110] [<ffffffff810e278a>] grab_cache_page_write_begin+0x87/0xb7
[8821221.986113] [<ffffffff81190c20>] ext4_write_begin+0xef/0x28b
[8821221.986116] [<ffffffff810e1adc>] generic_file_buffered_write+0xfd/0x20c
[8821221.986119] [<ffffffff8113c8fb>] ? update_time+0xa2/0xa9
[8821221.986122] [<ffffffff810e3375>] __generic_file_aio_write+0x1c0/0x1f8
[8821221.986124] [<ffffffff810e3408>] generic_file_aio_write+0x5b/0xa9
[8821221.986127] [<ffffffff8118951f>] ext4_file_write+0x2e5/0x376
[8821221.986129] [<ffffffff8100665c>] ? emulate_vsyscall+0x212/0x2f6
[8821221.986132] [<ffffffff8149ae9f>] ? __bad_area_nosemaphore+0xb4/0x1bf
[8821221.986135] [<ffffffff81128ee1>] do_sync_write+0x68/0x95
[8821221.986138] [<ffffffff81129566>] vfs_write+0xb2/0x117
[8821221.986141] [<ffffffff81129bb9>] SyS_write+0x46/0x74
[8821221.986144] [<ffffffff814aba46>] system_call_fastpath+0x1a/0x1f
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/