We built the kernel based on the Ubuntu tree, with some patches backported from 3.14 (https://github.com/nitrous-io/linux/commits/stable-trusty). I can try to run another stress test to see if we can replicate the bug with the debug kernel.
Hmm, moving back to overlayfs wasn't actually considered (we originally moved from overlayfs to aufs because we hit some weird bugs with overlayfs, but I forget what they were). However, we can try to remove the bind mounts and see if that helps.
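If we do end up giving overlayfs another try, the switch should just be a change to the rootfs line in the container config. Something like the following, if I remember the 1.0 syntax right (the paths here are made up for illustration):

    # current, aufs-backed:
    lxc.rootfs = aufs:/var/lib/lxc/base/rootfs:/var/lib/lxc/mycontainer/delta0
    # overlayfs instead:
    lxc.rootfs = overlayfs:/var/lib/lxc/base/rootfs:/var/lib/lxc/mycontainer/delta0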
Daniel.

On Wed, Mar 12, 2014 at 10:46 PM, Serge Hallyn <serge.hal...@ubuntu.com> wrote:
> Quoting Dao Quang Minh (dqmin...@gmail.com):
> > Hi all,
> >
> > We encountered a bug today when one of our systems entered a soft lockup as we tried to start a container. Unfortunately, at that point we had to power-cycle the machine because we couldn't access the system anymore. Here is the kernel.log:
> >
> > [14164995.081770] BUG: soft lockup - CPU#3 stuck for 22s! [lxc-start:20066]
> > [14164995.081784] Modules linked in: overlayfs(F) veth(F) xt_CHECKSUM(F) quota_v2(F) quota_tree(F) bridge(F) stp(F) llc(F) ipt_MASQUERADE(F) xt_nat(F) xt_tcpudp(F) iptable_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack(F) xt_LOG(F) iptable_filter(F) iptable_mangle(F) ip_tables(F) x_tables(F) intel_rapl(F) crct10dif_pclmul(F) crc32_pclmul(F) ghash_clmulni_intel(F) aesni_intel(F) ablk_helper(F) cryptd(F) lrw(F) gf128mul(F) glue_helper(F) aes_x86_64(F) microcode(F) isofs(F) xfs(F) libcrc32c(F) raid10(F) raid456(F) async_pq(F) async_xor(F) xor(F) async_memcpy(F) async_raid6_recov(F) raid6_pq(F) async_tx(F) raid1(F) raid0(F) multipath(F) linear(F)
> > [14164995.081820] CPU: 3 PID: 20066 Comm: lxc-start Tainted: GF B 3.13.4 #1
> > [14164995.081823] task: ffff880107da9810 ti: ffff8800f494e000 task.ti: ffff8800f494e000
> > [14164995.081825] RIP: e030:[<ffffffff811e266b>] [<ffffffff811e266b>] __lookup_mnt+0x5b/0x80
> > [14164995.081835] RSP: e02b:ffff8800f494fcd8 EFLAGS: 00000296
> > [14164995.081837] RAX: ffffffff81c6b7e0 RBX: 00000000011e7ab2 RCX: ffff8810a36890b0
> > [14164995.081838] RDX: 0000000000000997 RSI: ffff881005054f00 RDI: ffff881017f2fba0
> > [14164995.081840] RBP: ffff8800f494fce8 R08: 0035313638363436 R09: ffff881005054f00
> > [14164995.081841] R10: 0001010000000000 R11: ffffc90000000000 R12: ffff8810a29a3000
> > [14164995.081842] R13: ffff8800f494ff28 R14: ffff8800f494fdb8 R15: 0000000000000000
> > [14164995.081848] FS: 00007fabd0fec800(0000) GS:ffff88110e4c0000(0000) knlGS:0000000000000000
> > [14164995.081850] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [14164995.081851] CR2: 0000000001dce000 CR3: 00000000f515f000 CR4: 0000000000002660
> > [14164995.081853] Stack:
> > [14164995.081854]  ffff8800f494fd18 ffffffff81c6b7e0 ffff8800f494fd18 ffffffff811e2740
> > [14164995.081857]  ffff8800f494fdb8 ffff8800f494ff28 ffff8810a29a3000 ffff8800f494ff28
> > [14164995.081860]  ffff8800f494fd38 ffffffff811cd17e ffff8800f494fda8 ffff8810a29a3000
> > [14164995.081862] Call Trace:
> > [14164995.081868]  [<ffffffff811e2740>] lookup_mnt+0x30/0x70
> > [14164995.081872]  [<ffffffff811cd17e>] follow_mount+0x5e/0x70
> > [14164995.081875]  [<ffffffff811cffd2>] mountpoint_last+0xc2/0x1e0
> > [14164995.081877]  [<ffffffff811d01c7>] path_mountpoint+0xd7/0x450
> > [14164995.081883]  [<ffffffff817639e3>] ? _raw_spin_unlock_irqrestore+0x23/0x50
> > [14164995.081888]  [<ffffffff811a80a3>] ? kmem_cache_alloc+0x1d3/0x1f0
> > [14164995.081891]  [<ffffffff811d225a>] ? getname_flags+0x5a/0x190
> > [14164995.081893]  [<ffffffff811d225a>] ? getname_flags+0x5a/0x190
> > [14164995.081896]  [<ffffffff811d0574>] filename_mountpoint+0x34/0xc0
> > [14164995.081899]  [<ffffffff811d2f9a>] user_path_mountpoint_at+0x4a/0x70
> > [14164995.081902]  [<ffffffff811e317f>] SyS_umount+0x7f/0x3b0
> > [14164995.081907]  [<ffffffff8102253d>] ? syscall_trace_leave+0xdd/0x150
> > [14164995.081912]  [<ffffffff8176c87f>] tracesys+0xe1/0xe6
> > [14164995.081913] Code: 03 0d a2 56 b3 00 48 8b 01 48 89 45 f8 48 8b 55 f8 31 c0 48 39 ca 74 2b 48 89 d0 eb 13 0f 1f 00 48 8b 00 48 89 45 f8 48 8b 45 f8 <48> 39 c8 74 18 48 8b 50 10 48 83 c2 20 48 39 d7 75 e3 48 39 70
> >
> > After this point, it seems that every lxc-start fails, but the system continued to run until we power-cycled it.
> >
> > When I inspected some of the containers that were started during that time, I saw that one of them has a leftover lxc_putold directory (which should be removed when the container finishes starting up, right?). However, I'm not sure if that is related to the lockup above.
> >
> > The host is running on a 12.04 EC2 server, with lxc 1.0.0 and kernel 3.13.0-12.32.
>
> Hi,
>
> where did you get your kernel? Is there an updated version you can fetch or build?
>
> You might want to grab or build the debug symbols and see if you can track down what's actually happening in the kernel. The stack trace doesn't really make sense to me - I see where getname_flags() calls __getname(), which is kmem_cache_alloc(), but I don't see how that gets us to path_mountpoint(). An interrupt?
>
> Anyway, my *guess* is that this is a bug in aufs, which unfortunately is not upstream. If you could try replacing aufs with overlayfs and see if that causes the same problem, that would be a helpful datapoint.
>
> -serge
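P.S. For anyone else squinting at the trace: as I understand it, __lookup_mnt() in 3.13 walks one chain of the mount hash table looking for the child mount sitting on a given mountpoint. A soft lockup with RIP inside that function suggests the walk never terminates, e.g. because the chain was corrupted into a cycle that no longer leads back to the list head. A small userspace toy that models that failure mode (names and layout are made up for illustration; this is not the kernel source):

    #include <stdio.h>

    /* stand-in for struct mount and its hash-chain links */
    struct mnt {
        struct mnt *next;
        int id;
    };

    /* walk one circular hash chain looking for a matching entry;
     * terminates only if ->next eventually leads back to the head */
    static struct mnt *lookup_chain(struct mnt *head, int id)
    {
        for (struct mnt *p = head->next; p != head; p = p->next)
            if (p->id == id)
                return p;
        return NULL;
    }

    int main(void)
    {
        struct mnt head, a, b;

        /* healthy circular chain: head -> a -> b -> head */
        head.next = &a; a.next = &b; b.next = &head;
        a.id = 1; b.id = 2;
        printf("found id %d\n", lookup_chain(&head, 2)->id);

        /* corruption: b now points back at a instead of the head, so
         * looking up an id that is not on the chain would spin forever,
         * which is what a soft lockup in __lookup_mnt would look like */
        b.next = &a;
        /* lookup_chain(&head, 3);  -- deliberately not called */
        return 0;
    }

If that is what's happening here, the interesting question is what corrupted the chain in the first place, which fits Serge's guess about aufs.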
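On the lxc_putold question: as far as I understand lxc's startup, that directory is where the old root gets parked during pivot_root(), and it is detached and removed as the last step, so a leftover lxc_putold does suggest the container never finished the pivot. A minimal sketch of the usual pivot_root dance (the general pattern, not lxc's actual code; the rootfs path is illustrative, and it needs root plus its own mount namespace):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define NEW_ROOT "/var/lib/lxc/mycontainer/rootfs"  /* illustrative */

    int main(void)
    {
        /* private mount namespace, so the pivot doesn't leak to the host */
        if (unshare(CLONE_NEWNS) < 0)
            return perror("unshare"), 1;
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
            return perror("make / private"), 1;

        /* pivot_root requires the new root to be a mount point */
        if (mount(NEW_ROOT, NEW_ROOT, NULL, MS_BIND, NULL) < 0)
            return perror("bind mount"), 1;

        /* park the old root under <new root>/lxc_putold */
        mkdir(NEW_ROOT "/lxc_putold", 0700);
        if (syscall(SYS_pivot_root, NEW_ROOT, NEW_ROOT "/lxc_putold") < 0)
            return perror("pivot_root"), 1;
        chdir("/");

        /* detach the old root and remove the directory; if startup dies
         * before this point, lxc_putold is left behind, which matches
         * what I saw in the affected container */
        if (umount2("/lxc_putold", MNT_DETACH) < 0)
            return perror("umount2"), 1;
        rmdir("/lxc_putold");
        return 0;
    }

So the leftover directory is probably a symptom of lxc-start getting stuck (or killed) mid-startup rather than a cause of the lockup.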
_______________________________________________
lxc-users mailing list
lxc-users@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-users