Hi,
we recently updated our kernel to 4.1.16 plus the patch "unix: properly
account for FDs passed over unix sockets" and have since been seeing RCU
self-detected stalls triggered by the Samba daemon:
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] INFO: rcu_sched self-detected
> stall on CPU { 3} (t=162780 jiffies g=47565 c=47564 q=1055670)
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] Task dump for CPU 3:
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] smbd R running task
> 0 5938 1 0x0000000c
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] 0000000000000004
> ffffffff81851340 ffffffff810d3c84 000000000000b9cd
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] ffff8801bfd97100
> ffffffff81851340 ffffffff81851340 ffffffff818f6c60
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] ffffffff810d7659
> 0000000000000000 0000000000000000 00001e847fc2f700
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] Call Trace:
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] <IRQ> [<ffffffff810d3c84>] ?
> rcu_dump_cpu_stacks+0x84/0xc0
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810d7659>] ?
> rcu_check_callbacks+0x449/0x740
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810ec7c0>] ?
> tick_sched_do_timer+0x40/0x40
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810dcc54>] ?
> update_process_times+0x34/0x70
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810ec45c>] ?
> tick_sched_handle.isra.12+0x2c/0x70
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810ec809>] ?
> tick_sched_timer+0x49/0x80
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810dd57d>] ?
> __run_hrtimer+0x6d/0x1b0
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff810ddd4d>] ?
> hrtimer_interrupt+0xed/0x210
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff815a0ed9>] ?
> smp_apic_timer_interrupt+0x39/0x50
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff8159ef7e>] ?
> apic_timer_interrupt+0x6e/0x80
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] <EOI> [<ffffffff8159de85>] ?
> _raw_spin_lock+0x35/0x50
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff8153b343>] ?
> unix_dgram_connect+0x93/0x200
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff8147f248>] ?
> SYSC_connect+0xe8/0x100
> Feb 1 09:03:14 dcs1 kernel: [ 1152.840007] [<ffffffff8159e0f2>] ?
> system_call_fast_compare_end+0xc/0x6b
> Feb 1 11:48:13 ucs22f kernel: [307999.162254] INFO: rcu_sched self-detected
> stall on CPU { 0} (t=5250 jiffies g=6586733 c=6586732 q=6757)
> Feb 1 11:48:13 ucs22f kernel: [307999.162264] Task dump for CPU 0:
> Feb 1 11:48:13 ucs22f kernel: [307999.162267] smbd R running
> 0 4615 4609 0x00000008
> Feb 1 11:48:13 ucs22f kernel: [307999.162272] 00200082 f5863b90 c10b3fe9
> c1682cc0 c1682cc0 c1682cc0 f79d2b00 f5863bdc
> Feb 1 11:48:13 ucs22f kernel: [307999.162276] c10b722d c15bd400 00001482
> 0064816d 0064816c 00001a65 f79cd840 c108b166
> Feb 1 11:48:13 ucs22f kernel: [307999.162280] 00000001 f5863bb8 f5863bb8
> 00000000 c1682cc0 f6808cf0 00001a65 f6808cf0
> Feb 1 11:48:13 ucs22f kernel: [307999.162285] Call Trace:
> Feb 1 11:48:13 ucs22f kernel: [307999.162296] [<c10b3fe9>] ?
> rcu_dump_cpu_stacks+0x79/0xc0
> Feb 1 11:48:13 ucs22f kernel: [307999.162300] [<c10b722d>] ?
> rcu_check_callbacks+0x3cd/0x630
> Feb 1 11:48:13 ucs22f kernel: [307999.162304] [<c108b166>] ?
> account_process_tick+0x66/0x160
> Feb 1 11:48:13 ucs22f kernel: [307999.162307] [<c10bbe4f>] ?
> update_process_times+0x2f/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162310] [<c10cbf9d>] ?
> tick_sched_handle.isra.12+0x2d/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162328] [<c10cc210>] ?
> tick_sched_timer+0x40/0x80
> Feb 1 11:48:13 ucs22f kernel: [307999.162331] [<c10bc6b0>] ?
> __remove_hrtimer+0x40/0xa0
> Feb 1 11:48:13 ucs22f kernel: [307999.162334] [<c10bc97f>] ?
> __run_hrtimer+0x6f/0x190
> Feb 1 11:48:13 ucs22f kernel: [307999.162337] [<c10cc1d0>] ?
> tick_sched_do_timer+0x30/0x30
> Feb 1 11:48:13 ucs22f kernel: [307999.162339] [<c10bd15f>] ?
> hrtimer_interrupt+0xef/0x260
> Feb 1 11:48:13 ucs22f kernel: [307999.162343] [<c119ae3d>] ?
> getname_kernel+0x2d/0x100
> Feb 1 11:48:13 ucs22f kernel: [307999.162348] [<c1046f7f>] ?
> local_apic_timer_interrupt+0x2f/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162353] [<c14e4543>] ?
> smp_apic_timer_interrupt+0x33/0x50
> Feb 1 11:48:13 ucs22f kernel: [307999.162355] [<c14e3c7c>] ?
> apic_timer_interrupt+0x34/0x3c
> Feb 1 11:48:13 ucs22f kernel: [307999.162358] [<c14e2dc1>] ?
> _raw_spin_lock+0x51/0x70
> Feb 1 11:48:13 ucs22f kernel: [307999.162362] [<c148c075>] ?
> unix_state_double_lock+0x25/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162365] [<c148de10>] ?
> unix_dgram_connect+0x90/0x1f0
> Feb 1 11:48:13 ucs22f kernel: [307999.162369] [<c13e4267>] ?
> SYSC_connect+0xc7/0xe0
> Feb 1 11:48:13 ucs22f kernel: [307999.162371] [<c13e2931>] ?
> sock_map_fd+0x41/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162374] [<c13e5014>] ?
> SYSC_socketcall+0x1b4/0xa20
> Feb 1 11:48:13 ucs22f kernel: [307999.162376] [<c10c2940>] ?
> ktime_get+0x50/0x100
> Feb 1 11:48:13 ucs22f kernel: [307999.162379] [<c10466db>] ?
> lapic_next_event+0x1b/0x20
> Feb 1 11:48:13 ucs22f kernel: [307999.162381] [<c10ca0ed>] ?
> clockevents_program_event+0x9d/0x140
> Feb 1 11:48:13 ucs22f kernel: [307999.162385] [<c129e068>] ?
> list_del+0x8/0x20
> Feb 1 11:48:13 ucs22f kernel: [307999.162388] [<c1097ef7>] ?
> remove_wait_queue+0x27/0x40
> Feb 1 11:48:13 ucs22f kernel: [307999.162392] [<c11c8795>] ?
> inotify_read+0x295/0x340
> Feb 1 11:48:13 ucs22f kernel: [307999.162396] [<c10acc76>] ?
> handle_irq_event_percpu+0xa6/0x1a0
> Feb 1 11:48:13 ucs22f kernel: [307999.162399] [<c11a786f>] ?
> set_close_on_exec+0x2f/0x60
> Feb 1 11:48:13 ucs22f kernel: [307999.162402] [<c119d084>] ?
> do_fcntl+0x2f4/0x4e0
> Feb 1 11:48:13 ucs22f kernel: [307999.162405] [<c107d6df>] ?
> commit_creds+0xff/0x1f0
> Feb 1 11:48:13 ucs22f kernel: [307999.162407] [<c119d380>] ?
> SyS_fcntl64+0x60/0x100
> Feb 1 11:48:13 ucs22f kernel: [307999.162409] [<c13e5953>] ?
> SyS_socketcall+0x13/0x20
> Feb 1 11:48:13 ucs22f kernel: [307999.162412] [<c14e30db>] ?
> sysenter_do_call+0x12/0x12
We have not yet been able to reproduce the hang, but going back to our
previous kernel 4.1.12 makes the problem go away.
Is this a known issue or do you have an idea where to look?
What information should I collect next time it happens?
(Can unix_diag.ko with `ss` help?)
What other kernel config options should I enable to debug this deadlock?
Thanks in advance,
Philipp