I am trying to debug a problem that has been happening occationally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).
What is happening is that once in a while a process does something wrong
and segfaults, and dumps core. We have a handler to process the core dump
to name it and compress it and make sure we don't keep to many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.
Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.
The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading). Some people think they have seen it on other systems,
but are not sure. I have not been able to trigger it on other systems
yet.
It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.
The way I trigger it is:
abuse.c:
#include
#include
#include
#include
#include
#include
#include
int main() {
pid_t pid;
pid = getpid();
printf("%u: pid / PI = %f\n", pid, pid / M_PI);
kill(pid, SIGSEGV);
return 0;
}
I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done
That pretty reliably hits the problem.
The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P)
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P)
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P)
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P)
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P)
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P)
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P)
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P)
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P)
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P)
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P)
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P
3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8 EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX: RBX: 8803d8266ae0 RCX:
000c
Dec 19 12:24:03 HWC-64 kernel: RDX: RSI: 88042f29b840 RDI:
Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09:
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12:
Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15:
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS: 7f72f71a5700()
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS: 0010 DS: ES: CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4:
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0: DR1: DR2:
Dec 19 12:24:03 HWC-64 kernel: DR3: DR6: 0ff0 DR7:
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel: 0018 88043f38fe40
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel: 81301c94
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel: 88042f2c9fd8 88042f2c9fd8
a000
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---
That is call