> if you want to give those changes a run, I've uploaded them here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras

Latest experiments show that sometimes checking keventd_up() before
calling schedule_work() helps ... but mostly only when I fake some
early logs from low-numbered cpus.

I added some traces to the real case of a left-over fatal error and
got this splat:

[ 0.331551] smpboot: CPU0: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz (fam: 06, model: 3f, stepping: 04)
[ 0.342117] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
[ 0.353471] ... version: 3
[ 0.357948] ... bit width: 48
[ 0.362523] ... generic registers: 4
[ 0.367000] ... value mask: 0000ffffffffffff
[ 0.372935] ... max period: 0000ffffffffffff
[ 0.378870] ... fixed-purpose events: 3
[ 0.383347] ... event mask: 000000070000000f
[ 0.392357] x86: Booting SMP configuration:
[ 0.397031] .... node #0, CPUs: #1
[ 0.423373] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 0.432705] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17
[ 0.706878] .... node #1, CPUs: #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.094625] .... node #2, CPUs: #36
[ 1.112958] mcelog: cpu 36 bank 8 status be00000000010090
[ 1.119201] mcelog() stashed at entry=0
[ 1.203602] mce: [Hardware Error]: Machine check events logged
[ 1.220313] #37
[ 1.220412] BUG: unable to handle kernel
[ 1.226954] #38
[ 1.229107] NULL pointer dereference at 0000000000000008
[ 1.235052] IP: [<ffffffff810980a1>] process_one_work+0x31/0x420
[ 1.236829] #39PGD 0
[ 1.244558] Oops: 0000 [#1] SMP
[ 1.248189] Modules linked in:
[ 1.251617] CPU: 36 PID: 263 Comm: kworker/36:0 Not tainted 4.1.0-rc8 #9
[ 1.259100] #40
[ 1.259100] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[ 1.272832] #41
[ 1.272833] task: ffff88181c1f4470 ti: ffff88181bd24000 task.ti: ffff88181bd24000
[ 1.283350] RIP: 0010:[<ffffffff810980a1>]
[ 1.286433] #42 [<ffffffff810980a1>] process_one_work+0x31/0x420
[ 1.294976] RSP: 0000:ffff88181bd27e08 EFLAGS: 00010046

I.e. we die on the first attempt to log ... but that attempt is a
long way into bringing up all the cpus. CPU#36 is the first one
from socket 2 (counting 0, 1, 2, 3).

-Tony