An update on this: I have retrieved the machine that's having problems.
At home it does still trigger the crashes but not as easily. So I have
a choice of "crashes more easily, but some of these crashes hang and
will require someone to visit the site to power cycle" - this is a ~2h
round trip for me - or "doesn't crash as easily, but I can reboot it".
Still, it has usually crashed within a day or so - except when I tested
GENERIC rather than GENERIC.MP. I have seen at least 3 days uptime with
GENERIC without hitting a problem (then I accidentally knocked the
power; after it rebooted back onto MP it crashed within minutes).

So at this point I think it's fairly safe to say that it only seems to
affect MP, or at least happens very rarely with non-MP. But since I am
no longer triggering the problem particularly quickly, bisecting
further through old kernels is going to be a very slow process...

On 2020/11/29 18:37, Stuart Henderson wrote:
> On 2020/11/29 12:54, Stuart Henderson wrote:
> > On 2020/11/29 13:20, Theo Buehler wrote:
> > > On Sun, Nov 29, 2020 at 11:22:06AM +0000, Stuart Henderson wrote:
> > > > I have now seen mine crash with just the base "on by default" daemons,
> > > > one incoming ssh connection, top, and dhclient running.
> > > >
> > > > I'm going to try bisecting old kernels to see if I can figure out when
> > > > it was introduced.
> > > >
> > > > It might also be interesting to try GENERIC rather than GENERIC.MP.
> > > >
> > >
> > > Thanks for digging into this. Your APU seems much worse off than mine,
> > > which takes a few weeks before crashing these days, so it's not much use
> > > for bisecting.
> > >
> > > Just a few data points that may help, assuming we see the same thing.
> > >
> > > I had been running the firmware 4.10.0.3 for more than a year with
> > > seemingly no issues, but I updated to 4.12.0.6 early November.
> > >
> > > My snapshot updates prior to running into crashes were
> > >
> > > Jul 7 -> Aug 21 -> Sep 21.
> > >
> > > The first crash I had was with the Sep 21 snapshot after a bit more than
> > > a week uptime.
> > >
> > > With early October snapshots it got particularly bad with crashes almost
> > > daily, that's when I reported. The first snap I saw crashing when going
> > > back and forth was from Sep 5.
> > >
> > > Assuming you see the same thing as me, this would likely make the window
> > > for bisecting into
> > >
> > > Jul 7 <-> Sep 5.
> > >
> > > I always ran GENERIC.MP.
> > >
> >
> > Thanks, I found your earlier mail and started with Sep 11 which crashed
> > after about half an hour. I would have tried something around the 5th next
> > (there weren't many snaps built 5-13th) but given what you say I'll go a
> > little earlier so I'm now trying Sep 2 and I have kernels from a few other
> > snapshots around then lined up.
> >
>
> Results so far:
>
> 20200913 6.8-beta (GENERIC.MP) #65: Fri Sep 11 11:30:09 MDT 2020
> kernel: double fault trap, code=0
> Stopped at __mp_unlock+0x31: pushq %rax
>
> 20200903 6.8-beta (GENERIC.MP) #56: Wed Sep 2 10:46:22 MDT 2020
> kernel: double fault trap, code=0
> Stopped at intr_user_exit_post_ast+0x3e: callq
> intr_user_exit_post_ast 0x52
>
> 20200830 6.7-current (GENERIC.MP) #49: Sat Aug 29 10:11:07 MDT 2020
> attempt to execute user address 0x0 in supervisor mode
> kernel: page fault trap, code=0
> Stopped at 0
> ddb{1}> tr
> 0(800,0,1771,ffffffff81982d54,0,6d89723) at 0
> ffff80001fc2ca90(bf243abd3ab5d267,ffff80001fa79010,ffff80001fa78ff0,ffff8000fff
> fe278,0,0) at 0xffff80001fc2ca90
> sched_idle(ffff80001fa78ff0) at sched_idle+0x27e
> end trace frame: 0x0, count: -2
> ddb{1}> sh reg
> rdi     0x800
> rsi     0
> rbp     0xffff80001fc2ca90
> rbx     0x1388 __ALIGN_SIZE+0x388
> rdx     0x1771 __ALIGN_SIZE+0x771
> rcx     0x1388 __ALIGN_SIZE+0x388
> rax     0
> r8      0x202
> r9      0x1
> r10     0xb08128fb379a3b3d
> r11     0xbf243abd3ab5d267
> r12     0xffff800000024600
> r13     0xffff80001fa78ff0
> r14     0xffff80001fa796f8
> r15     0xffff800000089660
> rip     0
> cs      0x8
> rflags  0x10202 __ALIGN_SIZE+0xf202
> rsp     0xffff80001fc2ca40
> ss      0x10
> 0
> ddb{1}> ps /o
>    TID    PID  UID   PRFLAGS  PFLAGS  CPU  COMMAND
> 332094  85912  736       0x2       0    0  ssh
>  34031  33673  736  0x100002   0x480    3  ssh
> 277703  64495  736  0x100002   0x480    2  ssh
>
> 20200825 6.7-current (GENERIC.MP) #39: Mon Aug 24 11:09:52 MDT 2020
> - hanged after 1h21 uptime
>
> It is not responding to BREAK so I will need to wait until someone
> can go on-site to power-cycle it before I can test any more, not sure
> when that will be.
>