memtest86+ has been running for 6 days, no issues found.

On 2020/12/22 23:49, Stuart Henderson wrote:
> An update on this:
> 
> I have retrieved the machine that's having problems. At home it does
> still trigger the crashes but not as easily. So I have a choice of
> "crashes more easily, but some of these crashes hang and will
> require someone to visit the site to power cycle" - this is a ~2h
> round trip for me - or "doesn't crash as easily, but I can reboot
> it".
> 
> Still, it has usually crashed within a day or so - except when I tested
> GENERIC rather than GENERIC.MP. I have seen at least 3 days uptime
> with GENERIC without hitting a problem (then I accidentally knocked the
> power; after it rebooted back onto MP it crashed within minutes).
> 
> So at this point I think it's fairly safe to say that it only seems
> to affect MP, or at least happens very rarely with non-MP.
> 
> But since I am no longer triggering the problem particularly quickly,
> bisecting further through old kernels is going to be a very slow
> process...
> 
> 
> On 2020/11/29 18:37, Stuart Henderson wrote:
> > On 2020/11/29 12:54, Stuart Henderson wrote:
> > > On 2020/11/29 13:20, Theo Buehler wrote:
> > > > On Sun, Nov 29, 2020 at 11:22:06AM +0000, Stuart Henderson wrote:
> > > > > I have now seen mine crash with just the base "on by default" daemons,
> > > > > one incoming ssh connection, top, and dhclient running.
> > > > > 
> > > > > I'm going to try bisecting old kernels to see if I can figure out when
> > > > > it was introduced.
> > > > > 
> > > > > It might also be interesting to try GENERIC rather than GENERIC.MP.
> > > > > 
> > > > 
> > > > Thanks for digging into this. Your APU seems much worse off than mine,
> > > > which takes a few weeks before crashing these days, so it's not much use
> > > > for bisecting.
> > > > 
> > > > Just a few data points that may help, assuming we see the same thing.
> > > > 
> > > > > I had been running firmware 4.10.0.3 for more than a year with
> > > > seemingly no issues, but I updated to 4.12.0.6 early November.
> > > > 
> > > > My snapshot updates prior to running into crashes were
> > > > 
> > > > Jul 7 -> Aug 21 -> Sep 21.
> > > > 
> > > > The first crash I had was with the Sep 21 snapshot after a bit more than
> > > > a week uptime.
> > > > 
> > > > With early October snapshots it got particularly bad with crashes almost
> > > > daily, that's when I reported. The first snap I saw crashing when going
> > > > back and forth was from Sep 5.
> > > > 
> > > > Assuming you see the same thing as me, this would likely make the window
> > > > for bisecting into
> > > > 
> > > > Jul 7 <-> Sep 5.
> > > > 
> > > > I always ran GENERIC.MP.
> > > > 
> > > 
> > > Thanks, I found your earlier mail and started with Sep 11 which crashed
> > > after about half an hour. I would have tried something around the 5th next
> > > (there weren't many snaps built 5-13th) but given what you say I'll go a
> > > little earlier so I'm now trying Sep 2 and I have kernels from a few other
> > > snapshots around then lined up.
> > > 
> > 
> > Results so far:
> > 
> > 20200913 6.8-beta (GENERIC.MP) #65: Fri Sep 11 11:30:09 MDT 2020
> > kernel: double fault trap, code=0
> > Stopped at      __mp_unlock+0x31:       pushq   %rax
> > 
> > 20200903 6.8-beta (GENERIC.MP) #56: Wed Sep 2 10:46:22 MDT 2020
> > kernel: double fault trap, code=0
> > > Stopped at      intr_user_exit_post_ast+0x3e:   callq   intr_user_exit_post_ast 0x52
> > 
> > 20200830 6.7-current (GENERIC.MP) #49: Sat Aug 29 10:11:07 MDT 2020
> > attempt to execute user address 0x0 in supervisor mode
> > kernel: page fault trap, code=0
> > Stopped at      0
> > ddb{1}> tr
> > 0(800,0,1771,ffffffff81982d54,0,6d89723) at 0
> > > ffff80001fc2ca90(bf243abd3ab5d267,ffff80001fa79010,ffff80001fa78ff0,ffff8000ffffe278,0,0) at 0xffff80001fc2ca90
> > sched_idle(ffff80001fa78ff0) at sched_idle+0x27e
> > end trace frame: 0x0, count: -2
> > ddb{1}> sh reg
> > rdi                            0x800
> > rsi                                0
> > rbp               0xffff80001fc2ca90
> > rbx                           0x1388    __ALIGN_SIZE+0x388
> > rdx                           0x1771    __ALIGN_SIZE+0x771
> > rcx                           0x1388    __ALIGN_SIZE+0x388
> > rax                                0
> > r8                             0x202
> > r9                               0x1
> > r10               0xb08128fb379a3b3d
> > r11               0xbf243abd3ab5d267
> > r12               0xffff800000024600
> > r13               0xffff80001fa78ff0
> > r14               0xffff80001fa796f8
> > r15               0xffff800000089660
> > rip                                0
> > cs                               0x8
> > rflags                       0x10202    __ALIGN_SIZE+0xf202
> > rsp               0xffff80001fc2ca40
> > ss                              0x10
> > 0
> > ddb{1}> ps /o
> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> >  332094  85912    736         0x2          0    0  ssh
> >   34031  33673    736    0x100002      0x480    3  ssh
> >  277703  64495    736    0x100002      0x480    2  ssh
> > 
> > 20200825 6.7-current (GENERIC.MP) #39: Mon Aug 24 11:09:52 MDT 2020
> > > - hung after 1h21 uptime
> > 
> > It is not responding to BREAK so I will need to wait until someone
> > can go on-site to power-cycle it before I can test any more, not sure
> > when that will be.
> > 
> 
