On Fri, Feb 12, 2016 at 11:07:35AM -0500, Johan Huldtgren wrote:
> hello,
>
> a bit over a week ago one of my remote boxes stopped responding, once
> I got out there I noticed it had panicked, it doesn't have a serial
> console so I grabbed some pictures and tried to bring it back alive.
> The box came back but my softraid never recovered, this is all
> documented in this thread:
>
> http://marc.info/?t=145476742800004&r=1&w=2
>
> The end result seemed that the metadata had been destroyed and
> softraid couldn't recover. I cut my losses, recreated the softraid and
> started copying the data back. After having copied (with rsync) for
> about a day (it's roughly 8TB of data, so takes some time) the machine
> panicked again. Once again it seems my softraid is unrecoverable. I'm
> not sure if I have some broken hardware (drives), but I don't see any
> indication of this in any logs, and to the extent SMART can be trusted
> it hasn't found any errors on any drives either. Any hints on
> tracking this down further? If it is indeed flaky hardware it would be
> good to be able to narrow down what is bad, and whether that is causing
> the panics or if I've just snagged a bug.
>
> One thing perhaps worth mentioning is that once I'd grabbed all the
> relevant info, booting the box from the ddb prompt is not possible:
> it locks up hard and has to be reset to be brought back. I left
> "boot dump" running overnight and eight hours later, nothing. Last
> time I tried "boot reboot" and it too hung until I reset the machine.
>
> transcribed panic, trace, ps, uvm, bcstats, registers as well as dmesg
> below. I've also put the pictures online in case I made any
> transcription errors.
>
> http://www.huldtgren.com/panics/20160211/
>
> thanks,
>
> .jh
>
>
> uvm_fault(0xffffffff8193f240, 0x38, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at      sr_validate_io+0x36:    movl    0x38(%r9),%r10d
> ddb{1}> trace
> sr_validate_io() at sr_validate_io+0x36
> sr_raid5_rw() at sr_raid5_rw+0x40
> sr_raid_recreate_wu() at sr_raid_recreate_wu+0x2c
> sr_wu_done_callback() at sr_wu_done_callback+0x17a
> taskq_thread() at taskq_thread+0x6c
Thanks for all the detail you've provided here. The fault appears to be
caused by a NULL xs. A diff that errors out in that case is provided
below. Perhaps someone familiar with the softraid/scsi code can comment
as to why this is occurring.

Index: sys/dev/softraid.c
===================================================================
RCS file: /cvs/src/sys/dev/softraid.c,v
retrieving revision 1.365
diff -u -p -r1.365 softraid.c
--- sys/dev/softraid.c	29 Dec 2015 04:46:28 -0000	1.365
+++ sys/dev/softraid.c	13 Feb 2016 04:35:39 -0000
@@ -4545,6 +4545,9 @@ sr_validate_io(struct sr_workunit *wu, d
 	struct scsi_xfer *xs = wu->swu_xs;
 	int rv = 1;
 
+	if (xs == NULL)
+		goto bad;
+
 	DNPRINTF(SR_D_DIS, "%s: %s 0x%02x\n", DEVNAME(sd->sd_sc),
 	    func, xs->cmd->opcode);