On Fri, Feb 12, 2016 at 11:07:35AM -0500, Johan Huldtgren wrote:
> hello,
> 
> A bit over a week ago one of my remote boxes stopped responding. Once
> I got out there I noticed it had panicked; it doesn't have a serial
> console, so I grabbed some pictures and tried to bring it back alive.
> The box came back, but my softraid never recovered. This is all
> documented in this thread:
> 
> http://marc.info/?t=145476742800004&r=1&w=2
> 
> The end result seemed to be that the metadata had been destroyed and
> softraid couldn't recover. I cut my losses, recreated the softraid,
> and started copying the data back. After copying (with rsync) for
> about a day (it's roughly 8TB of data, so it takes some time), the
> machine panicked again. Once again my softraid seems unrecoverable.
> I'm not sure if I have some broken hardware (drives), but I don't see
> any indication of this in any logs, and to the extent SMART can be
> trusted, it hasn't found any errors on any drives either. Any hints on
> tracking this down further? If it is indeed flaky hardware it would be
> good to be able to narrow down what is bad, and whether that's causing
> the panics, or if I've just snagged a bug.
> 
> One thing perhaps worth mentioning: once I'd grabbed all the relevant
> info, booting the box from the ddb prompt was not possible; it locks
> up hard and has to be reset to be brought back. I left "boot dump"
> running overnight, and eight hours later nothing had happened. Last
> time I tried "boot reboot" and it too hung until I reset the machine.
> 
> The transcribed panic, trace, ps, uvm, bcstats, and registers, as
> well as the dmesg, are below. I've also put the pictures online in
> case I made any transcription errors:
> 
> http://www.huldtgren.com/panics/20160211/
> 
> thanks,
> 
> .jh
> 
> 
> uvm_fault(0xffffffff8193f240, 0x38, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at      sr_validate_io+0x36:    movl    0x38(%r9),%r10d
> ddb{1}> trace
> sr_validate_io() at sr_validate_io+0x36
> sr_raid5_rw() at sr_raid5_rw+0x40
> sr_raid_recreate_wu() at sr_raid_recreate_wu+0x2c
> sr_wu_done_callback() at sr_wu_done_callback+0x17a
> taskq_thread() at taskq_thread+0x6c

Thanks for all the detail you've provided here.  The fault appears to be
caused by a NULL xs.  A diff that errors out in that case is provided
below.
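
For what it's worth, the fault address is consistent with that: %r9
presumably holds the NULL xs, and the faulting instruction reads a
member at offset 0x38, so uvm_fault() reports address 0x38.  A minimal
userland sketch of the arithmetic (the struct layout here is a made-up
stand-in, not the real struct scsi_xfer):

#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical layout: some member that happens to live at offset
 * 0x38, standing in for whatever sr_validate_io() reads from xs.
 */
struct xfer_like {
	char	pad[0x38];
	int	member;			/* movl 0x38(%r9),%r10d */
};

int
main(void)
{
	/* NULL + 0x38 == 0x38, the address in the uvm_fault() line. */
	printf("offset = %#zx\n", offsetof(struct xfer_like, member));
	return (0);
}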

Perhaps someone familiar with the softraid/scsi code can comment as to
why this is occurring.

Index: sys/dev/softraid.c
===================================================================
RCS file: /cvs/src/sys/dev/softraid.c,v
retrieving revision 1.365
diff -u -p -r1.365 softraid.c
--- sys/dev/softraid.c  29 Dec 2015 04:46:28 -0000      1.365
+++ sys/dev/softraid.c  13 Feb 2016 04:35:39 -0000
@@ -4545,6 +4545,9 @@ sr_validate_io(struct sr_workunit *wu, d
        struct scsi_xfer        *xs = wu->swu_xs;
        int                     rv = 1;
 
+       if (xs == NULL)
+               goto bad;
+
        DNPRINTF(SR_D_DIS, "%s: %s 0x%02x\n", DEVNAME(sd->sd_sc), func,
            xs->cmd->opcode);
 

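With the check in place, sr_validate_io() presumably takes the
function's existing "bad" path and returns rv, which is initialised to
1, so the caller sees the work unit as failed instead of the kernel
faulting on the dereference.  Why a NULL xs reaches this point at all
(via sr_raid_recreate_wu(), per the trace) is still the open question.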