Hi again,
I attempted to get the debug messages to print. In
/usr/src/sys/arch/amd64/conf, I copied GENERIC.MP to
GENERIC.MP.BIODEBUG and made this change:
--- GENERIC.MP Wed Feb 5 23:54:35 2025
+++ GENERIC.MP.BIODEBUG Wed Feb 5 15:38:10 2025
@@ -6,4 +6,6 @@
#option MP_LOCKDEBUG
#option WITNESS
+option SR_DEBUG
+
cpu* at mainbus?
Then, I recompiled the kernel and rebooted. I didn't see any debug
messages related to softraid, even though my system partitions are using
a RAID 1C device.
# bioctl -vi softraid0
Volume Status Size Device
softraid0 0 Online 1999861775872 sd5 RAID1C
0 Online 1999861775872 0:0.0 noencl <sd2a>
'unknown serial'
1 Online 1999861775872 0:1.0 noencl <sd3a>
'unknown serial'
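For reference, a rebuild like the one mentioned above would normally follow the usual OpenBSD custom-kernel steps. The sketch below is an illustration of that procedure under stated assumptions, not a transcript of the exact commands typed; the run() wrapper and the DRY_RUN variable are hypothetical helpers added so the steps can be printed without touching the system.

```shell
#!/bin/sh
# Sketch of the usual custom-kernel build steps for the config above.
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on the OpenBSD machine) to actually run them.
DRY_RUN=${DRY_RUN:-1}

CONF=GENERIC.MP.BIODEBUG            # config file name from this message
ARCH_DIR=/usr/src/sys/arch/amd64    # source path from this message

# run CMD [ARGS...]: print the command in dry-run mode, else execute it.
run() {
	if [ "$DRY_RUN" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

# Generate the build directory from the config file, then build and
# install the kernel. Stops at the first failing step.
build_kernel() {
	run cd "$ARCH_DIR/conf" &&
	run config "$CONF" &&
	run cd "$ARCH_DIR/compile/$CONF" &&
	run make &&
	run make install
}

build_kernel
```

In dry-run mode this only lists the five steps, which makes it easy to check that the kernel being built is the one with the SR_DEBUG option in its config.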
So, I looked at the source in /usr/src/sys/dev. It seems that several
of the RAID disciplines have ifdef statements to handle SR_DEBUG, but not
the RAID 1C discipline:
# grep SR_DEBUG softraid*c |sort |uniq
softraid.c:#endif /* SR_DEBUG */
softraid.c:#ifdef SR_DEBUG
softraid_crypto.c:#endif /* SR_DEBUG */
softraid_crypto.c:#ifdef SR_DEBUG0
softraid_raid1.c:#ifdef SR_DEBUG
softraid_raid5.c:#ifdef SR_DEBUG
# grep DEBUG softraid_raid1c.c
#
So, I'm thinking that I set the option correctly, but perhaps the
debugging isn't available for RAID 1C?
Also, I got hold of another drive that I can copy the entire original
16TB partition to and run tests against. Is there a procedure where I
could copy out the correct byte range to my new drive with dd and try to
mount it using the simple CRYPTO discipline (bioctl -c C instead of bioctl
-c 1C)?
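To make the byte-range part of the question concrete, here is a rough dry-run sketch that only prints a dd command and never executes it. The 528-sector data offset is an assumption that should be checked against SR_DATA_OFFSET in sys/dev/softraidvar.h, and both device names are hypothetical placeholders. The copied data would still be encrypted, so this covers only the arithmetic, not whether a CRYPTO volume can be attached to the result.

```shell
#!/bin/sh
# Dry run: print (do not execute) a dd command that skips the softraid
# metadata area at the start of a chunk.

SECTOR_SIZE=512
# ASSUMPTION: chunk data starts 528 sectors in (SR_DATA_OFFSET).
# Verify against sys/dev/softraidvar.h before relying on it.
DATA_OFFSET_SECTORS=528

# dd_skip_blocks OFFSET_SECTORS BLOCK_SIZE
# Prints how many BLOCK_SIZE blocks make up the offset. Fails if the
# offset is not a whole number of blocks.
dd_skip_blocks() {
	offset_bytes=$(( $1 * SECTOR_SIZE ))
	if [ $(( offset_bytes % $2 )) -ne 0 ]; then
		echo "offset $offset_bytes is not a multiple of $2" >&2
		return 1
	fi
	echo $(( offset_bytes / $2 ))
}

# print_dd_command SRC DST BLOCK_SIZE
# Prints the dd invocation that would copy everything after the offset.
print_dd_command() {
	skip=$(dd_skip_blocks "$DATA_OFFSET_SECTORS" "$3") || return 1
	echo "dd if=$1 of=$2 bs=$3 skip=$skip"
}

# Hypothetical device names: working chunk and the new test drive.
print_dd_command /dev/rsd0a /dev/rsd6a 8192
```

Under that assumption the offset is 528 * 512 = 270336 bytes, which is exactly 33 blocks of 8192 bytes, so a block size larger than 512 can be used without misaligning the copy.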
Thank you,
--James
On Sat, 25 Jan 2025, Stefan Sperling wrote:
> Date: Sat, 25 Jan 2025 23:12:01 +0100
> From: Stefan Sperling <[email protected]>
> To: James Boyle <[email protected]>
> Cc: [email protected]
> Subject: Re: softraid, bioctl -c 1C failed array question
>
> On Fri, Jan 24, 2025 at 02:53:06PM -0500, James Boyle wrote:
> > Hello,
> >
> > I was hoping to get a little help with bioctl and the 1C raid mode after a
> > drive failure. The most recent error message I'm getting when trying to
> > start the array in a degraded mode is:
> > # bioctl -c 1C -l /dev/sd0a softraid0
> > softraid0: RAID 1C requires two or more chunks
> >
> > Previously, the array had two identical Toshiba 16TB drives as sd0 and
> > sd1. The array used partitions sd0a and sd1a. One of those drives, sd1,
> > failed before Christmas. I was able to run the degraded array without
> > issue. After replacing the failed drive, I kicked off a rebuild using
> > bioctl -R. The array came back to the optimal "Online" state. Just a few
> > days ago, the second drive of the original pair failed. I was able to
> > again start the array with only one working drive (sd0 is the failed
> > drive, sd1 is the new drive, sd2 & sd3 are part of another array):
> >
> > # for X in sd{0,1,2,3,4,5,6} ; do bioctl -v ${X} ; done
> > sd0: <ATA, TOSHIBA MG08ACA1, 0102>, serial 71H0A3SWFVGG
> > sd1: <ATA, TOSHIBA MG08ACA1, 0103>, serial 44M0A008FVGG
> > sd2: <ATA, WDC WD2000F9YZ-0, 01.0>, serial WD-WMC160D3WKSS
> > sd3: <ATA, TOSHIBA HDWE150, FP2A>, serial 38EBK7BTF57D
> > Volume Status Size Device
> > softraid0 0 Online 1999861775872 sd4 RAID1C
> > 0 Online 1999861775872 0:0.0 noencl <sd2a>
> > 'unknown serial'
> > 1 Online 1999861775872 0:1.0 noencl <sd3a>
> > 'unknown serial'
> > Volume Status Size Device
> > softraid0 1 Degraded 16000895729664 sd5 RAID1C
> > 0 Offline 16000895729664 1:0.0 noencl <sd0a>
> > 'unknown serial'
> > 1 Online 16000895729664 1:1.0 noencl <sd1a>
> > 'unknown serial'
> >
> > After that I shut the system down and removed the failed drive. When the
> > system started again, what was previously sd1 had been initialized as sd0.
> > The other (boot/system) array started fine. I was unable to start the
> > degraded array. I got the error messages:
> >
> > softraid0: trying to bring up sd5 degraded
> > softraid0: trying to bring up sd5 degraded
> > softraid0: sd5 is offline, will not be brought online
> > softraid0: trying to bring up sd5 degraded
> > softraid0: trying to bring up sd5 degraded
> > softraid0: sd5 is offline, will not be brought online
> > softraid0: RAID 1C requires two or more chunks
> > softraid0: RAID 1C requires two or more chunks
> >
> > At one point I put the failed drive back in to see if it could start. I'm
> > afraid that may have been the wrong thing to do.
>
> Before you removed the above sd0 drive, the state of the working drive
> (then sd1) was "Online".
>
> What is the current state of this working drive? Is it still Online now?
> It doesn't sound like it is. Maybe it's now also in degraded state, for
> example due to a transient write error?
> If it is still in Online state then the above errors look like a bug.
>
> You will not be able to use bioctl to see the current state while the
> volume isn't assembled. But there is the SR_DEBUG kernel option. A kernel
> compiled with this option enabled should eventually print the state into
> dmesg on a line which contains "scm_status".
>
> The volume state values are defined in sys/dev/biovar.h:
>
> #define BIOC_SDONLINE 0x00
> #define BIOC_SDONLINE_S "Online"
> etc.
>
> The on-disk meta data structures can be found in sys/dev/softraidvar.h.
>
> > Is there a way to troubleshoot and restart the array with just the single
> > working drive as a degraded array again?
>
> You'll need at least one chunk in Online state to perform a rebuild and
> rescue the array. Otherwise, it seems the only officially supported way
> out would be to create a fresh volume and restore the data from backup.
>
> If your working drive is really still working, it should be possible
> to extract the data somehow using raw disk reads to obtain an image of
> the filesystem without the softraid meta data headers, and mounting that
> image on a vnd(4) device with vnconfig(8) and then copying the files out
> to a new array. I've never had to try that myself yet, fortunately.
>