[cc's trimmed back to bugs@]

On 2021/03/02 19:00, Mark Schneider wrote:
> On 02.03.21 10:39, Stuart Henderson wrote:
> > On 2021/03/02 00:09, Mark Schneider wrote:
> > > Hi,
> > > 
> > > Thank you for your feedback.
> > > 
> > > Also OpenBSD 6.9beta snapshot is crashing when I setup RAID5 with three
> > > "Samsung PRO 860 1TB" SSDs.
> > > OpenBSD obsd69b.it-infra.org 6.9 GENERIC.MP#368 amd64
> > > 
> > > obsd69b# dmesg | grep -i bios
> > > bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdc312018 (61 entries)
> > > bios0: vendor American Megatrends Inc. version "2201" date 03/23/2015
> > > bios0: ASUSTeK COMPUTER INC. CROSSHAIR V FORMULA-Z
> > > acpi0 at bios0: ACPI 5.0
> > Can you isolate softraid from the equation? Are the drives reliable with
> > this hardware configuration when not using softraid? I guess it would
> > need testing with simultaneous writes to the 3 drives to give a closer
> > match to the situation with softraid.
> 
> Thanks a lot for all hints Stuart.
> 
> The isolated 1TB SSD Samsung PRO 860 drives have some AHCI errors
> (OpenBSD_6.9beta-RAID5-3x1TB-SSD-isolated.txt in the attachment).
> 
> 
> Writing to an "isolated" drive does not crash OpenBSD, even though there are
> AHCI errors and sometimes an I/O error from dd (see directly below).

Thanks. So even if softraid were to not crash, things would not be
in good shape and probably all it could do is mark the component as
failed. Which would be better than a crash but suboptimal ;)

> # ---
> obsd69b# dd if=/dev/urandom of=/ssd1T-sd1a/1GB-urandom.bin bs=1M count=1024
> dd: /ssd1T-sd1a/1GB-urandom.bin: Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes transferred in 0.014 secs (0 bytes/sec)
> 
> obsd69b# dd if=/dev/urandom of=/ssd1T-sd1a/1GB-urandom.bin bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes transferred in 5.156 secs (208228191 bytes/sec)
> 
> # ---
> 
> ahci2: NCQ errored slot 3 is idle (04000000 active)
> ahci2: NCQ errored slot 13 is idle (7c0f01eb active)
> ahci2: NCQ errored slot 26 is idle (03fe1e07 active)
> ahci2: NCQ errored slot 30 is idle (03e1e38f active)
> ahci2: NCQ errored slot 28 is idle (03e1fc71 active)
> ahci2: NCQ errored slot 30 is idle (03fc9f81 active)
> ahci2: NCQ errored slot 9 is idle (0f0ee03f active)
> ahci2: NCQ errored slot 16 is idle (70f400ff active)
> ahci2: NCQ errored slot 28 is idle (0f3c407f active)
> ahci2: NCQ errored slot 13 is idle (70dc41fc active)
> ahci2: NCQ errored slot 17 is idle (0f3c1fe0 active)
> ahci2: NCQ errored slot 30 is idle (0f7c181f active)
> 
> 
> Writing to all "isolated" drives simultaneously does not crash OpenBSD, even
> though there are AHCI errors.

Some layers don't cope well with errors in the layers below, especially if
those errors are relatively uncommon or were never bumped into during
development, so it's not a complete surprise. I don't get the impression
that many people are running softraid raid5, so it's unlikely anyone
bumped into these bugs before you. (There's a fair chance that softdep
would run into problems too; it doesn't cope too well with lower layer
failure either.)
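
For repeating the simultaneous-write test on the bare drives, something
like the following could be used. This is only a rough sketch: the
function name, the output file name and the second and third mount points
in the usage comment are made up for illustration (only /ssd1T-sd1a
appears in the report above).

```shell
#!/bin/sh
# Sketch: start one dd per target directory in the background, then wait
# for each one and report which writes failed.
# Assumption: target directories contain no whitespace.
parallel_write() {
    count=$1; shift
    jobs_list=""
    for dir in "$@"; do
        dd if=/dev/urandom of="$dir/urandom-test.bin" bs=1M count="$count" &
        jobs_list="$jobs_list $!:$dir"     # remember pid and its target
    done
    failed=0
    for entry in $jobs_list; do
        pid=${entry%%:*}
        dir=${entry#*:}
        if wait "$pid"; then
            echo "ok: $dir"
        else
            echo "FAILED: $dir"
            failed=$((failed + 1))
        fi
    done
    return "$failed"                       # number of failed writes
}

# Hypothetical usage (mount points other than the first are assumed):
#   parallel_write 1024 /ssd1T-sd1a /ssd1T-sd2a /ssd1T-sd3a
```

The return status is the number of dd processes that exited non-zero, so
an I/O error on any one drive shows up even if the others finish cleanly.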

> # OpenBSD 6.9beta is crashing after a dd command writing to the RAID5
> softraid volume (sd4a) and the access to the ddb{4}> prompt is not possible
> to run trace, ps or sh commands (the root console is dead).
> 
> 
> > "trace" and "sh reg" from ddb would give more clues.
> 
> I am not able to run the commands above as the root ddb{4} console is dead
> (I can see only the last error message but I am not able to type in using
> the keyboard)

Things to try for this:

"sysctl machdep.forceukbd=1" may allow the keyboard to work

"sysctl ddb.panic=0" I don't know if this will help as it isn't
technically a panic, but it may show you a stack trace (and then
try to write a kernel coredump to the swap partition which may
or may not work, then reboot). So with a bit of luck you might
be able to at least grab a screenshot.
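
Since the machine will probably be rebooted a few times while testing, it
may be easier to put the two settings into /etc/sysctl.conf so they are
applied at every boot. A minimal sketch; the helper name persist_sysctl
is invented for this example, and it simply appends "name=value" lines
that are not already present:

```shell
#!/bin/sh
# Sketch: append each name=value setting to a sysctl.conf-style file,
# skipping settings that are already listed verbatim.
persist_sysctl() {
    conf=$1; shift
    for kv in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        grep -qxF "$kv" "$conf" 2>/dev/null || echo "$kv" >> "$conf"
    done
}

# Hypothetical usage, as root:
#   persist_sysctl /etc/sysctl.conf machdep.forceukbd=1 ddb.panic=0
```

Running it twice with the same arguments leaves the file unchanged the
second time.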

> I will connect those Samsung PRO 860 1TB SSDs to a Xeon based system
> (another SATA-controller) and check there for AHCI errors.
> 
> Maybe it is worth mentioning that the original RAID tests on Debian buster
> with six 512GB Samsung PRO 860 (the same drives and a RAID6 set with
> mdadm) worked without crashing the OS.

I guess it will be a combination of the newer drives + the controller
+ something unhandled in the ahci driver.

If you're able (and ideally can arrange a serial console to capture the
output), then building a kernel with AHCI_DEBUG defined might give clues
(either add "option AHCI_DEBUG" to the kernel config, or just add a #define
in sys/dev/ic/ahci.c).
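
For the kernel config route, a small config file that includes GENERIC.MP
and adds the option is probably the least intrusive way. A sketch only:
the config name AHCIDEBUG.MP and the helper function are made up here,
and the build steps in the comments follow the usual custom-kernel
procedure as I remember it, so check them against the FAQ for your
release.

```shell
#!/bin/sh
# Sketch: write a kernel config that is GENERIC.MP plus AHCI_DEBUG.
# The file name AHCIDEBUG.MP is arbitrary.
write_debug_kernel_conf() {
    conf_dir=$1
    cat > "$conf_dir/AHCIDEBUG.MP" <<'EOF'
include "arch/amd64/conf/GENERIC.MP"
option AHCI_DEBUG
EOF
}

# Hypothetical usage, as root with the kernel sources installed:
#   write_debug_kernel_conf /sys/arch/amd64/conf
#   cd /sys/arch/amd64/conf && config AHCIDEBUG.MP
#   cd ../compile/AHCIDEBUG.MP && make
```

Keeping the change in a separate config file leaves GENERIC.MP untouched,
so going back to the stock kernel is straightforward.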
