Thanks a lot Stuart.

On 02.03.21 18:25, Stuart Henderson wrote:
[cc's trimmed back to bugs@]

On 2021/03/02 19:00, Mark Schneider wrote:
...
Thanks a lot for all hints Stuart.

The isolated 1TB SSD Samsung PRO 860 drives have some AHCI errors
(OpenBSD_6.9beta-RAID5-3x1TB-SSD-isolated.txt in the attachment).


Writing to an "isolated" drive does not crash OpenBSD even though there are
AHCI errors and sometimes an I/O error from dd (see directly below).
Thanks. So even if softraid were to not crash, things would not be
in good shape and probably all it could do is mark the component as
failed. Which would be better than a crash but suboptimal ;)

It looks like a "not so perfect" mix of a somewhat outdated ASUS mainboard and its BIOS, the newer Samsung PRO 860 512GB or 1TB SSD drives, and the handling of I/O errors in OpenBSD 6.8 or 6.9beta.

bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xdc312018 (61 entries)
bios0: vendor American Megatrends Inc. version "2201" date 03/23/2015
bios0: ASUSTeK COMPUTER INC. CROSSHAIR V FORMULA-Z
acpi0 at bios0: ACPI 5.0

The Samsung PRO 860 SSDs are new so I do not expect the problem there (as they are working on Linux).

I have taken two of those Samsung PRO 860 512GB SSD drives and connected them to another Xeon based Asus mainboard ("P8B WS, BIOS 0704 07/25/2011"), and no AHCI errors show up there. On that board I have tested isolated drives as well as plain RAID1 and encrypted RAID1 (nested, not using the "-c 1C" option of bioctl in OpenBSD 6.9beta), writing files of 1, 2, 10 or 20 GB to the RAID device, and there was no issue at all.
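
For anyone who wants to repeat this kind of write test, here is a minimal sketch. The helper name write_test and the mount point /mnt/raid in the usage line are made up for illustration and are not the exact commands I ran. It writes zeroes with dd and reports whether dd finished without an error.

```sh
#!/bin/sh
# Hypothetical helper (not from the original tests): write SIZE_MB MiB of
# zeroes to TARGET and report whether dd finished without an I/O error.
write_test() {
    target=$1
    size_mb=$2
    # bs is given in bytes because the "1m"/"1M" suffix differs between dd versions
    if dd if=/dev/zero of="$target" bs=1048576 count="$size_mb" 2>/dev/null; then
        echo "ok: wrote ${size_mb} MiB to ${target}"
    else
        echo "FAILED: dd returned an error writing ${target}"
        return 1
    fi
}
```

Usage would be something like "write_test /mnt/raid/test-1g.bin 1024" for a 1 GB file, assuming the RAID volume is mounted at /mnt/raid.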

The issue shows up on three Asus FX CPU based mainboards (1 x "SABERTOOTH 990FX R2.0" and 2 x "CROSSHAIR V FORMULA-Z"; all of them have the same SB950 chipset).

AMD® SB950 Chipset
Supports AMD® QUAD-GPU CrossFireX™ Technology
- 6 x SATA 6.0 Gb/s ports with RAID 0, 1, 5 and 10 support

ASUS_E7335_Sabertooth_990FX_R2_Manual.pdf
ASUS_E7710_Crosshair_5_Formula-Z_Manual.pdf


Some layers don't cope well with errors in layers below, especially if
they haven't been bumped into in development or they're relatively
uncommon, so it's not a complete surprise. I don't get the impression
many people are running softraid raid5 to have bumped into bugs before
you. (There's a fair chance that if you used softdep it would run into
problems too, that doesn't cope too well with lower layer failure
either..).

I/O errors writing to an additional RAID device should not lead to an OS crash anyway.

I mean it is a good opportunity to check the error handling as I/O errors can always happen.


# OpenBSD 6.9beta crashes after a dd command writing to the RAID5
softraid volume (sd4a), and the ddb{4}> prompt cannot be used
to run trace, ps or sh commands (the root console is dead).


"trace" and "sh reg" from ddb would give more clues.
I am not able to run the commands above as the root ddb{4} console is dead
(I can see only the last error message but I am not able to type in using
the keyboard)
Things to try for this:

"sysctl machdep.forceukbd=1" may allow the keyboard to work

"sysctl ddb.panic=0" I don't know if this will help as it isn't
technically a panic, but it may show you a stack trace (and then
try to write a kernel coredump to the swap partition which may
or may not work, then reboot). So with a bit of luck you might
be able to at least grab a screenshot.
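
A minimal sketch of how the two settings above could be made persistent across reboots, assuming the stock /etc/sysctl.conf mechanism is used. Whether forceukbd actually helps on this particular board is untested.

```
# /etc/sysctl.conf
machdep.forceukbd=1	# prefer the USB keyboard as console keyboard
ddb.panic=0		# on panic: print a trace, try to dump, reboot (no ddb prompt)
```

The same names can be set at runtime with sysctl(8), as in the quoted commands.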

I use a USB keyboard, and that seems to be the problem with the ddb{4}> prompt.

I will look for a PS/2 keyboard or a PS/2 to USB adapter so I can run the "trace" and "sh reg" commands after an OS crash.


I will connect those Samsung PRO 860 1TB SSDs to a Xeon based system
(another SATA-controller) and check there for AHCI errors.

Maybe it is worth mentioning that the original RAID tests on Debian buster
with six of the 512GB Samsung PRO 860 (the same drives and a RAID6 set with
mdadm) worked without crashing the OS.
I guess it will be a combination of the newer drives + the controller
+ something unhandled in the ahci driver.

If you're able (and ideally can arrange serial console to capture the
output) then building a kernel with AHCI_DEBUG defined might give clues
(either add "option AHCI_DEBUG" to kernel config, or just add a #define
in sys/dev/ic/ahci.c).
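
As a rough sketch of the kernel-config route, assuming amd64, a source tree under /usr/src that matches the running snapshot, and a made-up config name AHCIDEBUG, the custom config file could look like this:

```
# /sys/arch/amd64/conf/AHCIDEBUG   (hypothetical file name)
include "arch/amd64/conf/GENERIC.MP"
option	AHCI_DEBUG
#
# Build steps, run as root (the usual custom-kernel procedure):
#   cd /sys/arch/amd64/conf && config AHCIDEBUG
#   cd ../compile/AHCIDEBUG && make && make install
```

Including GENERIC.MP keeps everything else identical to the stock kernel, so any difference in behaviour should come from the debug option alone.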

What additional settings are required on a fresh OpenBSD 6.8 or 6.9beta installation to set up and activate a serial console? (As far as I can remember, there is a question about the serial console during the OS installation.)
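
From what I understand of boot(8), something like the following two lines in /etc/boot.conf should be enough to get the boot loader and kernel messages onto the first serial port. This is only a sketch: com0 and the speed of 115200 are assumptions (the default is 9600), and the other end of the cable has to use the same speed.

```
stty com0 115200
set tty com0
```

A login prompt on the serial line would additionally need tty00 enabled in /etc/ttys. As far as I know, the installer makes both changes if the console question is answered with yes.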

All tested ASUS mainboards have serial interfaces (although I still have to find the right cables with DB9 connectors).


