Hi Bart
That is not too hard to debug. There are two possible causes. The
first is a software bug; the fix for that is simple: update your box.
The second, and more probable, is that one of your disks is
misbehaving. Unfortunately I have had the pleasure of dealing with
such disks before.
To find any misbehaving disks, run `iostat -xn 1`, where 1 is the
measurement interval in seconds.
This will produce an output like so:
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    2.9   47.8   27.6  1038.5  0.2  0.0    3.8    0.3   0   1 rpool
    2.9   59.2   27.6  1038.5  0.0  0.0    0.2    0.1   1   0 c5t0d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  135.6    0.0  1067.6  0.2  0.0    1.6    0.1   1   1 rpool
    0.0  137.6    0.0  1067.6  0.0  0.0    0.0    0.0   0   1 c5t0d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c5t0d0
The important statistics are asvc_t and %w. Under load it can be
normal for them to reach 80 or even 100. A value over 100 means the
disk, or some piece of hardware in between, is delaying writes to the
disk. In that case ZFS will cache as much as it can to let the
hardware catch up, until it exhausts its resources. Some disks manage
to take the whole server offline because they simply time out all SCSI
commands instead of returning errors.
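To spot an ailing disk without eyeballing every interval, the iostat
output can be filtered with awk. This is just a sketch: the sample
lines and the c5t3d0 device name below are made up, and the 100 ms
threshold is the rule of thumb from above.

```shell
# Print devices whose asvc_t (column 8 of iostat -xn data lines)
# exceeds 100 ms. In real use, pipe `iostat -xn 1` into the awk part;
# here two hypothetical sample lines stand in for that output.
printf '%s\n' \
  '    2.9   47.8   27.6 1038.5  0.2  0.0    3.8    0.3   0   1 rpool' \
  '    0.1    9.2    0.4   12.8  0.0  4.0    0.0  440.5   0  99 c5t3d0' |
awk '$8 + 0 > 100 { print $11 " asvc_t=" $8 }'
```

With the sample input this prints only the slow disk,
`c5t3d0 asvc_t=440.5`; the `+ 0` coerces the field to a number so the
header lines never match.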
Once you have found the disk, offline it with `zpool offline` and
replace it. As soon as the disk has been taken offline your
performance should normalise. You might need to wait for a period of
load before the problem shows itself again, though.
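For reference, the offline-and-replace sequence looks roughly like
this. The pool name pool0 and the device names are placeholders; check
`zpool status` for your actual layout before running anything.

```shell
# Take the suspect disk out of service so ZFS stops queueing I/O to it
zpool offline pool0 c5t3d0

# After physically swapping the drive, resilver onto the replacement
zpool replace pool0 c5t3d0 c5t4d0

# Watch resilver progress and pool health
zpool status pool0
```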
Hope this helps
Greetings
Till
On 21.04.20 21:14, Bart Brashers via openindiana-discuss wrote:
> Hi everyone, I'm new to this listserv. I run Air Quality and Meteorological
> models on an HPC cluster, which is all CentOS except for one storage server
> running OpenIndiana (SunOS 5.11 oi_151a9 November 2013). I know, I know, it's
> a ridiculously old installation, please don't bug me about that.
>
> I would like to figure out what happened to that server this past weekend. My
> goal is to figure out if there's something I can do to avoid having the
> problem described below happen again.
>
> OpenIndiana is running on a Supermicro box, with a SAS attached JBOD, about
> 85 spinning disks in two ZFS pools, one of SAS disks and the other of SATA disks.
> Periodically, when load gets too high, it becomes unresponsive for 5 - 30
> minutes, but if we're patient enough it comes back. The load (as reported by
> /usr/bin/top) immediately after such an event is ~200, which rapidly falls
> back to a more normal range of ~0.5.
>
> Two days ago on Sunday evening, it went off into la-la land again, but after
> a few hours hadn't come back. The IPMI interface was also not responding, so
> I couldn't reboot it remotely. I went in to the office on Monday morning and
> shut down the server, then pulled the power cords for 20 seconds. The
> complete removal of power often helps in situations like this, I've found.
>
> The server then entered an endless loop, where it would try to boot, timeout
> about 6 times (taking ~5 minutes for each timeout) with the following
> message, then kernel panic and reboot.
>
> Warning: /pci@0,0/pci8086,3c04@2/pci100,3020@0 (mpt_sas10):
> Disconnected command timeout for Target 156
> ...repeat...
> panic[cpu0]/thread=ffffff01e80cbc40: I/O to pool 'pool0' appears to be hung.
>
> Great! This OS already has so many names for disks, here's another one: which
> disk is Target 156? Sometimes it was Target 75, sometimes it was Target 150.
> Or is that a SAS expander? I could not log in, it would never get that far
> before the kernel panic and reboot.
>
> I was able to boot into single-user mode (append -s to the grub line
> containing "kernel") and poked around until I found two disks that were
> reporting errors. fmdump -eV was useful, though so verbose it took a while to
> figure out what to read. The best/clearest method was echo | format, which is
> not a command I would have guessed based on decades of experience with Linux
> ;-). I pulled two bad disks, and rebooted... and it went back into the
> endless panic-reboot loop.
>
> I eventually found this page:
> https://docs.oracle.com/cd/E23824_01/html/821-1448/gbbwc.html#scrolltoc and
> followed this procedure:
>
>
> * When the booting gets to the grub stage, press e to edit
> * Scroll to the line containing "kernel" and press e again to edit
> * At the end of the line, add the text -m milestone=none and press enter
> * Press b to boot
> * Login as root
> * (The root filesystem [mounted at /] was already read-write, not
> read-only, for me)
> * Rename /etc/zfs/zpool.cache to something else
> * Reboot (svcadm milestone all didn't work for me)
> * Login as root
> * Type zpool import and verify that all pools were able to be imported
> * Type zpool import -a and suddenly everything was back to normal!
>
> (Yes, I typed that all out so someone searching could find the step-by-step
> recipe when it happens to them, the link above is not great for beginners.)
>
> Any suggestions on what to look for, and where to look (which logs) would be
> greatly appreciated. Suggestions about upgrading or migrating to new hardware
> are not necessary, I already know. It's all about money - and with the GDP
> outlook for 2020 due to COVID-19, it's looking like I'll have to keep this
> server limping along a while longer.
>
> Thanks,
>
> Bart
_______________________________________________
openindiana-discuss mailing list
[email protected]
https://openindiana.org/mailman/listinfo/openindiana-discuss