Hi Till,


Thanks for the response. Here's what I'm getting now, under relatively low load
for us. This listserv doesn't preserve a fixed-width font, or I'd try to make
the columns align a little better.



For pool0, asvc_t is at 3.5 and %w is at 86, and you can pretty clearly see
which raidz2 sets (8 disks each) are being used right now. But none of the
individual disks show asvc_t or %w anywhere close to 100.



# iostat -xn 1

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
3205.3  688.1 162097.0 22212.6 788.1 13.8  202.4    3.5  86  90 pool0
    0.0    0.1    0.3    1.3  0.0  0.0    7.5    3.8   0   0 pool1
    0.2   13.7    3.9  329.2  0.1  0.0    5.9    0.8   0   1 rpool
    0.1    8.0    2.1  164.6  0.0  0.0    0.0    0.6   0   0 c4t5000C500576C3D07d0
    0.0    0.5    0.1   17.9  0.0  0.0    0.0    0.4   0   0 c7t0d0
    0.0    0.5    0.1   17.8  0.0  0.0    0.0    0.3   0   0 c7t1d0
    0.0    0.0    0.2    0.0  0.0  0.0    0.0    0.1   0   0 c7t2d0
    0.0    0.0    0.2    0.0  0.0  0.0    0.0    0.1   0   0 c7t3d0
    0.1    8.0    2.0  164.6  0.0  0.0    0.0    0.6   0   0 c4t5000C500576A3D7Bd0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C50057373B23d0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C500573715E3d0
    5.9    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C500573737B3d0
    5.9    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C500572D3BA3d0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D5A23d0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.3   0   2 c4t5000C500572D5933d0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C500572D53B7d0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C50057376527d0
    6.0    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C500573729C7d0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3D37d0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D5F57d0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3967d0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.5   0   2 c4t5000C500572D3FB7d0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3EE7d0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.4   0   2 c4t5000C50057373597d0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C50057373A2Bd0
    6.0    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C50057371E3Bd0
    5.9    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C500572D433Bd0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.5   0   2 c4t5000C500572D535Bd0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D435Bd0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.3   0   2 c4t5000C500572D3DABd0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3BDBd0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.3   0   2 c4t5000C500572D54FBd0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.6   0   3 c4t5000C50057374C7Fd0
    5.8    1.4    6.3    2.6  0.0  0.0    0.0    5.4   0   2 c4t5000C5005737552Fd0
    6.0    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C50057374CAFd0
    5.9    2.5    5.8    3.9  0.0  0.0    0.0    5.1   0   2 c4t5000C500572D579Fd0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3B3Fd0
    6.2    4.2    6.2    7.1  0.0  0.0    0.0    4.4   0   2 c4t5000C500572D3C2Fd0
    6.4    4.3    7.0    7.3  0.0  0.0    0.0    4.3   0   2 c4t5000C500572D3CDFd0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062A7BCE3d0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062B3EFB3d0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062A7D487d0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062B279E7d0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062A6E74Bd0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    4.9   0   1 c4t5000C50062A7D78Bd0
    1.6    1.5    1.9    2.7  0.0  0.0    0.0    5.0   0   1 c4t5000C50062B3EFBFd0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5AE0F3d0
    7.2    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F57A403d0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5C8077d0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5929DBd0
    7.3    1.4   41.3    2.6  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5AC47Bd0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5B639Bd0
    7.3    1.4   41.3    2.7  0.0  0.1    0.0    7.8   0   3 c4t5000C5007F5FE40Bd0
  377.8    1.3 22175.0    2.5  0.0  1.4    0.0    3.6   0  63 c4t5000C5008518AC43d0
   23.6    1.5  373.1    2.8  0.0  0.1    0.0    5.5   0   7 c4t5000C5008518F793d0
   23.5    1.5  373.1    2.8  0.0  0.1    0.0    5.6   0   7 c4t5000C500851C0F43d0
   23.5    1.5  373.1    2.8  0.0  0.1    0.0    5.6   0   7 c4t5000C5008518D683d0
  377.8    1.3 22176.6    2.5  0.0  1.4    0.0    3.6   0  63 c4t5000C5008518A757d0
  379.7    1.3 22178.1    2.5  0.0  1.4    0.0    3.6   0  62 c4t5000C5008518F957d0
   23.6    1.5  373.2    2.8  0.0  0.1    0.0    5.5   0   7 c4t5000C5008518C007d0
   23.6    1.5  373.2    2.8  0.0  0.1    0.0    5.6   0   7 c4t5000C50085151C57d0
  379.3    1.3 22176.8    2.5  0.0  1.3    0.0    3.5   0  62 c4t5000C5008519177Bd0
  378.3    1.3 22177.1    2.5  0.0  1.3    0.0    3.5   0  62 c4t5000C5008515EE8Bd0
   23.6    1.5  373.3    2.8  0.0  0.1    0.0    5.5   0   7 c4t5000C5008519BD0Bd0
  378.5    1.3 22194.9    2.5  0.0  1.4    0.0    3.6   0  62 c4t5000C5008518AFFFd0
  378.0    1.3 22194.5    2.5  0.0  1.4    0.0    3.6   0  63 c4t5000C5008518BA9Fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t5000C50085151A1Fd0
   23.6    1.5  373.6    2.8  0.0  0.1    0.0    5.5   0   7 c4t5000C5008518D25Fd0
    5.0    3.7    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D4333d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t5000C500980D11EBd0
    5.0    3.7    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D0E53d0
    5.0    3.8    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D087Fd0
    5.0    3.7    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D443Fd0
    5.0    3.7    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D2913d0
    5.0    3.7    8.4    7.9  0.0  0.0    0.0    4.7   0   2 c4t5000C500980D0C4Fd0
    1.5    1.3    2.1    2.7  0.0  0.0    0.0    7.4   0   1 c4t5000C5008CF235EBd0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.5   0   0 c4t5000C500B1B3BE12d0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.7   0   0 c4t5000C500B1B5529Ad0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.6   0   0 c4t5000C500B1B6C4EBd0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.6   0   0 c4t5000C500B1B65CF2d0
    5.8    1.2    6.5    2.6  0.0  0.0    0.0    6.9   0   2 c4t5000C5008CF23EC1d0
    4.9    3.4    8.4    7.9  0.0  0.0    0.0    5.7   0   2 c4t5000C5008CF23896d0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.7   0   0 c4t5000C500B1B1BE97d0
    0.0    0.0    0.2    0.2  0.0  0.0    0.0    1.9   0   0 c4t5000C500B1B4DB26d0
    5.8    1.2    6.4    2.6  0.0  0.0    0.0    6.3   0   2 c4t5000C500B0E7E1A9d0
    5.0    3.8    8.4    7.9  0.0  0.0    0.0    4.9   0   2 c4t5000C5007E98965Fd0
    5.9    2.5    5.8    3.9  0.0  0.0    0.0    5.2   0   2 c4t5000C500B70E2A6Fd0
   18.3    1.6   25.2    2.8  0.0  0.1    0.0    6.1   0   6 c0t5000C5008518AEFAd0
    0.0  543.5    0.0 22610.6  0.0  0.5    0.0    0.9   0  48 c4t5000C500AF3E90DBd0
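
To map those disks back to their vdevs, zpool status pool0 lists the raidz2
groups and their member disks. A trimmed sketch of the layout (not actual
output from this box, and the device name below is just a placeholder):

  # zpool status pool0
    ...
    config:
          NAME                       STATE
          pool0                      ONLINE
            raidz2-0                 ONLINE
              c4t5000C500XXXXXXXXd0  ONLINE
              ...
            raidz2-1                 ONLINE
              ...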





-----Original Message-----
From: Till Wegmüller <[email protected]>
Sent: Tuesday, April 21, 2020 12:43 PM
To: [email protected]
Subject: Re: [OpenIndiana-discuss] Server crash diagnostics?



Hi Bart



That is not too hard to debug. There are two possible causes. The first is a
software bug; the solution to that is simple: update your box. The second, and
more probable, is that one of your disks is being nasty. Unfortunately I have
had the pleasure of dealing with such disks before.



To find any nasty disks, run `iostat -xn 1`, where 1 is the interval, in
seconds, at which the statistics are reported.



This will produce an output like so:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    2.9   47.8   27.6 1038.5  0.2  0.0    3.8    0.3   0   1 rpool
    2.9   59.2   27.6 1038.5  0.0  0.0    0.2    0.1   1   0 c5t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  135.6    0.0 1067.6  0.2  0.0    1.6    0.1   1   1 rpool
    0.0  137.6    0.0 1067.6  0.0  0.0    0.0    0.0   0   1 c5t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t0d0





The important statistics are asvc_t and %w. Under load it can be normal for
them to reach 80 or even 100, but a value over 100 means the disk, or some
piece of hardware in between, is delaying writes to the disk. In that case ZFS
will cache as much as it can to let the hardware catch up, until it exhausts
its resources. Some disks manage to take the whole server offline because they
simply time out all SCSI commands instead of returning errors.
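
One quick way to catch that in the live output (assuming the column layout
shown above, where asvc_t is field 8, %w is field 9 and the device name is
field 11) would be something like:

  iostat -xn 1 | awk '$11 != "device" && $8+0 > 100 {print $11, "asvc_t=" $8, "%w=" $9}'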



Once you have found the disk, offline it with zpool offline and replace it. As
soon as the disk has been taken offline, your performance should normalise. You
might need to wait for another period of load to confirm it, though.
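
For example (illustrative only; the device name below is a placeholder for
whichever disk iostat points at, and pool0 stands in for your pool name):

  zpool offline pool0 c4t5000C500XXXXXXXXd0
  # after physically swapping in a replacement disk in the same slot:
  zpool replace pool0 c4t5000C500XXXXXXXXd0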



Hope this helps

Greetings

Till



On 21.04.20 21:14, Bart Brashers via openindiana-discuss wrote:

> Hi everyone, I'm new to this listserv. I run Air Quality and Meteorological
> models on an HPC cluster, which is all CentOS except for one storage server
> running OpenIndiana (SunOS 5.11 oi_151a9 November 2013). I know, I know, it's
> a ridiculously old installation; please don't bug me about that.

>

> I would like to figure out what happened to that server this past weekend. My
> goal is to determine whether there's something I can do to avoid having the
> problem described below happen again.

>

> OpenIndiana is running on a Supermicro box with a SAS-attached JBOD: about 85
> spinning disks in two ZFS pools, one of SAS disks and the other of SATA disks.
> Periodically, when load gets too high, it becomes unresponsive for 5-30
> minutes, but if we're patient enough it comes back. The load (as reported by
> /usr/bin/top) immediately after such an event is ~200, which rapidly falls
> back to a more normal range of ~0.5.

>

> Two days ago, on Sunday evening, it went off into la-la land again, but after
> a few hours it hadn't come back. The IPMI interface was also not responding,
> so I couldn't reboot it remotely. I went in to the office on Monday morning
> and shut down the server, then pulled the power cords for 20 seconds. The
> complete removal of power often helps in situations like this, I've found.

>

> The server then entered an endless loop: it would try to boot, time out about
> 6 times (taking ~5 minutes for each timeout) with the following message, then
> kernel panic and reboot.

>

> Warning: /pci@0,0/pci8086,3c04@2/pci100,3020@0 (mpt_sas10):

>        Disconnected command timeout for Target 156 ...repeat...

> panic[cpu0]/thread=ffffff01e80cbc40: I/O to pool 'pool0' appears to be hung.

>

> Great! This OS already has so many names for disks, and here's another one:
> which disk is Target 156? Sometimes it was Target 75, sometimes Target 150.
> Or is that a SAS expander? I could not log in; it never got that far before
> the kernel panic and reboot.

>

> I was able to boot into single-user mode (append -s to the grub line
> containing "kernel") and poke around until I found two disks that were
> reporting errors. fmdump -eV was useful, though so verbose it took a while to
> figure out what to read. The best/clearest method was echo | format, which is
> not a command I would have guessed based on decades of experience with Linux
> ;-). I pulled the two bad disks and rebooted... and it went back into the
> endless panic-reboot loop.
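
> (For anyone retracing this later, the commands in question, roughly:
>
>     fmdump -e      # one line per logged error event, quicker to scan
>     fmdump -eV     # full detail once you know which events matter
>     echo | format  # lists all disks non-interactively; the clearest method here
> )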

>

> I eventually found this page:
> https://docs.oracle.com/cd/E23824_01/html/821-1448/gbbwc.html#scrolltoc
> and followed this procedure:

>

>

>   *   When the booting gets to the grub stage, press e to edit

>   *   Scroll to the line containing "kernel" and press e again to edit

>   *   At the end of the line, add the text -m milestone=none and press enter

>   *   Press b to boot

>   *   Login as root

>   *   (The root filesystem [mounted at /] was already read-write, not 
> read-only, for me)

>   *   Rename /etc/zfs/zpool.cache to something else

>   *   Reboot (svcadm milestone all didn't work for me)

>   *   Login as root

>   *   Type zpool import and verify that all pools were able to be imported

>   *   Type zpool import -a and suddenly everything was back to normal!

>

> (Yes, I typed that all out so someone searching can find the step-by-step
> recipe when it happens to them; the link above is not great for beginners. A
> condensed transcript of the commands is below.)
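
> Roughly, as shell commands (run as root after booting with -m milestone=none;
> the .bad suffix is just an example name for the renamed cache file):
>
>     mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad
>     reboot
>     # after the reboot, log in as root again:
>     zpool import       # verify that every pool can be imported
>     zpool import -a    # import them all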

>

> Any suggestions on what to look for, and where to look (which logs), would be
> greatly appreciated. Suggestions about upgrading or migrating to new hardware
> are not necessary; I already know. It's all about money, and with the GDP
> outlook for 2020 due to COVID-19, it's looking like I'll have to keep this
> server limping along a while longer.

>

> Thanks,

>

> Bart




_______________________________________________
openindiana-discuss mailing list
[email protected]
https://openindiana.org/mailman/listinfo/openindiana-discuss
