Thanks for the pointers below. In the end, it appears (so far) to be a
failing DIMM, as the next failure left the system unable to boot until
that DIMM was removed. The failures are either directly related, or an
earlier power reset caused the DIMM failure and the initial cause has
yet to be isolated. I'll look further into the recommendations below.

On 1/22/07, Mahesh Siddheshwar <siddheshwar.mahesh at sun.com> wrote:
> Joe Little wrote:
> > On 12/21/06, Mahesh Siddheshwar <siddheshwar.mahesh at sun.com> wrote:
> >> Sergey wrote:
> >> > The problem with the server's freezes seems to be related to very
> >> > intensive use of the server as an NFS server.
> >> >
> >> > "cs" (context switching) is really high.
> >> > I've been running vmstat/iostat/nfsstat/prstat from cron. Below are
> >> > the last working seconds before the server dies:
> >> >
> >> Sergey, can you elaborate a bit more on what you mean by "server
> >> dies"? Does nfsd die or dump core? Does the system panic or hang?
> >> If the system panics or hangs, can you provide the panic backtrace,
> >> the kernel threadlist, or a location for the crash dump? If it is
> >> nfsd that dies, can you provide the pstack output from the core
> >> file? Also, what build are you running?
> >>
> >> Thanks,
> >> Mahesh
> >
> > We ourselves have finally seen higher NFS load against a B53-based
> > OpenSolaris server, with NFS as the network protocol and ZFS
> > partitions. The same symptoms have appeared in the last three or
> > four days: nfsd takes more and more CPU, and over time the system
> > goes comatose, with no local response at the console, no network
> > traffic, and of course nothing in the crash dumps, logs, etc. We've
> > been at a loss to track it down, even though I've been running
> > "::kmastat" against mdb regularly.
> >
> > Are there any further specific things (dtrace scripts, commands)
> > anyone suggests running periodically to capture this? Is there
> > anything in B54/B55 that already addresses these problems?
> Is it possible to load kmdb (either at boot time, or later with
> mdb -K) and take a crash dump when the system becomes unresponsive?
> You could also enable the deadman by adding this to /etc/system:
>   set snooping=1
> This makes the system panic if the clock does not move for 50 seconds.
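
For reference, the deadman lines in /etc/system would look something
like the sketch below. The snoop_interval tunable and its default are
my assumption, not from the thread; check the tunable parameters
reference for your build before relying on it.

```
* Enable the kernel deadman: panic if the clock stops advancing.
set snooping=1
* Assumed tunable: deadman timeout in microseconds (default ~50s).
* The value here is purely illustrative.
set snoop_interval=50000000
```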
>
> Another thing you can do is capture the kernel thread list in the
> background at regular intervals and see what the system was up to
> before the hang. You can obtain the threadlist with something like:
>  # echo "::threadlist -v" | mdb -k > thrlist.1
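
The periodic-threadlist suggestion above can be scripted; here is a
minimal sketch. The output directory, interval, and iteration count are
illustrative, and the mdb pipeline is stubbed with an echo (shown in a
comment) so the skeleton can be dry-run on a non-Solaris box.

```shell
#!/bin/sh
# Snapshot the kernel thread list at intervals so the most recent
# files show what the system was doing just before a hang.
OUTDIR=$(mktemp -d)          # illustrative; use a persistent dir in practice
i=0
while [ $i -lt 3 ]; do       # in practice: loop forever with a sleep 60
    i=$((i + 1))
    stamp=$(date +%Y%m%d-%H%M%S)
    # On a live Solaris system this line would be:
    #   echo "::threadlist -v" | mdb -k > "$OUTDIR/thrlist.$stamp.$i"
    echo "threadlist snapshot $i" > "$OUTDIR/thrlist.$stamp.$i"
done
ls "$OUTDIR" | wc -l         # number of snapshots captured
```

After a hang and reboot, the last couple of files in the directory are
the ones worth reading.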
>
> While you may not capture the final hang this way, the data could
> give you some leads.
>
> Please note that the first two suggestions are destructive, in the
> sense that they will panic the system, so please use them
> appropriately.
>
> Thanks,
> Mahesh
>
>
> >> >  r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s5   in   sy cs us sy id
> >> >  0 0 0 8067092 124672 0   1  0  0  0  0  0  0  0  0  0 61110 125 93706 0 13 87
> >> >  0 0 0 8067092 122856 0   1  0  0  0  0  0  0  0  0  0 63327 114 97495 0 14 86
> >> >  0 0 0 8067092 120324 0   0  0  0  0  0  0  0  0  0  8 65604 101 105058 0 13 87
> >> >  0 0 0 8067092 118240 0   0  0  0  0  0  0  0  0  0  0 69089 210 109246 0 14 86
> >> >  0 0 0 8067092 116540 0   0  3  0  0  0  0  0  0  0  2 70143 268 110813 0 15 85
> >> >  0 0 0 8067092 115416 0   0  0  0  0  0  0  0  0  0  0 73568 111 118202 0 15 85
> >> >  0 0 0 8067092 113844 0   0  0  0  0  0  0  0  0  0  1 68581 154 108782 0 15 85
> >> >  0 0 0 8067092 114832 0   0  0  0  0  0  0  0  0  0  3 70752 101 111744 0 15 85
> >> >  0 0 0 8067092 112988 0   0  0  0  0  0  0  0  0  0  0 71081 103 113991 0 14 86
> >> >  0 0 0 8067092 109192 0   0  0  0  0  0  0  0  0  0  0 70043 193 113730 0 14 86
> >> >  0 0 0 8067092 104920 0   0  0  0  0  0  0  0  0  0  1 68339  99 110622 0 14 86
> >> >  0 0 0 8067092 101464 0   0  0  0  0  0  0  0  0  0  0 59992 106 96626 0 12 88
> >> >  0 0 0 8067092 99780  0   0  0  0  0  0  0  0  0  0  0 54436 134 91421 0  9 91
> >> >  0 0 0 8067092 96828  0   0  0  0  0  0  0  0  0  0  0 50614 107 84133 0  9 91
> >> >  0 0 0 8067092 93992  0   0  0  0  0  0  0  0  0  0  1 50013 122 86355 0  8 92
> >> >  0 0 0 8067092 90984  0   0  0  0  0  0  0  0  0  0  0 53978 204 90745 0  9 91
> >> >  0 0 0 8067092 87984  0   0  0  0  0  0  0  0  0  0  1 55482 103 92300 0 10 90
> >> >  0 0 0 8067092 85820  0   0  0  0  0  0  0  0  0  0  0 58932 109 98907 0 10 90
> >> >  0 0 0 8067092 83200  0   0  0  0  0  0  0  0  0  0  0 53442 117 87132 0 10 90
> >> >  kthr      memory            page            disk          faults      cpu
> >> >  r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s5   in   sy cs us sy id
> >> >  0 0 0 8067092 80020  0   0  0  0  0  0  0  0  0  0  1 60984 121 100219 0 11 89
> >> >  0 0 0 8067092 76184 40 189 14 12 12  0  0  0  0  0  4 59148 1312 100028 0 11 89
> >> >  0 0 0 8067092 73168  0   0  0  0  0  0  0  0  0  0  0 61049 166 101192 0 11 89
> >> >  0 0 0 8067092 69388  0   0  0  0  0  0  0  0  0  0  1 60986 182 102213 0 11 89
> >> >  0 0 0 8067092 65452  0   0  0 1156 1931 0 17390 0 0 0 48 56148 100 90676 0 11 89
> >> >  0 0 0 8067092 71096 40 130  3  1  0  0  0  0  0  0  1 56982 360 92811 0 10 90
> >> >  0 0 0 8067092 69436  0   1  0  0  0  0  0  0  0  0  0 51107 113 84473 0  9 91
> >> >  0 0 0 8067092 65936  0   0  0 11 18  0  4  0  0  0  8 54559 122 91551 0 10 90
> >> >  0 0 0 8067092 64412  0   0  0 452 552 0 309 0 0  0  9 48013  99 80081 0  8 92
> >> >  0 0 0 8067092 64600  0   2  0 191 193 0 50 0  0  0  4 52709 231 86733 0  9 91
> >> >  0 0 0 8067092 65972  0   1  0 573 578 0 255 0 0  0 11 61670 112 102799 0 10 90
> >> >  0 0 0 8067092 72296  0   1  0 56 56  0  0  0  0  0  1 51431 120 87487 0  8 92
> >> >  0 0 0 8067092 65292  1   1  1 1385 1781 0 2625 0 0 0 33 68007 106 113257 0 12 88
> >> >  0 0 0 8067092 74700  0   0  0  0  0  0  0  0  0  0  0 69690  97 113794 0 12 88
> >> >  0 0 0 8067092 71768  0   0  0  0  0  0  0  0  0  0  0 65226 106 110528 0 11 89
> >> >  0 0 0 8067092 66588  0   1  0 102 235 0 240 0 0  0  3 70976 178 120869 0 13 87
> >> >  0 0 0 8067092 65592  0   1  0 149 322 0 278 0 0  0  4 66895 110 113694 0 12 88
> >> >  0 0 0 8067092 63652  0   1  0 624 946 0 4471 0 0 0 18 71986 110 121720 0 12 88
> >> >  0 0 0 8067092 64884  0   1  2 45 45  0  0  0  0  0  2 72114 121 122250 0 12 88
> >> >  kthr      memory            page            disk          faults      cpu
> >> >  r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s5   in   sy cs us sy id
> >> >  0 0 0 8067092 63968  0   0  0 456 673 0 4396 0 0 0 15 72528 101 122326 0 13 87
> >> >  0 0 0 8067092 63756  0   0  0 528 697 0 3715 0 0 0 15 71458 105 119033 0 12 88
> >> >  0 0 0 8067092 64296  0   1  1 627 862 0 947 0 0  0 13 60113 285 99319 0 11 89
> >> >  0 0 0 8067092 63772  0   0  1 1746 2383 0 11021 0 0 0 42 73214 108 116913 0 14 86
> >> >  0 0 0 8067092 67636  0   0  0 461 519 0 5003 0 0 0 12 58883 104 94816 0 11 89
> >> >  0 0 0 8067092 64892  0   0  0 564 662 0 374 0 0  0 17 71484 103 120414 0 12 88
> >> >  0 0 0 8067092 64464  0   0  0 2570 2800 0 28562 0 0 0 162 64395 102 106058 0 12 88
> >> >  0 0 0 8067092 64240  7   9  1 2786 2874 0 35810 0 0 0 240 73767 161 120268 0 14 86
> >> >  0 0 0 8067092 64520  0   1  1 3159 3684 0 12418 0 0 0 124 73128 178 120327 0 14 86
> >> >  0 0 0 8067092 64084  0   0  0 1001 1219 0 8212 0 0 0 51 72491 104 120140 0 14 86
> >> >  0 0 0 8067092 64312  1   2  0 2082 2454 0 20176 0 0 0 61 64743 109 104314 0 13 87
> >> >  0 0 0 8067092 63744 52 145 59 1540 1637 0 64272 0 0 0 61 60956 350 99672 0 12 88
> >> >  0 0 0 8067092 64060 13  16 13 2365 2534 0 20110 0 0 0 58 45279 108 72163 0 9 91
> >> >  0 0 0 8067092 63760  8  13 19 2541 2749 0 97689 0 0 0 57 49170 102 77992 0 11 89
> >> >  0 0 0 8067092 62420 25  37 46 2522 2828 0 199624 0 0 0 60 64358 208 105817 0 14 86
> >> >  0 0 0 8067092 58672 107 164 258 2792 3268 0 510515 0 0 0 104 64512 128 105422 0 15 85
> >> >  0 0 0 8066944 47492 135 358 783 720 1228 0 1194256 0 0 0 136 65695 518 108041 0 19 81
> >> >  0 0 0 8067092 33084 369 623 1038 1210 2543 0 2351516 0 0 0 275 63952 115 108074 0 23 77
> >> >  0 0 76 8067092 17212 23 155 543 316 603 48 5757690 0 0 0 152 14712 33 23307 0 24 76
> >> >  kthr      memory            page            disk          faults      cpu
> >> >  r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s5   in   sy cs us sy id
> >> >  0 0 87 8067088 16056 60 156 487 381 623 48 4998992 0 0 0 145 8506 62 13701 0 20 80
> >> >  0 0 83 8067100 15428 30 94 254 148 285 0 4150075 0 0 0 82 2014 14 2636 0 18 82
> >> >  0 0 76 8333772 16160 65 178 444 441 693 0 5444911 0 0 0 161 2059 15 2382 0 20 80
> >> >  0 0 76 7808952 15260 73 193 455 464 762 0 5289731 0 0 0 134 1349 12 1626 0 21 79
> >> >  0 0 75 8279396 16248 104 336 955 682 1272 0 8309621 0 0 0 326 1580 16 1575 0 27 73
> >> >  0 0 76 7891728 15488 114 350 962 867 1460 0 6556432 0 0 0 240 1413 20 1579 0 25 75
> >> >  0 0 73 8066816 14196 9  18 33 38 67  0 4454583 0 0 0 10 474   7 248  0 16 84
> >> >  0 0 103 8055412 15804 264 525 917 1229 1899 0 5356581 0 0 0 265 3949 288 5345 0 32 68
> >> >
> >> >
> >> >
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >   239.5   29.8 1277.6  866.6  0.0  2.3    0.0    8.7   0  37 c0t2d0
> >> >    72.6    0.2  609.4    1.9  0.0  0.7    0.0    9.2   0  21 c0t2d0s0
> >> >   166.4   27.5  665.4  852.9  0.0  1.5    0.0    7.6   0  35 c0t2d0s1
> >> >     0.5    2.1    2.8   11.8  0.0  0.2    0.0   69.6   0  11 c0t2d0s3
> >> >    70.0   44.7 4343.0  643.1  0.0  0.8    0.0    6.7   0  43 c5t600039300001742Bd0
> >> >    70.0   44.7 4343.0  643.1  0.0  0.8    0.0    6.7   0  43 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >   127.8   28.4  582.1  370.0  0.0  1.5    0.0    9.5   0  23 c0t2d0
> >> >    38.9    0.0  221.4    0.2  0.0  0.4    0.0   11.1   0  14 c0t2d0s0
> >> >    87.5   24.8  350.0  356.8  0.0  0.8    0.0    7.5   0  16 c0t2d0s1
> >> >     1.5    3.5   10.7   13.0  0.0  0.2    0.0   40.0   0   6 c0t2d0s3
> >> >    24.9   11.5 1546.2  226.5  0.0  0.4    0.0   10.7   0  16 c5t600039300001742Bd0
> >> >    24.9   11.5 1546.2  226.5  0.0  0.4    0.0   10.7   0  16 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >   239.5   29.8 1277.6  866.6  0.0  2.3    0.0    8.7   0  37 c0t2d0
> >> >    72.6    0.2  609.4    1.9  0.0  0.7    0.0    9.2   0  21 c0t2d0s0
> >> >   166.4   27.5  665.4  852.9  0.0  1.5    0.0    7.6   0  35 c0t2d0s1
> >> >     0.5    2.1    2.8   11.8  0.0  0.2    0.0   69.6   0  11 c0t2d0s3
> >> >    70.0   44.7 4343.0  643.1  0.0  0.8    0.0    6.7   0  43 c5t600039300001742Bd0
> >> >    70.0   44.7 4343.0  643.1  0.0  0.8    0.0    6.7   0  43 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >    65.2   32.7  299.0  302.7  0.0  1.3    0.0   13.4   0  17 c0t2d0
> >> >    10.3    0.0   54.8    0.1  0.0  0.2    0.0   22.0   0   7 c0t2d0s0
> >> >    48.8   16.5  195.3  241.9  0.0  0.6    0.0    8.6   0  11 c0t2d0s1
> >> >     6.1   16.2   49.0   60.7  0.0  0.5    0.0   23.1   0   5 c0t2d0s3
> >> >     0.1    3.7    9.2   13.5  0.0  0.0    0.0    0.7   0   0 c4t6000393000017312d0
> >> >     0.1    3.7    9.2   13.5  0.0  0.0    0.0    0.7   0   0 c4t6000393000017312d0s0
> >> >     8.4   13.5  563.8   85.7  0.0  0.1    0.0    5.1   0   4 c5t600039300001742Bd0
> >> >     8.4   13.5  563.8   85.7  0.0  0.1    0.0    5.1   0   4 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >    81.1   19.0  352.3  310.5  0.0  0.9    0.0    9.4   0  14 c0t2d0
> >> >    19.8    0.5  107.0    0.6  0.0  0.3    0.0   13.6   0   8 c0t2d0s0
> >> >    61.0   18.2  244.1  308.8  0.0  0.6    0.0    8.2   0  10 c0t2d0s1
> >> >     0.2    0.3    1.2    1.1  0.0  0.0    0.0   29.9   0   1 c0t2d0s3
> >> >     2.9   11.7  167.1  102.7  0.0  0.0    0.0    2.4   0   2 c5t600039300001742Bd0
> >> >     2.9   11.7  167.1  102.7  0.0  0.0    0.0    2.4   0   2 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >   135.2   32.0  577.6  501.8  0.0  1.4    0.0    8.1   0  23 c0t2d0
> >> >    37.8    0.4  187.7    0.4  0.0  0.4    0.0   11.2   0  15 c0t2d0s0
> >> >    97.0   31.2  387.9  499.8  0.0  0.9    0.0    7.1   0  17 c0t2d0s1
> >> >     0.4    0.4    2.0    1.6  0.0  0.0    0.0   24.6   0   2 c0t2d0s3
> >> >     0.0    0.7    1.1    2.7  0.0  0.0    0.0    0.4   0   0 c4t6000393000017312d0
> >> >     0.0    0.7    1.1    2.7  0.0  0.0    0.0    0.4   0   0 c4t6000393000017312d0s0
> >> >     1.1    3.7   53.7   26.6  0.0  0.0    0.0    1.9   0   1 c5t600039300001742Bd0
> >> >     1.1    3.7   53.7   26.6  0.0  0.0    0.0    1.9   0   1 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >   160.6   41.1  690.0  612.3  0.0  1.1    0.0    5.5   0  25 c0t2d0
> >> >    43.1    0.0  219.6    0.0  0.0  0.3    0.0    7.5   0  15 c0t2d0s0
> >> >   117.3   40.9  469.1  612.2  0.0  0.8    0.0    4.9   0  18 c0t2d0s1
> >> >     0.2    0.2    1.4    0.6  0.0  0.0    0.0   19.0   0   0 c0t2d0s3
> >> >     0.1    0.0    5.6    0.0  0.0  0.0    0.0    4.5   0   0 c5t600039300001742Bd0
> >> >     0.1    0.0    5.6    0.0  0.0  0.0    0.0    4.6   0   0 c5t600039300001742Bd0s0
> >> >                     extended device statistics
> >> >     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >> >     7.4    2.6   30.7   35.8  0.0  0.1    0.0    6.5   0   1 c0t2d0
> >> >     1.8    0.0    8.4    0.0  0.0  0.0    0.0    8.8   0   1 c0t2d0s0
> >> >     5.6    2.5   22.3   35.7  0.0  0.0    0.0    5.9   0   1 c0t2d0s1
> >> >     0.0    0.0    0.0    0.1  0.0  0.0    0.0   38.5   0   0 c0t2d0s3
> >> >     0.0    0.0    0.1    0.1  0.0  0.0    0.0    0.4   0   0 c4t6000393000017312d0
> >> >     0.0    0.0    0.1    0.1  0.0  0.0    0.0    0.4   0   0 c4t6000393000017312d0s0
> >> >     0.0    0.2    1.4    2.3  0.0  0.0    0.0    1.7   0   0 c5t600039300001742Bd0
> >> >     0.0    0.2    1.4    2.3  0.0  0.0    0.0    1.7   0   0 c5t600039300001742Bd0s0
> >> >
> >> >
> >> >
> >> > Please give me some ideas on how to bring the "cs" (context
> >> > switch) rate down. The hardware is 2 x dual-core Opteron:
> >> >
> >> > $ psrinfo -v
> >> > Status of virtual processor 0 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:16.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 1 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:19.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 2 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:21.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 3 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:23.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> >
> >> >
> >> > This message posted from opensolaris.org
> >> > _______________________________________________
> >> > nfs-discuss mailing list
> >> > nfs-discuss at opensolaris.org
> >> >
> >>
>
>
