Thanks for the pointers below. In the end, it appears (so far) to be a failing DIMM: the next failure left the system unable to boot until that DIMM was removed. Either the two problems are directly related, or an earlier power reset caused the DIMM failure and the original cause has yet to be isolated. I'll look further into the recommendations below.
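(In case anyone else is chasing a suspect DIMM, the FMA tools are worth a look for confirming it from the OS side. This assumes fmd is running, which it is by default on these builds:

  # fmadm faulty     <- resources FMA has diagnosed as faulty
  # fmdump           <- summary of fault diagnoses to date
  # fmdump -eV       <- raw error telemetry; look for memory/ECC ereports
)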
On 1/22/07, Mahesh Siddheshwar <siddheshwar.mahesh at sun.com> wrote:
> Joe Little wrote:
> > On 12/21/06, Mahesh Siddheshwar <siddheshwar.mahesh at sun.com> wrote:
> >> Sergey wrote:
> >> > The problem with the server's freezes seems to be related to very
> >> > intensive usage of the server as an NFS server.
> >> >
> >> > "cs" (context switching) is really high.
> >> > I've been running vmstat/iostat/nfsstat/prstat from cron. Below are
> >> > the last working seconds before the server dies:
> >> >
> >> Sergey, can you elaborate a bit more on what you mean by "server
> >> dies"? Does nfsd die or dump core? Does the system panic or hang? If
> >> the system panics or hangs, can you provide the panic backtrace, a
> >> kernel threadlist, or a location for the crash dump? If it is nfsd
> >> that dies, can you provide the pstack output from the core file?
> >> Also, which build are you running?
> >>
> >> Thanks,
> >> Mahesh
> >
> > We ourselves have finally seen higher NFS load against a B53-based
> > OpenSolaris server with NFS as the network protocol and ZFS
> > partitions. The same symptoms have appeared over the last three or
> > four days: nfsd takes more CPU, and over time the system goes
> > comatose, with no local response at the console, no network traffic,
> > and of course nothing in the crash dumps, logs, etc. We've been at a
> > loss to track it down, and I've been running regular "::kmastat"
> > against mdb. Nothing in the crash dumps.
> >
> > Are there any further specific things (dtrace, commands) you would
> > suggest running periodically to capture this? Is there anything in
> > B54/B55 that already addresses these problems?
> Is it possible to load kmdb (either at boot time, or later with mdb -K)
> and take a crash dump when the system becomes unresponsive? You could
> also enable the deadman timer by adding this to /etc/system:
>
>   set snooping=1
>
> This makes the system panic if the clock does not move for 50 seconds.
>
> Another thing you can do is keep taking kernel thread lists in the
> background at regular intervals, and see what the system was up to
> before the hang. You can obtain the threadlist with something like:
>
>   # echo "::threadlist -v" | mdb -k > thrlist.1
>
> While you may not capture the final hang scenario this way, the data
> could give you some leads.
>
> Please note that the first two suggestions are destructive, in the
> sense that they will panic the system, so please use them
> appropriately.
>
> Thanks,
> Mahesh
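[In case it is useful to others on the list: below is the snapshot loop
I have started running, based on Mahesh's threadlist suggestion above.
It is only a minimal sketch; the /var/tmp location, the 60-second
interval, and keeping 10 snapshots are my own arbitrary choices.]

  #!/bin/sh
  # Snapshot the kernel threadlist every 60 seconds, so that when the
  # box wedges we have recent state on disk to compare against.
  while :; do
          stamp=`date '+%Y%m%d%H%M%S'`
          echo "::threadlist -v" | mdb -k > /var/tmp/thrlist.$stamp
          # keep only the 10 most recent snapshots
          ls -t /var/tmp/thrlist.* | tail +11 | xargs rm -f 2>/dev/null
          sleep 60
  done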
> >> > r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
> >> > 0 0 0 8067092 124672 0 1 0 0 0 0 0 0 0 0 0 61110 125 93706 0 13 87
> >> > 0 0 0 8067092 122856 0 1 0 0 0 0 0 0 0 0 0 63327 114 97495 0 14 86
> >> > 0 0 0 8067092 120324 0 0 0 0 0 0 0 0 0 0 8 65604 101 105058 0 13 87
> >> > 0 0 0 8067092 118240 0 0 0 0 0 0 0 0 0 0 0 69089 210 109246 0 14 86
> >> > 0 0 0 8067092 116540 0 0 3 0 0 0 0 0 0 0 2 70143 268 110813 0 15 85
> >> > 0 0 0 8067092 115416 0 0 0 0 0 0 0 0 0 0 0 73568 111 118202 0 15 85
> >> > 0 0 0 8067092 113844 0 0 0 0 0 0 0 0 0 0 1 68581 154 108782 0 15 85
> >> > 0 0 0 8067092 114832 0 0 0 0 0 0 0 0 0 0 3 70752 101 111744 0 15 85
> >> > 0 0 0 8067092 112988 0 0 0 0 0 0 0 0 0 0 0 71081 103 113991 0 14 86
> >> > 0 0 0 8067092 109192 0 0 0 0 0 0 0 0 0 0 0 70043 193 113730 0 14 86
> >> > 0 0 0 8067092 104920 0 0 0 0 0 0 0 0 0 0 1 68339 99 110622 0 14 86
> >> > 0 0 0 8067092 101464 0 0 0 0 0 0 0 0 0 0 0 59992 106 96626 0 12 88
> >> > 0 0 0 8067092 99780 0 0 0 0 0 0 0 0 0 0 0 54436 134 91421 0 9 91
> >> > 0 0 0 8067092 96828 0 0 0 0 0 0 0 0 0 0 0 50614 107 84133 0 9 91
> >> > 0 0 0 8067092 93992 0 0 0 0 0 0 0 0 0 0 1 50013 122 86355 0 8 92
> >> > 0 0 0 8067092 90984 0 0 0 0 0 0 0 0 0 0 0 53978 204 90745 0 9 91
> >> > 0 0 0 8067092 87984 0 0 0 0 0 0 0 0 0 0 1 55482 103 92300 0 10 90
> >> > 0 0 0 8067092 85820 0 0 0 0 0 0 0 0 0 0 0 58932 109 98907 0 10 90
> >> > 0 0 0 8067092 83200 0 0 0 0 0 0 0 0 0 0 0 53442 117 87132 0 10 90
> >> > kthr memory page disk faults cpu
> >> > r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
> >> > 0 0 0 8067092 80020 0 0 0 0 0 0 0 0 0 0 1 60984 121 100219 0 11 89
> >> > 0 0 0 8067092 76184 40 189 14 12 12 0 0 0 0 0 4 59148 1312 100028 0 11 89
> >> > 0 0 0 8067092 73168 0 0 0 0 0 0 0 0 0 0 0 61049 166 101192 0 11 89
> >> > 0 0 0 8067092 69388 0 0 0 0 0 0 0 0 0 0 1 60986 182 102213 0 11 89
> >> > 0 0 0 8067092 65452 0 0 0 1156 1931 0 17390 0 0 0 48 56148 100 90676 0 11 89
> >> > 0 0 0 8067092 71096 40 130 3 1 0 0 0 0 0 0 1 56982 360 92811 0 10 90
> >> > 0 0 0 8067092 69436 0 1 0 0 0 0 0 0 0 0 0 51107 113 84473 0 9 91
> >> > 0 0 0 8067092 65936 0 0 0 11 18 0 4 0 0 0 8 54559 122 91551 0 10 90
> >> > 0 0 0 8067092 64412 0 0 0 452 552 0 309 0 0 0 9 48013 99 80081 0 8 92
> >> > 0 0 0 8067092 64600 0 2 0 191 193 0 50 0 0 0 4 52709 231 86733 0 9 91
> >> > 0 0 0 8067092 65972 0 1 0 573 578 0 255 0 0 0 11 61670 112 102799 0 10 90
> >> > 0 0 0 8067092 72296 0 1 0 56 56 0 0 0 0 0 1 51431 120 87487 0 8 92
> >> > 0 0 0 8067092 65292 1 1 1 1385 1781 0 2625 0 0 0 33 68007 106 113257 0 12 88
> >> > 0 0 0 8067092 74700 0 0 0 0 0 0 0 0 0 0 0 69690 97 113794 0 12 88
> >> > 0 0 0 8067092 71768 0 0 0 0 0 0 0 0 0 0 0 65226 106 110528 0 11 89
> >> > 0 0 0 8067092 66588 0 1 0 102 235 0 240 0 0 0 3 70976 178 120869 0 13 87
> >> > 0 0 0 8067092 65592 0 1 0 149 322 0 278 0 0 0 4 66895 110 113694 0 12 88
> >> > 0 0 0 8067092 63652 0 1 0 624 946 0 4471 0 0 0 18 71986 110 121720 0 12 88
> >> > 0 0 0 8067092 64884 0 1 2 45 45 0 0 0 0 0 2 72114 121 122250 0 12 88
> >> > kthr memory page disk faults cpu
> >> > r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
> >> > 0 0 0 8067092 63968 0 0 0 456 673 0 4396 0 0 0 15 72528 101 122326 0 13 87
> >> > 0 0 0 8067092 63756 0 0 0 528 697 0 3715 0 0 0 15 71458 105 119033 0 12 88
> >> > 0 0 0 8067092 64296 0 1 1 627 862 0 947 0 0 0 13 60113 285 99319 0 11 89
> >> > 0 0 0 8067092 63772 0 0 1 1746 2383 0 11021 0 0 0 42 73214 108 116913 0 14 86
> >> > 0 0 0 8067092 67636 0 0 0 461 519 0 5003 0 0 0 12 58883 104 94816 0 11 89
> >> > 0 0 0 8067092 64892 0 0 0 564 662 0 374 0 0 0 17 71484 103 120414 0 12 88
> >> > 0 0 0 8067092 64464 0 0 0 2570 2800 0 28562 0 0 0 162 64395 102 106058 0 12 88
> >> > 0 0 0 8067092 64240 7 9 1 2786 2874 0 35810 0 0 0 240 73767 161 120268 0 14 86
> >> > 0 0 0 8067092 64520 0 1 1 3159 3684 0 12418 0 0 0 124 73128 178 120327 0 14 86
> >> > 0 0 0 8067092 64084 0 0 0 1001 1219 0 8212 0 0 0 51 72491 104 120140 0 14 86
> >> > 0 0 0 8067092 64312 1 2 0 2082 2454 0 20176 0 0 0 61 64743 109 104314 0 13 87
> >> > 0 0 0 8067092 63744 52 145 59 1540 1637 0 64272 0 0 0 61 60956 350 99672 0 12 88
> >> > 0 0 0 8067092 64060 13 16 13 2365 2534 0 20110 0 0 0 58 45279 108 72163 0 9 91
> >> > 0 0 0 8067092 63760 8 13 19 2541 2749 0 97689 0 0 0 57 49170 102 77992 0 11 89
> >> > 0 0 0 8067092 62420 25 37 46 2522 2828 0 199624 0 0 0 60 64358 208 105817 0 14 86
> >> > 0 0 0 8067092 58672 107 164 258 2792 3268 0 510515 0 0 0 104 64512 128 105422 0 15 85
> >> > 0 0 0 8066944 47492 135 358 783 720 1228 0 1194256 0 0 0 136 65695 518 108041 0 19 81
> >> > 0 0 0 8067092 33084 369 623 1038 1210 2543 0 2351516 0 0 0 275 63952 115 108074 0 23 77
> >> > 0 0 76 8067092 17212 23 155 543 316 603 48 5757690 0 0 0 152 14712 33 23307 0 24 76
> >> > kthr memory page disk faults cpu
> >> > r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
> >> > 0 0 87 8067088 16056 60 156 487 381 623 48 4998992 0 0 0 145 8506 62 13701 0 20 80
> >> > 0 0 83 8067100 15428 30 94 254 148 285 0 4150075 0 0 0 82 2014 14 2636 0 18 82
> >> > 0 0 76 8333772 16160 65 178 444 441 693 0 5444911 0 0 0 161 2059 15 2382 0 20 80
> >> > 0 0 76 7808952 15260 73 193 455 464 762 0 5289731 0 0 0 134 1349 12 1626 0 21 79
> >> > 0 0 75 8279396 16248 104 336 955 682 1272 0 8309621 0 0 0 326 1580 16 1575 0 27 73
> >> > 0 0 76 7891728 15488 114 350 962 867 1460 0 6556432 0 0 0 240 1413 20 1579 0 25 75
> >> > 0 0 73 8066816 14196 9 18 33 38 67 0 4454583 0 0 0 10 474 7 248 0 16 84
> >> > 0 0 103 8055412 15804 264 525 917 1229 1899 0 5356581 0 0 0 265 3949 288 5345 0 32 68
> >> >
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 239.5 29.8 1277.6 866.6 0.0 2.3 0.0 8.7 0 37 c0t2d0
> >> > 72.6 0.2 609.4 1.9 0.0 0.7 0.0 9.2 0 21 c0t2d0s0
> >> > 166.4 27.5 665.4 852.9 0.0 1.5 0.0 7.6 0 35 c0t2d0s1
> >> > 0.5 2.1 2.8 11.8 0.0 0.2 0.0 69.6 0 11 c0t2d0s3
> >> > 70.0 44.7 4343.0 643.1 0.0 0.8 0.0 6.7 0 43 c5t600039300001742Bd0
> >> > 70.0 44.7 4343.0 643.1 0.0 0.8 0.0 6.7 0 43 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 127.8 28.4 582.1 370.0 0.0 1.5 0.0 9.5 0 23 c0t2d0
> >> > 38.9 0.0 221.4 0.2 0.0 0.4 0.0 11.1 0 14 c0t2d0s0
> >> > 87.5 24.8 350.0 356.8 0.0 0.8 0.0 7.5 0 16 c0t2d0s1
> >> > 1.5 3.5 10.7 13.0 0.0 0.2 0.0 40.0 0 6 c0t2d0s3
> >> > 24.9 11.5 1546.2 226.5 0.0 0.4 0.0 10.7 0 16 c5t600039300001742Bd0
> >> > 24.9 11.5 1546.2 226.5 0.0 0.4 0.0 10.7 0 16 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 239.5 29.8 1277.6 866.6 0.0 2.3 0.0 8.7 0 37 c0t2d0
> >> > 72.6 0.2 609.4 1.9 0.0 0.7 0.0 9.2 0 21 c0t2d0s0
> >> > 166.4 27.5 665.4 852.9 0.0 1.5 0.0 7.6 0 35 c0t2d0s1
> >> > 0.5 2.1 2.8 11.8 0.0 0.2 0.0 69.6 0 11 c0t2d0s3
> >> > 70.0 44.7 4343.0 643.1 0.0 0.8 0.0 6.7 0 43 c5t600039300001742Bd0
> >> > 70.0 44.7 4343.0 643.1 0.0 0.8 0.0 6.7 0 43 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 65.2 32.7 299.0 302.7 0.0 1.3 0.0 13.4 0 17 c0t2d0
> >> > 10.3 0.0 54.8 0.1 0.0 0.2 0.0 22.0 0 7 c0t2d0s0
> >> > 48.8 16.5 195.3 241.9 0.0 0.6 0.0 8.6 0 11 c0t2d0s1
> >> > 6.1 16.2 49.0 60.7 0.0 0.5 0.0 23.1 0 5 c0t2d0s3
> >> > 0.1 3.7 9.2 13.5 0.0 0.0 0.0 0.7 0 0 c4t6000393000017312d0
> >> > 0.1 3.7 9.2 13.5 0.0 0.0 0.0 0.7 0 0 c4t6000393000017312d0s0
> >> > 8.4 13.5 563.8 85.7 0.0 0.1 0.0 5.1 0 4 c5t600039300001742Bd0
> >> > 8.4 13.5 563.8 85.7 0.0 0.1 0.0 5.1 0 4 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 81.1 19.0 352.3 310.5 0.0 0.9 0.0 9.4 0 14 c0t2d0
> >> > 19.8 0.5 107.0 0.6 0.0 0.3 0.0 13.6 0 8 c0t2d0s0
> >> > 61.0 18.2 244.1 308.8 0.0 0.6 0.0 8.2 0 10 c0t2d0s1
> >> > 0.2 0.3 1.2 1.1 0.0 0.0 0.0 29.9 0 1 c0t2d0s3
> >> > 2.9 11.7 167.1 102.7 0.0 0.0 0.0 2.4 0 2 c5t600039300001742Bd0
> >> > 2.9 11.7 167.1 102.7 0.0 0.0 0.0 2.4 0 2 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 135.2 32.0 577.6 501.8 0.0 1.4 0.0 8.1 0 23 c0t2d0
> >> > 37.8 0.4 187.7 0.4 0.0 0.4 0.0 11.2 0 15 c0t2d0s0
> >> > 97.0 31.2 387.9 499.8 0.0 0.9 0.0 7.1 0 17 c0t2d0s1
> >> > 0.4 0.4 2.0 1.6 0.0 0.0 0.0 24.6 0 2 c0t2d0s3
> >> > 0.0 0.7 1.1 2.7 0.0 0.0 0.0 0.4 0 0 c4t6000393000017312d0
> >> > 0.0 0.7 1.1 2.7 0.0 0.0 0.0 0.4 0 0 c4t6000393000017312d0s0
> >> > 1.1 3.7 53.7 26.6 0.0 0.0 0.0 1.9 0 1 c5t600039300001742Bd0
> >> > 1.1 3.7 53.7 26.6 0.0 0.0 0.0 1.9 0 1 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 160.6 41.1 690.0 612.3 0.0 1.1 0.0 5.5 0 25 c0t2d0
> >> > 43.1 0.0 219.6 0.0 0.0 0.3 0.0 7.5 0 15 c0t2d0s0
> >> > 117.3 40.9 469.1 612.2 0.0 0.8 0.0 4.9 0 18 c0t2d0s1
> >> > 0.2 0.2 1.4 0.6 0.0 0.0 0.0 19.0 0 0 c0t2d0s3
> >> > 0.1 0.0 5.6 0.0 0.0 0.0 0.0 4.5 0 0 c5t600039300001742Bd0
> >> > 0.1 0.0 5.6 0.0 0.0 0.0 0.0 4.6 0 0 c5t600039300001742Bd0s0
> >> > extended device statistics
> >> > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> > 7.4 2.6 30.7 35.8 0.0 0.1 0.0 6.5 0 1 c0t2d0
> >> > 1.8 0.0 8.4 0.0 0.0 0.0 0.0 8.8 0 1 c0t2d0s0
> >> > 5.6 2.5 22.3 35.7 0.0 0.0 0.0 5.9 0 1 c0t2d0s1
> >> > 0.0 0.0 0.0 0.1 0.0 0.0 0.0 38.5 0 0 c0t2d0s3
> >> > 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.4 0 0 c4t6000393000017312d0
> >> > 0.0 0.0 0.1 0.1 0.0 0.0 0.0 0.4 0 0 c4t6000393000017312d0s0
> >> > 0.0 0.2 1.4 2.3 0.0 0.0 0.0 1.7 0 0 c5t600039300001742Bd0
> >> > 0.0 0.2 1.4 2.3 0.0 0.0 0.0 1.7 0 0 c5t600039300001742Bd0s0
> >> >
> >> > Please give me some ideas on how to bring down the "cs"
> >> > (context switch) rate. The hardware is 2 x dual-core Opteron:
> >> >
> >> > $ psrinfo -v
> >> > Status of virtual processor 0 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:16.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 1 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:19.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 2 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:21.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
> >> > Status of virtual processor 3 as of: 12/21/2006 18:19:29
> >> >   on-line since 12/21/2006 15:14:23.
> >> >   The i386 processor operates at 2393 MHz,
> >> >         and has an i387 compatible floating point processor.
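On Sergey's original question about bringing "cs" down: note that in
the last vmstat samples before the hang, the sr (page scan rate)
column explodes, which points to severe memory pressure, so the high
context-switch rate may be a symptom rather than the cause. To see
which processes and code paths are actually driving the switches, a
couple of DTrace one-liners against the stock sched provider may help
(only a sketch; the 10-second sampling window is arbitrary):

  # count off-cpu events (context switches away) by process name
  dtrace -n 'sched:::off-cpu { @[execname] = count(); } tick-10s { exit(0); }'

  # the same, keyed by kernel stack, to see why threads are blocking;
  # trunc() keeps only the 20 hottest stacks
  dtrace -n 'sched:::off-cpu { @[stack()] = count(); }
             tick-10s { trunc(@, 20); exit(0); }'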