We're having issues with our kernel getting bogged down by callout_list_get.
I sent a mail about this a couple of months ago but never got a reply. My server is once again spending 80-90% of its time in the kernel, and looking into it, callout_list_get is using up 81% of the kernel time. Looking at the callout table, it is full of cv_wakeup() calls. Last time the only thing we could do to fix it was reboot, and I have not gotten any further with the problem this time.

Has anyone seen something like this? Are there any suggestions on what I can do next time to get more information out of the system, to help figure out what is going wrong? Even better would be a way to clear out callout_list_get and get the system back to normal. Right now the server loads are between 30 and 50, where they are normally below 1; the CPU is only 4-8% idle, and 85-90% of the time is being spent in the kernel.

Output from top right now:

load averages: 33.5, 33.1, 38.5;  up 14+14:47:02        16:53:58
380 processes: 369 sleeping, 3 running, 2 zombie, 6 on cpu
CPU states:  1.7% idle,  7.6% user, 90.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 2953 ctxsw, 2711 trap, 344965 intr, 272864 syscall, 1 fork, 1844 flt
Memory: 32G phys mem, 5383M free mem, 16G total swap, 16G free swap

We turned both web servers and both mysql database servers off so the machine was idle. The loads were still between 3 and 4, and callout_list_get was still using up all the kernel time, which was now 1-2% of the total CPU time. We then shut down each zone one by one, and still no change, so we decided to reboot the machine.

The last time this happened we also got to the point where rebooting was the only thing left we could think of, and the machine got stuck someplace in either shutdown or boot. We had to contact the data center to have them do a hard reboot. They reported that the system lights looked normal, as if it were up, but the screen was blank and they could get no response from the machine. We assumed we had somehow not cleanly shut down one of the zones.
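For next time, here is a sketch of the kind of data I plan to capture while the problem is live. The command names are my best guess for snv_111b; in particular I am not certain the ::callout dcmd output layout is the same on every build, so the awk column below is an assumption:

```shell
# Sample on-CPU kernel stacks for 30 seconds to see what is calling
# callout_list_get (997 Hz to avoid running in lockstep with the clock tick)
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
           tick-30s { trunc(@, 20); exit(0); }'

# Dump the kernel callout table from the live kernel
echo '::callout' | mdb -k | head -50

# Rough count of callout entries per handler function; using the last
# column as the function name is an assumption about the ::callout layout
echo '::callout' | mdb -k | awk 'NR > 1 { print $NF }' | sort | uniq -c | sort -rn | head
```

If the cv_wakeup() entries dominate again, the stack samples should at least show which kernel paths are scanning the callout lists.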
This time we made sure to shut off all services by hand, and then each zone by hand, before we rebooted the global zone. Again, it got stuck someplace in the reboot process, and the data center reported the same condition: lights all on and normal, but the server unresponsive and the screen blank. A hard reboot (power cycle) brought the machine back.

Top output with all services running after the reboot:

load averages: 0.17, 0.18, 0.08;  up 0+00:03:26         17:30:24
155 processes: 154 sleeping, 1 on cpu
CPU states: 96.3% idle,  2.1% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 2735 ctxsw, 5209 trap, 1312 intr, 7688 syscall, 2 fork, 4582 flt
Memory: 32G phys mem, 29G free mem, 16G total swap, 16G free swap

Top output with the system under normal usage loads:

load averages: 1.01, 0.71, 0.36;  up 0+00:10:17         17:37:15
312 processes: 310 sleeping, 1 zombie, 1 on cpu
CPU states: 85.9% idle,  5.9% user,  8.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 6916 ctxsw, 1060 trap, 2530 intr, 758256 syscall, 687 flt
Memory: 32G phys mem, 27G free mem, 16G total swap, 16G free swap

Looking at hotkernel, callout_list_get does show up in the first screen of top functions. Here is my mail from the last time this happened; the only thing that has changed since then is that we are now using all 4 zones:

We are running an OpenSolaris snv_111b server for our web deployment. We have the machine split into 4 zones (plus the global zone): two are still being set up and not really in use, one is running apache/php, and the other mysql. We are using Crossbow to create two virtual networks to isolate our database zones from each other and from the outside. We noticed over the last week that the loads on the server were slowly creeping up, even when the number of users online was not very high. Eventually we were seeing < 50% idle time on the CPU and loads of 4-5 when the number of users was nearly nothing compared to our normal usage. Our server usually has loads around 1 at peak usage times.
Digging into what was taking up so much time, we found callout_list_get in hotkernel competing with the idle loop as the top function. It was using up 40-50% of the CPU, and its counts were off the chart. Looking at the callout table, cv_wakeup() was 95% of the entries. Eventually the loads reached around 12 and CPU idle time was < 30% with only 300-400 users on the web site, so we decided to try rebooting the system to see if that cleared it up. It did, and now the loads are back to normal at < 1 with over 700 users online.

I am wondering what callout_list_get is doing that would cause it to consume more and more resources over time. Has anyone seen something like this? What other information about the system should I be gathering so we can track down what is going on if it starts to happen again?

Thanks for your help!

-Kristin

--
Kristin Cubanski
http://tomorrowisobsolete.blogspot.com & http://kamundse.blogspot.com
http://flickr.com/photos/kamundse/
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org