We're having issues with our kernel getting bogged down by callout_list_get.

I sent a mail about this a couple of months ago but never got a reply.  My
server is once again spending 80-90% of its time in the kernel, and looking
into it, callout_list_get is using up 81% of the kernel time.

Looking at the callout table, it is full of cv_wakeup() entries.
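
For reference, this is roughly how I have been checking the table (a sketch
from memory; the awk field assumes the handler function is the first column
of the ::callout output, so adjust it if the layout differs):

    # Dump the kernel callout table and count entries per handler function.
    echo "::callout" | mdb -k | \
        awk 'NR > 1 { count[$1]++ } END { for (f in count) print count[f], f }' | \
        sort -rn | head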

Last time the only thing we could do to fix it was reboot.  I have not gotten
any further with the problem this time.  Has anyone seen something like
this?  Are there any suggestions on what I can do next time to get more
information out of the system to help figure out what is going wrong?  Even
better would be a way to clear out whatever is keeping callout_list_get busy
and get the system back to normal.
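
Next time it happens, would capturing on-CPU kernel stacks with something
like this DTrace one-liner be a reasonable starting point?  (Just a sketch;
the sample rate and duration are guesses on my part.)

    # Sample kernel stacks for 30 seconds and keep the 20 hottest.
    dtrace -x stackframes=100 -n '
        profile-997
        /arg0/
        { @[stack()] = count(); }
        tick-30s { trunc(@, 20); exit(0); }'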

As of right now the server loads are between 30 and 50, where they are
normally below 1; the CPU is only 4-8% idle, and 85-90% of the time is
being spent in the kernel.

Output from top right now:

load averages:  33.5,  33.1,  38.5;               up 14+14:47:02        16:53:58
380 processes: 369 sleeping, 3 running, 2 zombie, 6 on cpu
CPU states:  1.7% idle,  7.6% user, 90.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 2953 ctxsw, 2711 trap, 344965 intr, 272864 syscall, 1 fork, 1844 flt
Memory: 32G phys mem, 5383M free mem, 16G total swap, 16G free swap

We turned off both web servers and both mysql database servers so the
machine was idle; the loads were still between 3 and 4, and callout_list_get
was still using up all the kernel time, which was now 1-2% of the total CPU
time.  So we shut down each zone one by one, and still there was no change.

We decided to reboot the machine.  The last time this happened we got to
where rebooting was the only thing left we could think of, and the machine
got stuck somewhere in either shutdown or boot.  We had to contact the
data center to have them do a hard reboot.  They reported that the system
lights looked normal, as if it were up, but the screen was blank and they
could get no response from the machine.  We assumed we had somehow not
cleanly shut down one of the zones.  This time we made sure to shut off all
services by hand and then each zone by hand before we rebooted the global
zone.  Again, it got stuck someplace in the reboot process, and the data
center reported the same condition of lights all on and normal but the server
unresponsive and a blank screen.  A hard reboot (power cycle) brought the
machine back.
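
For what it's worth, the shutdown order we used was roughly the following
(a reconstruction from memory; the zone names here are placeholders, not
our real ones):

    # Shut each non-global zone down cleanly from inside.
    zlogin webzone shutdown -i 0 -g 0 -y
    zlogin dbzone shutdown -i 0 -g 0 -y
    zoneadm list -cv            # confirm the zones are no longer running
    # Then reboot the global zone.
    init 6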

Top output with all services running after the reboot:

load averages:  0.17,  0.18,  0.08;               up 0+00:03:26        17:30:24
155 processes: 154 sleeping, 1 on cpu
CPU states: 96.3% idle,  2.1% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 2735 ctxsw, 5209 trap, 1312 intr, 7688 syscall, 2 fork, 4582 flt
Memory: 32G phys mem, 29G free mem, 16G total swap, 16G free swap

Top output with system under normal usage loads:

load averages:  1.01,  0.71,  0.36;               up 0+00:10:17        17:37:15
312 processes: 310 sleeping, 1 zombie, 1 on cpu
CPU states: 85.9% idle,  5.9% user,  8.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 6916 ctxsw, 1060 trap, 2530 intr, 758256 syscall, 687 flt
Memory: 32G phys mem, 27G free mem, 16G total swap, 16G free swap

Looking at hotkernel output, callout_list_get does show up among the top
kernel functions on the screen.

Here is my mail from the last time it happened; the only thing that has
changed since then is that we are now using all 4 zones:

We are running an OpenSolaris snv_111b server for our web deployment.  We
have the machine split into 4 zones (plus the root zone), two are still being
set up so not really in use, one is running apache/php, and the other mysql.
We are using crossbow to create two virtual networks to isolate our database
zones from each other and from the outside.
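
(The Crossbow setup is roughly along these lines; the link and zone names
below are placeholders, not our real ones.)

    # Two isolated virtual switches (etherstubs), one per database network.
    dladm create-etherstub dbstub0
    dladm create-etherstub dbstub1

    # One VNIC on each etherstub for the corresponding database zone.
    dladm create-vnic -l dbstub0 dbvnic0
    dladm create-vnic -l dbstub1 dbvnic1

    # Each database zone uses exclusive IP with its own VNIC.
    zonecfg -z dbzone0 'set ip-type=exclusive; add net; set physical=dbvnic0; end'
    zonecfg -z dbzone1 'set ip-type=exclusive; add net; set physical=dbvnic1; end'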

We noticed over the last week the loads on the server were slowly creeping
up, even when the number of users online was not very high.  Eventually we
were seeing < 50% idle times for the CPU and loads at 4-5 when the number of
users was nearly nothing compared to our normal usage.  Our server usually
has loads around 1 at peak usage times.

Digging into what was taking up so much time we found callout_list_get in
hotkernel competing with the idle loop as the top entry.  It was using up
40-50% of the CPU and its counts were off the chart.  Looking at the callout
table, cv_wakeup() was 95% of the entries.

Eventually the loads reached around 12 and CPU idle time was < 30% with only
300-400 users on the web site, and we decided to try rebooting the system to
see if it cleared it up.  It did, and now the loads are back to normal at < 1
with over 700 users online.

I am wondering what callout_list_get is doing that would cause it to use
more and more resources over time.  Has anyone seen something like this?
What other information about the system should I be getting so we can track
down what is going on if it starts to happen again?
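
Concretely, would a capture along these lines be useful the next time kernel
time starts climbing?  (A sketch of what I would plan to run; the flag
choices are my guesses.)

    # Per-CPU utilization, cross-calls and interrupts over a minute.
    mpstat 5 12 > mpstat.out

    # Kernel profiling: sample what the kernel is doing for 30 seconds.
    lockstat -I -i 997 -s 10 sleep 30 > lockstat-profile.out

    # Snapshot the callout table to see how many cv_wakeup() entries it holds.
    echo "::callout" | mdb -k > callout.out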

Thanks for your help!

-Kristin

-- 
Kristin Cubanski

http://tomorrowisobsolete.blogspot.com & http://kamundse.blogspot.com
http://flickr.com/photos/kamundse/