From time to time a jboss process would end up eating all the available CPU and
the load average would skyrocket.

Once the operators restarted jboss, the system would be normal again (sometimes
for weeks) until the next incident.

Since we moved the app from a v440 running Solaris 10 8/07 to a t2000 running
Solaris 10 5/08, the problem started to happen more frequently (2-3 times a
week). The only other modification (besides the hardware and OS release) is that
we put the app server inside a sparse-zone container with FSS and a 95% cap on
CPU usage.

Here's what uptime says:

Mon Nov 24 10:10:00 ARST 2008
 10:10am  up 10 day(s), 14:40,  6 users,  load average: 325.52, 320.07, 318.72

(yes, the load avg is over 320, but the server was still usable)

A little output from mpstat shows this:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  531   259  150   46   27   17   43    0    29   99   1   0   0
  1    0   0    3    66    0  124   65   36   10    0    86   99   1   0   0
  2    0   0    0    23    0   34   22   17    9    0     8  100   0   0   0
  3    0   0    0    13    0   13   12    6    9    0     1  100   0   0   0
  4    0   0    0    13    0   14   12    9    9    0     2  100   0   0   0
  5    0   0    0    38    0   69   37   17    8    0    23  100   0   0   0
  6    0   0    0    25    0   40   24   13    8    0     7   95   5   0   0
  7    0   0    0    15    0   17   14   10    9    0     3  100   0   0   0
  8    0   0    0    11    0   10   10    6    8    0     0  100   0   0   0
  9    0   0    0    12    0   11   11    7    9    0     0  100   0   0   0
 10    0   0    0    14    0   13   13    7    9    0     0  100   0   0   0
 11    0   0    0    13    0   12   12    6    9    0     0  100   0   0   0
 12    0   0    0    25    0   37   24   14    8    0    25  100   0   0   0
 13    0   0    0    13    0   13   12    9    9    0     1  100   0   0   0
 14    0   0    0    13    0   12   12    4    9    0     0  100   0   0   0
 15    0   0    0    13    0   12   12    7    9    0     0  100   0   0   0
 16    0   0    0    12    0   11   11    5    7    0     0  100   0   0   0
 17    0   0    0    14    0   13   13    8    7    0     0  100   0   0   0
 18    0   0    0    13    0   12   12    8    9    0     0  100   0   0   0
 19    0   0    1    13    0   12   12    7    8    0     0  100   0   0   0
 20    0   0    7    25    0   35   25   13    9    0     3  100   0   0   0
 21    0   0   72    45    3   78   41   29   11    0    19   99   1   0   0
 22    0   0    3    44    4   71   39   21   11    0   135  100   0   0   0
 23    0   0    3    20    5   16   14    7    9    0     2  100   0   0   0
 24    0   0    3    17    5   13   12    6    9    0     2   99   1   0   0
 25    0   0    1    15    4   10   10    8    8    0     0  100   0   0   0
 26    0   0    8    15    0   17   14    5   10    0     0  100   0   0   0
 27    0   0    0    12    0   11   11    7    8    0     0  100   0   0   0
 28    0   0    1    11    0   10   10    6    8    0     0  100   0   0   0
 29    0   0    2    67    0  126   66   30   11    0    48  100   0   0   0
 30    0   0    0    19    0   25   18   11    8    0   232  100   0   0   0
 31    0   0    1    27    0   47   26   12   10    0    10  100   0   0   0



So, 100% CPU in usr mode looks a lot like some kind of infinite loop.
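
As a quick sanity check that it's a handful of threads spinning (and not the
whole JVM), something like this should show which LWPs are burning the CPU.
This is an untested sketch: the 997 Hz rate and the 20 s window are arbitrary,
and $1 is the pid of the jboss java process:

/* count on-CPU samples per LWP of the target java process */
profile-997hz
/execname == "java" && pid == $1/
{
        @oncpu[tid] = count();
}

tick-20s
{
        exit(0);
}

The tids with the largest counts are the LWPs to look for in a pstack of the
process.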

To get the actual Java stacks, I tried to dig in with dtrace:

profile-1001us
/(execname == "java") && (pid == $1)/
{
        @[jstack()] = count();
}

tick-20s
{
        exit(0);
}
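
For reference, I ran it roughly like this (the script name is just what I
called it locally, and the pid comes from prstat):

dtrace -s jstacks.d <jboss java pid>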


But I got a LOT of these errors:

dtrace: error on enabled probe ID 1 (ID 52070: profile:::Profile-1001us): 
invalid address (0x96a7e000) in action #2

and the stack traces are mostly hex addresses without names:


              0xf886ca64
              0xf884cec0
              0xf8805d3c
              0xf8805874
              0xf8805c70
              0xf8805c70
              0xf8805d3c
              0xf88380b8
              0xf9509440
              0xf88b7b44
              0xf8805874
              0xf8805874
              0xf8805764
              0xf8805874
              0xf8805874
              0xf8805d3c
              0xf8805874
              0xf8805874
              0xf8805874
              0xf8805874
              0xf8805874
              0xf8805874
              0xf8805764
              0xf8805764
              0xf8805764
              0xf886993c
              0xf9999004
              0xf884cec0
              0xf8805874
              0xf8805d3c
              0xf884cc7c
              0xf994090c
              0xf8838df8
              0xf8800218
              libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x5a0
              libjvm.so`JVM_DoPrivileged+0x500
              libjava.so`Java_java_security_AccessController_doPrivileged__Ljava_security_PrivilegedExceptionAction_2Ljava_security_AccessControlContext_2+0x14
              0xf880e22c
              0xf880e1d0
              0xf89c6470
              0xf9dfc018
              0xf99ee350
              0xf9d74fd8
              0xf8838df8
              0xf8800218
              libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x5a0
              libjvm.so`JVM_DoPrivileged+0x500
              libjava.so`Java_java_security_AccessController_doPrivileged__Ljava_security_PrivilegedExceptionAction_2+0x14
              0xf9467cc8
              0xf9badd44
             9852
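
One thing I haven't tried yet is re-running with a bigger jstack string space,
on the guess that the default is too small and that's what leaves frames as raw
addresses, e.g.:

dtrace -x jstackstrsize=4096 -s jstacks.d <jboss java pid>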


So my questions are:

1) Any ideas on what to do next? I'm pretty sure it's a jboss/application
problem, but I need to get more data to show the jboss/devel people.
2) What's causing those dtrace errors?


thanks in advance


      