Update: in the meantime we found out that the memory leak started at
approx. 20:15 and bottomed out at 21:10 with 40 MB of free memory. At
that point the server started to lose users. By 21:45 it had dropped
about 1000 users, leaving 21 on the server, before it was restarted at
21:50 as I stated in the previous post.
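For reference: the free-memory numbers above come from our periodic
sampling of the JVM memory values. A minimal stand-alone version of
such a sampler (the class name is made up) looks roughly like this:

    import java.util.Date;

    public class MemorySampler {
        public static void main(String[] args) throws InterruptedException {
            Runtime rt = Runtime.getRuntime();
            while (true) {
                long total = rt.totalMemory();   // heap currently claimed from the OS
                long free  = rt.freeMemory();    // unused part of the claimed heap
                long max   = rt.maxMemory();     // the -Xmx ceiling
                long avail = max - total + free; // what the JVM could still hand out
                System.out.println(new Date()
                    + " total=" + (total >> 20) + "M"
                    + " free="  + (free  >> 20) + "M"
                    + " avail=" + (avail >> 20) + "M");
                Thread.sleep(60 * 1000L);        // one sample per minute
            }
        }
    }

One sample per minute is cheap and was enough to see the drop described
above without spamming the logs.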
I don't have profiling info, but ProxyDirContext.cache looks like the
right place for further investigation. I might be wrong, though.
Opinions?

regards
Leon

On 3/16/06, Leon Rosenberg <[EMAIL PROTECTED]> wrote:
> On 3/15/06, Caldarale, Charles R <[EMAIL PROTECTED]> wrote:
> > > From: Leon Rosenberg [mailto:[EMAIL PROTECTED]
> > > Subject: Deadlock -> Out of Threads -> Strange Exception ->
> > > OutOfMemory -> Server Death.
> > >
> > > I can't find a locked <0x61655498> entry anywhere.
> > > But I should be able to see it to recognize the thread causing
> > > the deadlock, shouldn't I?
> >
> > I would think so.
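Btw, since we are on JDK 1.5 now, I could also let the server check
for monitor deadlocks itself via java.lang.management instead of
digging through thread dumps by hand. A minimal sketch (the class name
is made up; it only sees the VM it runs in, so in practice it would
have to be triggered from inside the server, e.g. by a periodic
thread):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DeadlockCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long[] ids = mx.findMonitorDeadlockedThreads(); // null if none
            if (ids == null) {
                System.out.println("no monitor deadlock detected");
                return;
            }
            // report each deadlocked thread, the monitor it waits on,
            // and the thread currently holding that monitor
            ThreadInfo[] infos = mx.getThreadInfo(ids);
            for (int i = 0; i < infos.length; i++) {
                ThreadInfo ti = infos[i];
                System.out.println(ti.getThreadName()
                    + " blocked on " + ti.getLockName()
                    + " held by " + ti.getLockOwnerName());
            }
        }
    }

That would point directly at the threads holding each other's
monitors, without searching for locked <0x61655498> entries by hand.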
> >
> > More questions:
> >
> > 1) Did the server exhibit any other symptoms before the apparent
> > deadlock, such as slow response times?
>
> Unfortunately our monitoring systems do not detect everything, but
> here is what I can reconstruct:
>
> The "all threads busy" log entry came at 20:53:03.
>
> At 21:17:37,965 the server (or rather a monitoring thread in the
> server) detected that it hadn't received any events for 2 minutes,
> and a mail alert was generated. (The backend sends events with user
> logins, approx. 10 per second in the evening; if the events stop
> coming, it means the server was slow or had other problems with the
> tcp-ip connection from the backend and was removed from the
> recipient list.)
>
> Two minutes later a Nagios alert came in:
>
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: CRITICAL
> Date/Time: Tue Mar 14 21:19:36 CET 2006
> Additional Info: CRITICAL - Socket timeout after 10 seconds
>
> Somehow - I don't know why yet - the server recovered, and I
> received the next email at 21:24:
>
> Notification Type: RECOVERY
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: OK
> Date/Time: Tue Mar 14 21:24:26 CET 2006
> Additional Info: HTTP OK HTTP/1.1 200 OK - 47065 bytes in 0.042 seconds
>
> At 21:36 the next problem mail came:
>
> Notification Type: PROBLEM
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: CRITICAL
> Date/Time: Tue Mar 14 21:36:36 CET 2006
> Additional Info: CRITICAL - Socket timeout after 10 seconds
>
> At approx. 21:45 I checked the mails (I'm not the support guy, but I
> feel responsible and am therefore the one who watches the system in
> the evenings), created a thread dump and restarted the server at
> 21:50 (according to the change.log entry I made).
>
> I don't see anything in sar, but sar only shows averages over its
> sampling interval, so if the cpu load was high for a short moment
> (or idle time was zero), sar could have missed it.
>
> > 2) What was the time interval between the all threads busy log
> > entry and the OOME reports?
>
> I don't have a timestamp for the first OOME, but we write marks into
> catalina.out every 5 minutes. The nearest one (before the OOME) is:
>
> 561146456 Tue Mar 14 21:13:18 CET 2006 -MARK-
>
> 21:13-21:15 seems to be the time in question to me.
>
> I don't have it in digital form, but our sysadmin showed me a
> printed Excel sheet today, according to which we lost about 500 MB
> of RAM in 20 minutes (somewhere between 20:55 and 21:15). I will try
> to get the sheet tomorrow and send it to you.
>
> > 3) Do you have -verbose:gc or -XX:+PrintGCDetails on, or perhaps
> > any other monitoring in place that might show the state of the
> > heap during normal running and then leading up to the hang?
>
> No verbose:gc or other options - it would just spam the logs
> completely (and the logs are large already) - but we have monitoring
> for the JVM's freeMemory, totalMemory and availableMemory values as
> well as for /proc/meminfo.
>
> > 4) What are your -Xmx and PermGen size values?
>
> The server has a total of 8 GB RAM, of which at least 4 are usually
> free (our standard servers are 2-4 GB 32-bit machines, so we can't
> use more). The JVM parameters are:
>
> export JAVA_OPTS="-server -mx1200M -ms1200M -Djacorb.config.dir=conf
> -Djacorb.home=$JACORB_HOME -Dorg.omg.CORBA.ORBClass=org.jacorb.orb.ORB
> -Dorg.omg.CORBA.ORBSingletonClass=org.jacorb.orb.ORBSingleton
> -Djavax.net.ssl.trustStore=$HOME/.keystore
> -Djavax.net.ssl.trustStorePassword=xxxxxx"
>
> Until two weeks ago we ran on JDK 1.4. With JDK 1.4 normal memory
> utilization was about 350 MB; we always had about 850 MB free. With
> JDK 1.5 the amount of free memory is mostly over 1 GB; under high
> load it's about 850 MB.
>
> Btw, I noticed long ago that when something happens to Tomcat or the
> JVM (infinite loops, very high load, and so on), the garbage
> collector seems to lose some memory between collections. Probably -
> just guessing, no real knowledge behind it, but it looks this way -
> it tries not to stop all other threads during the collection and
> hands out the required memory from free space (or allocates more if
> total < available), but afterwards forgets about this memory. I know
> this sounds idiotic, but it's the impression one gets watching the
> memory usage under load.
>
> > I'm speculating (aka grasping at straws) that a heap full
> > situation existed for a bit before the hang, but that some code
> > somewhere ate the OOMEs and tried to continue.
>
> According to the log entries above we did in fact need about 20
> minutes to consume all of the memory, so I'd say some code just ate
> memory and did nothing until the available memory ran out.
>
> I almost forgot to mention: apparently the load balancer noticed the
> server having problems before Nagios did - at the time of the OOME
> the server only had 20 users left (I have to check the exact time
> tomorrow), and it takes some time for an inactive user to be removed
> from the server.
>
> > - Chuck
>
> Thanx again for your help
> Leon
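P.S. Regarding code that eats OOMEs: this is the kind of anti-pattern
I have in mind - a purely hypothetical example, not taken from our
code base:

    // A worker loop that catches Throwable to stay alive at any cost.
    public class SwallowingWorker implements Runnable {
        public void run() {
            while (true) {
                try {
                    handleNextEvent(); // placeholder for the real work
                } catch (Throwable t) {
                    // OutOfMemoryError is an Error, not an Exception,
                    // so it lands here too and is silently dropped;
                    // the heap-full condition never reaches the logs.
                }
            }
        }

        private void handleNextEvent() {
            // stands in for whatever the real event handling does
        }
    }

A worker like this survives every OutOfMemoryError, which would explain
why nothing showed up in catalina.out until the heap was completely gone.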