Server Death. Bug in org.apache.naming.resources.ProxyDirContext.cacheLoad?

Leon Rosenberg Thu, 16 Mar 2006 00:21:53 -0800

Update:

between the memory leakage started approx. at 20:15 and reached the
end of the bottle at 21:10 with 40 Mb free memory. At this time the
server started to loose users. Until 21:45 it droped about 1000 users
leaving 21 on the server, before it got restarted at 21:50 as I stated
in the previous post.


I don't have profiling info, but it looks like ProxyDirContext.cache
is the right place for further investigations, but I might be wrong.

Opinions?

regards
Leon

On 3/16/06, Leon Rosenberg <[EMAIL PROTECTED]> wrote:
> On 3/15/06, Caldarale, Charles R <[EMAIL PROTECTED]> wrote:
> > > From: Leon Rosenberg [mailto:[EMAIL PROTECTED]
> > > Subject: Deadlock -> Out of Threads -> Strange Exception ->
> > > OutOfMemory -> Server Death.
> > >
> > > I can't find a locked <0x61655498> entry nowhere.
> > > But I should be able to see it to recognize the thread causing the
> > > deadlock, shouldn't I?
> >
> > I would think so.  More questions:
> >
> > 1) Did the server exhibit any other symptoms before the apparent
> > deadlock, such as slow response times?
>
> Unfortunatelly our monitoring systems do not detect everything. But
> what I can reconstruct:
> The log entry All Threads busy came at 20:53:03
>
> at 21:17:37,965 The server (or a monitoring thread in the server)
> detected that it hasn't received any events (backend sends events with
> user logins, approx. 10 per second at the evening, if the events
> aren't coming anymore, that means that the server was slow or had
> other problems with the tcp-ip connection from backend and is removed
> from the recipient list) for 2 minutes and a mail alert was generated.
>
> two minutes later a nagious alert came in:
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: CRITICAL
>
> Date/Time: Tue Mar 14 21:19:36 CET 2006
>
> Additional Info:
>
> CRITICAL - Socket timeout after 10 seconds
>
> ... somehow, don't know why yet, the server recovered and I received
> next email at 21:24
> Notification Type: RECOVERY
>
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: OK
>
> Date/Time: Tue Mar 14 21:24:26 CET 2006
>
> Additional Info:
>
> HTTP OK HTTP/1.1 200 OK - 47065 bytes in 0.042 seconds
>
> ----------------------------------
> 21:36 next problem mail came:
> Notification Type: PROBLEM
>
> Service: HTTP_RESPONSE
> Host: xxx
> Address: xxx
> State: CRITICAL
>
> Date/Time: Tue Mar 14 21:36:36 CET 2006
>
> Additional Info:
>
> CRITICAL - Socket timeout after 10 seconds
>
> approx. 21:45 I checked the mails (I'm not the support guy, but just
> feeling responsible, and therefore the one who watches for the system
> in the evenings)
>
> created a thread dump and restarted the server at 21:50 (according to
> the change.log entry i made).
>
> I don't see anything in SAR, but sar only shows mid-times, so if the
> cpuload was high for a short moment (or idle time was zero) sar could
> have missed it.
>
> >
> > 2) What was the time interval between the all threads busy log entry and
> > the OOME reports?
>
> I don't have a timestamp for the first OOME, but we have marks in the
> catalina.out written all 5 minutes. The nearest one (before the oome)
> is
> 561146456 Tue Mar 14 21:13:18 CET 2006 -MARK-
>
> 21:13-21:15 Seems to be the time in question for me.
>
> I don't have it digital, but our sysadmin showed me a printed excel
> sheet today, according to which we lost about 500MB ram in 20 Minutes
> (somewhere between 20:55 and 21:15). I will try to get the sheet
> tomorrow and send it to you.
>
>
> >
> > 3) Do you have -verbose:gc or -XX:+PrintGCDetails on, or perhaps any
> > other monitoring in place that might show the state of the heap during
> > normal running and then leading up to the hang?
>
> No verbose:gc or other options, it would just spam the logs completely
> (and the logs are large already), but we have monitoring for the
> System freeMemory, totalMemory and availableMemory values aswell, as
> for /proc/meminfo.
>
> >
> > 4) What are your -Xmx and PermGen size values?
>
> The server has a total amount of 8 GB Ram, mostly at least 4 of them
> are free (since our standart servers are 2-4GB 32 bit machines and we
> can't use more).
> The JVM parameters are:
>
> export JAVA_OPTS="-server -mx1200M -ms1200M -Djacorb.config.dir=conf
> -Djacorb.home=$JACORB_HOME -Dorg.omg.CORBA.ORBClass=org.jacorb.orb.ORB
> -Dorg.omg.CORBA.ORBSingletonClass=org.jacorb.orb.ORBSingleton
> -Djavax.net.ssl.trustStore=$HOME/.keystore
> -Djavax.net.ssl.trustStorePassword=xxxxxx"
>
> Until two weeks ago we ran with jdk1.4. With jdk1.4 normal memory
> utilization was about 350 MB, we always had about 850MB free. With JDK
> 1.5 the amount of free memory is mostly over 1GB. Under high load its
> about 850MB free memory.
> Btw, I noticed long ago, that when something happens to tomcat or the
> JVM (infinite loops, very high load, and so on) the garbage collector
> seems to loose some memory between the collections. Probably (just
> guessing, no real knowledge behind, but it looks this way) it tries
> not to hold all other threads during the garbage collection and gives
> required memory from free space (or allocate more if total<available)
> but afterwards forgets this memory. I know this sounds idiotically,
> but it's the impression one gets looking at the memory usage under
> load.
>
> >
> > I'm speculating (aka grasping at straws) that a heap full situation
> > existed for a bit before the hang, but that some code somewhere ate the
> > OOMEs and tried to continue.
>
> Well it seems according to above log entries that we in fact needed 20
> minutes to consume  all of the memory, so I'd say some code just ate
> memory and did nothing, until the available memory ended.
> Almost forgot to mention, apparently the load balancer noticed the
> server having problems before nagious did, at the time of the OOME it
> only had 20 users left (have to check the exact time tomorow), and it
> takes some time for an inactive user to be removed from the server.
>
> >
>
>
> >  - Chuck
>
> Thanx again for your help
>
> Leon
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Deadlock -> Out of Threads -> Strange Exception -> OutOfMemory -> Server Death. Bug in org.apache.naming.resources.ProxyDirContext.cacheLoad?

Reply via email to