[ 
https://issues.apache.org/jira/browse/AMQ-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580607#comment-14580607
 ] 

Volker Kleinschmidt commented on AMQ-5543:
------------------------------------------

We've seen this very same issue: the broker is unable to create a native thread 
due to the ulimit setting, then gets into a bad state and sulks for hours, 
slamming the door in the face of every client that tries to connect. On the 
client side this shows up as sockets being closed from the remote end, which 
looks like potential external interference at first, but a look at the broker 
log makes it clear that the broker itself is resetting these sockets.

I vote for making the broker more resilient against this type of problem, since 
hitting a ulimit and thus not being able to create a new pool worker or some 
such shouldn't throw the entire broker into the abyss. Here's the OOM reported 
in the JVM's stdout.log:
{noformat}
INFO   | jvm 1    | 2015/06/08 05:30:26 | WARNING: RMI TCP Accept-0: accept loop for ServerSocket[addr=localhost/127.0.0.1,localport=42882] throws
INFO   | jvm 1    | 2015/06/08 05:30:26 | java.lang.OutOfMemoryError: unable to create new native thread
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at java.lang.Thread.start0(Native Method)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at java.lang.Thread.start(Thread.java:714)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:414)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:371)
INFO   | jvm 1    | 2015/06/08 05:30:26 |       at java.lang.Thread.run(Thread.java:745)
{noformat}
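
To make concrete what I mean by "more resilient": the sketch below is not 
ActiveMQ's actual transport or accept-loop code (the class name, port, and 
handler are invented for illustration); it just shows the general shape of 
treating a failed worker-thread spawn as a per-connection error, so a single 
OutOfMemoryError refuses that one connection instead of wedging the whole 
acceptor and, with it, the broker.
{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Illustrative only, not ActiveMQ code. Shows an accept loop that survives
// "unable to create new native thread" by dropping the single affected
// connection rather than letting the error escape the loop.
public class ResilientAcceptLoop {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newCachedThreadPool();
        try (ServerSocket server = new ServerSocket(61616)) {
            while (true) {
                Socket client = server.accept();
                try {
                    workers.execute(() -> handle(client));
                } catch (RejectedExecutionException | OutOfMemoryError e) {
                    // No worker available (e.g. ulimit hit, pool saturated or
                    // shut down): refuse this one connection and keep accepting.
                    safeClose(client);
                }
            }
        }
    }

    private static void handle(Socket client) {
        // Per-connection work would go here; this stub just closes the socket.
        safeClose(client);
    }

    private static void safeClose(Socket s) {
        try { s.close(); } catch (IOException ignored) { }
    }
}
{code}
Whether the real fix belongs at this level or in the transport's thread pool 
configuration I'll leave to the developers; the point is that the failure 
should stay local to the connection that triggered it.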


> Broker logs show repeated "java.lang.IllegalStateException: Timer already 
> cancelled." lines
> --------------------------------------------------------------------------------------------
>
>                 Key: AMQ-5543
>                 URL: https://issues.apache.org/jira/browse/AMQ-5543
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Transport
>    Affects Versions: 5.10.0
>            Reporter: Tim Bain
>
> One of our brokers running 5.10.0 spewed over 1 million of the following 
> exceptions to the log over a 2-hour period:
> Transport Connection to: tcp://127.0.0.1:xxxxx failed: java.io.IOException: 
> Unexpected error occurred: java.lang.IllegalStateException: Timer already 
> cancelled.
> Clients were observed to hang on startup (which we believe means they were 
> unable to connect to the broker) until the broker was rebooted; we haven't 
> seen the exception since.
> Once the exceptions started, there were no stack traces or other log lines 
> that would indicate anything else about the cause, just those messages 
> repeating.  The problems started immediately (a few milliseconds) after an 
> EOFException in the broker logs; we see those EOFExceptions pretty often and 
> they've never before resulted in "Timer already cancelled" exceptions, so 
> that might indicate what got us into a bad state but then again it might be 
> entirely unrelated.
> I searched JIRA and the mailing list archives for similar issues, and 
> although there are a lot of incidences of "Timer already cancelled" 
> exceptions, none of them exactly match our situation.
> * 
> http://activemq.2283324.n4.nabble.com/Producer-connections-keep-breaking-td4671152.html
> describes repeated copies of the line in the logs and is the closest 
> parallel I've found, but it sounded like messages were still getting passed, 
> albeit more slowly than normal, whereas the developer on my team who hit this 
> said he didn't think any messages were getting sent.
> * AMQ-5508 has a detailed investigation into the root cause of the problem 
> that Pero Atanasov saw, but his scenario occurred only on broker shutdown, 
> whereas our broker was not shutting down at the time, and it wasn't clear 
> that the log line repeated for him.
> * 
> http://activemq.2283324.n4.nabble.com/Timer-already-cancelled-and-KahaDB-Recovering-checkpoint-thread-after-death-td4676684.html
> has repeated messages like I'm seeing, but appears to be specific to KahaDB, 
> which I'm not using (we use non-persistent messages only).
> * AMQ-4805 has the same inner exception but not the full log line, and it 
> shows a full stack trace whereas I only see the one line without the stack 
> trace.
> * 
> http://activemq.2283324.n4.nabble.com/can-t-send-message-Timer-already-cancelled-td4680175.html
> appears to see the exception on the producer when sending rather than on the 
> broker.
> The only thing that we observed that might be related was that earlier in the 
> broker's uptime, the developer who was running it ran up against his maxProc 
> limit and wasn't able to create new native threads.  That didn't result in 
> any issues in the broker logs at the time (we only hit these issues several 
> hours later), so I'm skeptical that it's related given that there are lots of 
> other things that cause "Timer already cancelled" exceptions as evidenced by 
> the partial list above, but it could be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
