Depending on finalize() is really not what you want to do, so I think
the API change would be preferable.
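
Something like this is what I have in mind -- only a rough sketch, and
"jmxc" is just my guess at whatever field NodeProbe actually keeps the
JMXConnector in:

    // Hypothetical addition to NodeProbe (not an actual patch): expose a
    // close() that shuts down the underlying JMX connection.
    public void close() throws IOException
    {
        jmxc.close();
    }

Callers like Bill's webapp could then create the probe and close() it in a
try/finally around each request, or keep a single probe open for the life
of the application.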

Bye,
Norman


2010/10/26 Bill Au <bill.w...@gmail.com>:
> I would be happy to submit a patch, but it is a bit trickier than simply
> calling JMXConnector.close().  NodeProbe's use of the JMXConnector is not
> exposed in its API.  The JMX connection is created in NodeProbe's
> constructor.  Without changing the API, the only place to call close() would
> be in NodeProbe's finalize().  I am not sure if that's the best thing to
> do.  I think this warrants a discussion on the developer mailing list.  I
> will start a new mail thread there.
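>
> For reference, the finalize() fallback I am describing would look roughly
> like this (just a sketch; "jmxc" stands in for whatever field actually
> holds the JMXConnector):
>
>     @Override
>     protected void finalize() throws Throwable
>     {
>         try
>         {
>             if (jmxc != null)
>                 jmxc.close();  // best-effort cleanup when the probe is GC'd
>         }
>         finally
>         {
>             super.finalize();
>         }
>     }
>
> Since finalization is not guaranteed to run promptly (or at all), this would
> only narrow the window for the leak rather than close it.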
>
> Anyway, I am still trying to understand why the JMX server connection
> timeout threads pile up rather quickly when I restart a node in a live
> cluster.  I took a look at the Cassandra source and see that NodeProbe is
> the only place that creates and uses a JMX connection, and NodeProbe is
> only used by the tools.  So it seems that there is another JMX thread leak
> in Cassandra.
>
> Bill
>
> On Fri, Oct 22, 2010 at 4:33 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Is the fix as simple as calling close() then?  Can you submit a patch for
>> that?
>>
>> On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bill.w...@gmail.com> wrote:
>> > Not with the nodeprobe or nodetool command, because the JVM these two
>> > commands spawn has a very short life span.
>> >
>> > I am using a webapp to monitor my Cassandra cluster.  It pretty much uses
>> > the same code as the NodeCmd class.  For each incoming request, it creates
>> > a NodeProbe object and uses it to get various status of the cluster.  I can
>> > reproduce the Cassandra JVM crash by issuing requests to this webapp in a
>> > bash while loop.  I took a deeper look and here is what I discovered:
>> >
>> > In the webapp, when NodeProbe creates a JMXConnector to connect to the
>> > Cassandra JMX port, a thread
>> > (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created
>> > and run in the webapp's JVM.  Meanwhile, in the Cassandra JVM there is a
>> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
>> > time out the remote JMX connection.  However, since NodeProbe does not call
>> > JMXConnector.close(), the JMX client checker thread remains in the webapp's
>> > JVM even after the NodeProbe object has been garbage collected.  So this
>> > JMX connection is still considered open, and that keeps the JMX timeout
>> > thread running inside the Cassandra JVM.  The number of JMX client checker
>> > threads in my webapp's JVM matches up with the number of JMX server timeout
>> > threads in my Cassandra JVM.  If I stop my webapp's JVM, all the JMX server
>> > timeout threads in my Cassandra JVM disappear after 2 minutes, the default
>> > timeout for a JMX connection.  This is why the problem cannot be reproduced
>> > by nodeprobe or nodetool.  Even though JMXConnector.close() is not called,
>> > the JVM exits shortly, so the JMX client checker threads do not stay
>> > around, and their corresponding JMX server timeout threads go away after
>> > two minutes.  This is not the case with my webapp: since its JVM keeps
>> > running, all the JMX client checker threads keep running as well.  The
>> > threads keep piling up until they crash Cassandra's JVM.
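>> >
>> > To make the pattern concrete, here is roughly what happens on the webapp
>> > side for every request (the host/port and variable names are only for
>> > illustration; NodeProbe does the equivalent internally via
>> > javax.management.remote, and exception handling is omitted):
>> >
>> >     JMXServiceURL url = new JMXServiceURL(
>> >         "service:jmx:rmi:///jndi/rmi://cassandra-host:8080/jmxrmi");
>> >     JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
>> >     MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
>> >     // ... read cluster status through MBeans, as NodeCmd does ...
>> >     // jmxc.close() is never called, so the client-side Checker thread
>> >     // lives on, and the matching server-side Timeout thread in the
>> >     // Cassandra JVM never goes away.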
>> >
>> > In my case I think I can change my webapp to use a static NodeProbe
>> > instead of creating a new one for every request.  That should get around
>> > the leak.
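>> >
>> > Roughly what I mean (the NodeProbe constructor arguments here are from
>> > memory and may not match the class exactly):
>> >
>> >     // One shared, long-lived probe instead of one per request, so only a
>> >     // single JMX connection (and checker/timeout thread pair) exists.
>> >     private static final NodeProbe probe;
>> >     static
>> >     {
>> >         try
>> >         {
>> >             probe = new NodeProbe("cassandra-host", 8080);
>> >         }
>> >         catch (Exception e)
>> >         {
>> >             throw new ExceptionInInitializerError(e);
>> >         }
>> >     }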
>> >
>> > However, I have seen the leak occur in another situation.  On more than
>> > one occasion when I restarted one node in a live multi-node cluster, I saw
>> > the JMX server timeout threads quickly pile up (numbering in the
>> > thousands) in Cassandra's JVM.  It only happened on a live cluster that is
>> > servicing read and write requests.  I am guessing the hinted handoff might
>> > have something to do with it.  I am still trying to understand what is
>> > happening there.
>> >
>> > Bill
>> >
>> >
>> > On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jbel...@gmail.com>
>> > wrote:
>> >>
>> >> can you reproduce this by, say, running nodeprobe ring in a bash while
>> >> loop?
>> >>
>> >> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bill.w...@gmail.com> wrote:
>> >> > One of my Cassandra servers crashed with the following:
>> >> >
>> >> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
>> >> > CassandraDaemon.java (line 82) Uncaught exception in thread
>> >> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> >> > java.lang.OutOfMemoryError: unable to create new native thread
>> >> >         at java.lang.Thread.start0(Native Method)
>> >> >         at java.lang.Thread.start(Thread.java:597)
>> >> >         at
>> >> > org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >> >
>> >> >
>> >> > I took thread dumps of the JVM on all the other Cassandra servers in my
>> >> > cluster.  They all have thousands of threads looking like this:
>> >> >
>> >> > "JMX server connection timeout 183373" daemon prio=10
>> >> > tid=0x00002aad230db800
>> >> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >> >         at java.lang.Object.wait(Native Method)
>> >> >         at
>> >> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>> >> >         - locked <0x00002aab056ccee0> (a [I)
>> >> >         at java.lang.Thread.run(Thread.java:619)
>> >> >
>> >> > It seems to me that there is a JMX thread leak in Cassandra.  NodeProbe
>> >> > creates a JMXConnector but never calls its close() method.  I tried
>> >> > setting jmx.remote.x.server.connection.timeout to 0, hoping that would
>> >> > disable the JMX server connection timeout threads, but that did not make
>> >> > any difference.
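>> >> >
>> >> > For reference, that property is meant to be passed in the environment
>> >> > map when a JMX connector server is created programmatically, roughly
>> >> > like this (placeholder names, exception handling omitted; Cassandra
>> >> > itself uses the JVM's built-in management agent):
>> >> >
>> >> >     Map<String, Object> env = new HashMap<String, Object>();
>> >> >     env.put("jmx.remote.x.server.connection.timeout", Long.valueOf(0L));
>> >> >     JMXConnectorServer server =
>> >> >         JMXConnectorServerFactory.newJMXConnectorServer(url, env, mbs);
>> >> >     server.start();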
>> >> >
>> >> > Has anyone else seen this?
>> >> >
>> >> > Bill
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of Riptano, the source for professional Cassandra support
>> >> http://riptano.com
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>
