Hmm. If a timeout were the solution to this problem, how would you be able to tell the difference between something being wrong and the client just being slow?

I did an strace on the server and discovered how the timeout is being used: as a parameter to poll.

6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
6805  10:31:15 <... poll resumed> )     = 1 ([{fd=94, revents=POLLIN}])
6805 10:31:15 recvfrom(94, "CONNECT\npasscode:...."..., 8192, 0, NULL, NULL) = 39
6805 10:31:15 sendto(94, "CONNECTED\nsession:ID:mmq1-40144-"..., 53, 0, NULL, 0) = 53
6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000) = 1 ([{fd=94, revents=POLLIN}])
6805 10:31:15 recvfrom(94, "SUBSCRIBE\nactivemq.prefetchSize:"..., 8192, 0, NULL, NULL) = 138
6805 10:31:15 sendto(94, "RECEIPT\nreceipt-id:39ef0e071a549"..., 55, 0, NULL, 0) = 55
6805 10:31:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
6805  10:32:15 <... poll resumed> )     = 0 (Timeout)
6805 10:32:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
6805  10:33:15 <... poll resumed> )     = 0 (Timeout)
6805 10:33:15 poll([{fd=94, events=POLLIN|POLLERR}], 1, 60000 <unfinished ...>
6805  10:34:15 <... poll resumed> )     = 0 (Timeout)

In the output above I stripped lines that were not operations directly on the socket. The poll timeouts continue like this indefinitely, with nothing in between.
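
To make the failure mode concrete, here is a rough sketch (not the
broker's actual code, and the host/port are just placeholders) of a
read loop built on a socket timeout. When the timeout fires the loop
simply retries, so a dead-but-ESTABLISHED peer looks exactly like an
idle one:

import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadLoopSketch {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("mmq1", 61613); // placeholder endpoint
        socket.setSoTimeout(60000);                // shows up as poll(..., 60000) in strace
        InputStream in = socket.getInputStream();
        byte[] buf = new byte[8192];
        while (true) {
            try {
                int n = in.read(buf);
                if (n < 0) break;                  // EOF: the peer closed cleanly
                // ... process n bytes ...
            } catch (SocketTimeoutException idle) {
                // Nothing arrived within 60 s. A dead host and a slow
                // client look identical here, so we just poll again.
            }
        }
        socket.close();
    }
}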

[r...@mmq1 tmp]# lsof -p 6755 | grep mmq1
java 6755 root 85u IPv6 1036912 TCP mmq1.eng.e-dialog.com:61613 (LISTEN)
java 6755 root 92u IPv6 1038039 TCP mmq1.eng.e-dialog.com:61613->10.0.13.230:46542 (ESTABLISHED)
java 6755 root 94u IPv6 1036997 TCP mmq1.eng.e-dialog.com:61613->mmd2.eng.e-dialog.com:41743 (ESTABLISHED)

The connection to mmd2 is the one to the host that is gone; the one to 10.0.13.230 is up and active. When I kill -9 the process on 10.0.13.230 I see this in the logs:

2010-04-13 17:13:55,322 | DEBUG | Transport failed: java.io.EOFException | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ Transport: tcp:///10.0.13.230:45463
java.io.EOFException
        at java.io.DataInputStream.readByte(Unknown Source)
        at org.apache.activemq.transport.stomp.StompWireFormat.readLine(StompWireFormat.java:186)
        at org.apache.activemq.transport.stomp.StompWireFormat.unmarshal(StompWireFormat.java:94)
        at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:211)
        at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:203)
        at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:186)
        at java.lang.Thread.run(Unknown Source)
2010-04-13 17:13:55,325 | DEBUG | Stopping connection: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
2010-04-13 17:13:55,325 | DEBUG | Stopping transport tcp:///10.0.13.230:45463 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task
2010-04-13 17:13:55,326 | DEBUG | Stopped transport: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
2010-04-13 17:13:55,327 | DEBUG | Cleaning up connection resources: /10.0.13.230:45463 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
2010-04-13 17:13:55,327 | DEBUG | remove connection id: ID:mmq1-58415-1271193024658-2:3 | org.apache.activemq.broker.TransportConnection | ActiveMQ Task
2010-04-13 17:13:55,328 | DEBUG | masterb1 removing consumer: ID:mmq1-58415-1271193024658-2:3:-1:1 for destination: queue://Producer/TESTING/weight/three/ | org.apache.activemq.broker.region.AbstractRegion | ActiveMQ Task

This is exactly what I want to happen when the host goes down. The kill -9 case works because the kernel closes the socket when the process exits, so the broker's read hits EOF. When the whole host dies, nothing is ever sent on the connection, so as far as the broker's TCP stack is concerned the connection is still fine.

It seems to me that something should be noticing that the connection is really gone. Maybe this is more of a kernel issue. I would have thought that when the poll times out, something would move the connection out of the ESTABLISHED state and close it.

We are using Linux, kernel version 2.6.18, but I've seen this same issue on a range of different 2.6 versions.
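
For what it's worth, the kernel mechanism designed to catch exactly
this case is TCP keepalive. A sketch of what enabling it would look
like, assuming one could get at the broker's socket (untested on our
setup):

// Sketch only: with SO_KEEPALIVE set, the kernel probes an idle
// ESTABLISHED connection and resets it if the peer never answers,
// so a blocked read eventually fails with an IOException instead
// of sitting in poll() forever. Probe timing comes from the
// net.ipv4.tcp_keepalive_{time,intvl,probes} sysctls, which default
// to 7200 s / 75 s / 9 probes on 2.6 kernels.
socket.setKeepAlive(true);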

-Josh


On 04/14/2010 09:38 AM, Josh Carlson wrote:
Thanks, Gary, for the (as usual) helpful information.

It looks like the broker may be suffering from exactly the same
problem we encountered when implementing client-side failover: namely,
that when the master broker went down, a subsequent read on the socket
by the client would hang (well, actually take a very long time to
fail/time out). In that case our TCP connection stayed ESTABLISHED,
and looking at the broker I see the same thing after the client host
goes away (the connection is ESTABLISHED). We fixed this issue in our
client by setting the socket option SO_RCVTIMEO on the connection to
the broker.

I noted that the broker appears to do the same thing with the TCP
transport option soTimeout. It looks like when this is set it winds up
as a call to java.net.Socket.setSoTimeout when the socket is being
initialized. I have not done any socket programming in Java, but my
assumption is that SO_TIMEOUT maps to both SO_RCVTIMEO and SO_SNDTIMEO
in the C world.
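
Though reading the javadoc again, it looks like it is narrower than
that: setSoTimeout is a read timeout only (so closer to SO_RCVTIMEO
alone), and a timed-out read throws SocketTimeoutException while the
socket stays open and connected. A minimal sketch of the behaviour:

// Read timeout only; writes are unaffected.
socket.setSoTimeout(60000);
try {
    int b = socket.getInputStream().read();  // blocks for at most ~60 s
} catch (SocketTimeoutException e) {
    // The connection is still valid after this fires; nothing is torn down.
}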

I was hopeful about this option, but when I set it in my transport connector:

<transportConnector name="stomp" uri="stomp://mmq1:61613?soTimeout=60000"/>

the timeout does not occur. I actually ran my test case about 15 hours
ago, and I can see that the broker still has an ESTABLISHED connection
to the dead client and still has a message dispatched to it.

Am I misunderstanding what soTimeout is for? I can see in
org.apache.activemq.transport.tcp.TcpTransport.initialiseSocket that
setSoTimeout is getting called unconditionally. So what I'm wondering
is whether it is actually being called with a value of 0 despite the
way I set up my transport connector. Setting it to 0 would explain why
it apparently never times out, whereas in our client case it
eventually did time out (because we were not setting the option at all
before).
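
The javadoc would back that up, since zero is documented to mean no
timeout at all:

// From the Socket javadoc: a timeout of zero is interpreted as an
// infinite timeout, i.e. read() blocks until data arrives or the
// connection actually fails.
socket.setSoTimeout(0);

One untested guess as to why my value might not be reaching the
socket: the ActiveMQ transport reference says that connector options
intended for the sockets it accepts need a transport. prefix, so
perhaps it should be:

<transportConnector name="stomp" uri="stomp://mmq1:61613?transport.soTimeout=60000"/>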




On 04/14/2010 05:15 AM, Gary Tully wrote:
The re-dispatch is triggered by the tcp connection dying; netstat can
help with the diagnosis here. Check the connection state of the broker
port after the client host is rebooted. If the connection is still
active, possibly in a TIME_WAIT state, you may need to configure some
additional timeout options on the broker side.

On 13 April 2010 19:43, Josh Carlson <jcarl...@e-dialog.com> wrote:

     I am using client acknowledgements with a prefetch size of 1 with
     no message expiration policy. When a consumer subscribes to a
     queue I can see that the message gets dispatched correctly. If the
     process gets killed before retrieving and acknowledging the
     message I see the message getting re-dispatched (correctly). I
     expected this same behaviour if the host running the process gets
     rebooted or crashes. However, after reboot I can see that the
     message is stuck in the dispatched state to the consumer that is
     long gone. Is there a way that I can get messages re-dispatched
     when a host running consumer processes gets rebooted? How does
     ActiveMQ detect the case when a process dies (even with SIGKILL)?

     I did notice that if I increase my prefetch size and enqueue
     another message after the reboot, activemq will re-dispatch
     the original message. However, with a prefetch size of one, the
     message never seems to get re-dispatched.




--
http://blog.garytully.com

Open Source Integration
http://fusesource.com
