We basically run a server here in our local office behind a firewall, and
the rest of our stuff out on Amazon's EC2 cloud.  We suspect there were
issues with NAT timeouts and half-dead TCP connections.
The specific behaviors we saw using NMS manifested themselves in the
following ways:

1. Client blocked on a TCP connection waiting for messages, while the server
no longer thought the client was connected.

2. Client blocked on a TCP connection while the server reported *multiple*
listeners for a queue that should only have one listener (the count changed
over time, tending to tick upwards and then back down, probably after the
server timed out a dead TCP connection; we sometimes saw a listener count
upwards of 9 or 10 when there should have been only 1).

3. Clients did not always re-establish the connection to the server once the
connection was dead.  We frequently had to restart clients, and occasionally
had to restart the server.

4. Message queues that were idle for long periods exhibited problematic
behavior, while message queues that were active remained available (a huge
indicator of what was going on, once we had fixed #5).

5. Hitting ^C to kill our application without handling the break to properly
close connections caused behavior very similar to what we were eventually
seeing with our TCP connections (a sketch of the clean-shutdown handling
follows this list).  This, of course, made the issue that much more confusing
and difficult to debug, since not all communication problems were rooted at
the network layer and the results were, at least initially, maddeningly
inconsistent.
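
To make #5 concrete: the fix is to close connections on the way out, even on
^C.  A minimal sketch in Java (our actual client was NMS/C#, so this is
illustrative only, and the broker address is made up):

import javax.jms.Connection;
import org.apache.activemq.ActiveMQConnectionFactory;

public class CleanShutdown {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker address.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker.example.com:61616");
        Connection connection = factory.createConnection();
        connection.start();

        // Close the connection on ^C/SIGTERM so the broker sees an orderly
        // socket shutdown instead of a half-dead TCP connection it has to
        // time out on its own (which is what inflated our listener counts).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                connection.close();
            } catch (Exception ignored) {
                // Best effort; we're exiting anyway.
            }
        }));

        // ... consume/produce messages here ...
    }
}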

We experimented with more aggressive request timeouts on the transport
layer/session/connection (we even modified the driver to ensure these were
getting set), setting up static routes, opening firewall ports, and tuning
the TCP timeouts (at least on our end; we have no control over the Amazon
side).  We tried a prefetch size of one and tried to enable keep-alives, but
never figured out how to do it.  The only solution that worked was the
ActiveMQ-to-ActiveMQ bridge, and I suspect that is partly because we were
never able to get keep-alives working and have no control over fine-grained
NAT settings on the Amazon side.
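
For reference, here is roughly what our bridge setup looks like using the
ActiveMQ Java broker API (hostnames are made up, and we run a mirror-image
configuration on the EC2 side):

import org.apache.activemq.broker.BrokerService;

public class OfficeBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("office");
        broker.setPersistent(true);

        // Local listener for the clients inside our firewall.
        broker.addConnector("tcp://0.0.0.0:61616");

        // Store-and-forward bridge to the EC2-side broker.  The static:
        // transport reconnects on its own when the WAN drops the link,
        // and maxInactivityDuration keeps keep-alive traffic flowing so
        // the NAT doesn't silently expire the idle connection.
        broker.addNetworkConnector(
                "static:(tcp://ec2-broker.example.com:61616"
                + "?wireFormat.maxInactivityDuration=30000)");

        broker.start();
        broker.waitUntilStopped();
    }
}

The win here is that the reconnection and keep-alive logic lives in the
broker, which clearly handles it better than our NMS clients did.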

Bryan


On Tue, Sep 9, 2008 at 10:09 AM, James Strachan <[EMAIL PROTECTED]> wrote:

> Maybe the WAN is dropping connections; we have failover in Java; I'm
> not sure we've added that to NMS yet, have we?
>
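
(The failover James mentions is a client-side transport wrapper in the Java
client.  A minimal sketch, with a made-up broker address and illustrative
reconnect delays; this is the piece we could not find in NMS:

import javax.jms.Connection;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FailoverExample {
    public static void main(String[] args) throws Exception {
        // failover:() makes the Java client reconnect transparently
        // when the WAN drops the socket.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://broker.example.com:61616)"
                + "?initialReconnectDelay=1000&maxReconnectDelay=30000");
        Connection connection = factory.createConnection();
        connection.start();
    }
}
)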
> 2008/9/9 Jim Gomes <[EMAIL PROTECTED]>:
> > Hi Bryan,
> > That's interesting.  I wonder where the problem is with the ActiveMQ =>
> > NMS connection.  Without knowing your exact network topology, I can't
> > point to where the problem is.  All I can do is speak to my experience,
> > and I have been able to keep connections alive for a very long time
> > without errors, both with high and low activity, even going over what my
> > infrastructure team has told me is a WAN connection.
> >
> > Best,
> > Jim
> >
> > On Tue, Sep 9, 2008 at 7:35 AM, Bryan Murphy <[EMAIL PROTECTED]> wrote:
> >
> >> Thanks for the info.  I suspected that's what the timeout meant, but
> >> you never really know until you ask.
> >> Anyway, we finally solved our issue.  We set up two instances of
> >> ActiveMQ in the two data centers to forward messages back and forth
> >> between each other.  This is working much better for us.  It seems the
> >> ActiveMQ to ActiveMQ communication is a bit more robust than the
> >> ActiveMQ to Apache.NMS communication (at least when running over a WAN).
> >>
> >> Bryan
> >>
> >> On Mon, Sep 8, 2008 at 2:49 PM, Jim Gomes <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi Bryan,
> >> > I can't answer all of your questions yet, but I can answer some of
> >> > them, anyway.
> >> >
> >> > 1. As far as the ResponseTimeout property goes, that is used for
> >> > network timeouts.  It's not a JMS timeout value like TimeToLive.  The
> >> > ResponseTimeout is used by the client to wait for a response from the
> >> > broker.  Since a network call is inherently a blocking operation (send
> >> > request, wait for response), if we never receive a response from a
> >> > dead/hung broker, the client will hang as well.  The ResponseTimeout
> >> > lets the client abort waiting for the response from the broker.  This
> >> > can be set to whatever performance constraints your application
> >> > requires.  In a WAN environment, this might be set to something fairly
> >> > high where there is a lot of latency in network round-trips.  The
> >> > socket connection is not dropped.  The client simply stops waiting for
> >> > the broker to respond and goes into its error-handling code for a
> >> > non-response.
> >> >
> >> > 2. I see the marshalling code for the KeepAliveInfo, but like you I
> >> > don't see how this is turned on or controlled from the client side.
> >> > This would need more investigation to see if it is enabled via a URI
> >> > parameter, or if new code needs to be written to enable its use.
> >> >
> >> > 3. Can't answer the server-side socket issue.  Don't know that code.
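
(For points 1 and 2 above, the Java client exposes both knobs directly,
which may be a useful reference for what NMS would need.  A minimal sketch,
assuming the standard Java client and a made-up broker address:

import org.apache.activemq.ActiveMQConnectionFactory;

public class TimeoutExample {
    public static void main(String[] args) {
        // wireFormat.maxInactivityDuration drives the inactivity monitor,
        // which exchanges KeepAliveInfo frames on an idle connection;
        // 30s here would keep a NAT table entry from expiring.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "tcp://broker.example.com:61616"
                + "?wireFormat.maxInactivityDuration=30000");

        // Roughly what ResponseTimeout does in NMS: bound the wait for a
        // broker response so a dead/hung broker can't hang the client.
        factory.setSendTimeout(30000);
    }
}
)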
> >> >
> >> >
> >>
> >
>
>
>
> --
> James
> -------
> http://macstrac.blogspot.com/
>
> Open Source Integration
> http://open.iona.com
>
