[axis2] Status on Axis2 1.5.1 and Rampart 1.5

Glen Daniels Mon, 12 Oct 2009 07:15:58 -0700

Hi folks!

OK, so here are the results of my weekend investigations.  The lockup when
running the Rampart 1.5 tests with Axis2 1.5.1 was due to HTTP connection
starvation.  I've fixed two issues and everything works now, but I'd like to
respin both Axis2 1.5.1 and Rampart 1.5 as a result.  Details below.


First, a quick summary of a major change in Axis2 1.5.1: we were formerly
creating a new MultiThreadedHttpConnectionManager all the time in the HTTP
sender code.  In typical usage you'd never see connection pool starvation
(since each new MHCM had a new pool), but two major problems occurred: 1)
connection reuse wasn't really possible, and 2) we would eventually (in
high-volume situations) run into the OS limits for open sockets.  So I fixed
this so that 1.5.1 now reuses a single MHCM for each ConfigurationContext,
which allows for sharing connections across ServiceClient instances.
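
To illustrate the "one pool per context" idea, here is a minimal sketch in
plain Java.  It is not the Axis2 code: Pool, poolFor() and the string context
ids are hypothetical stand-ins for the MHCM and the ConfigurationContext, and
the pool size of 2 is arbitrary.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class SharedPoolSketch {
    /** Hypothetical stand-in for a connection manager: a fixed set of permits. */
    static final class Pool {
        final Semaphore permits;
        Pool(int size) { permits = new Semaphore(size); }
    }

    // One pool per context id, created lazily on first use and then reused.
    private static final Map<String, Pool> POOLS = new ConcurrentHashMap<>();

    static Pool poolFor(String contextId) {
        return POOLS.computeIfAbsent(contextId, id -> new Pool(2));
    }

    public static void main(String[] args) {
        Pool a = poolFor("ctx-1");
        Pool b = poolFor("ctx-1");
        Pool c = poolFor("ctx-2");
        System.out.println(a == b); // true: same context, same pool
        System.out.println(a == c); // false: different context, different pool
    }
}
```

The flip side of sharing is that every client on the same context now draws
from the same finite set of connections, so a connection that is never
released is felt by all of them.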

The bigger problem *behind* the problem above is that users of the Commons
HttpClient library (like Axis2) need to call releaseConnection() on each and
every HttpMethod after they are finished with it.  The
ServiceClient.cleanupTransport() call does this, but since we never told
people to call that explicitly, no one was in the habit of doing it.  A
number of bugs about connection starvation came up, and we put in the
Options.setCallTransportCleanup() option, which automatically calls
cleanupTransport() after each call, but at a cost: before you release
connection resources you need to make sure you've read everything, which
means building the whole Axiom tree.  Bye-bye, streaming.  So I also added a
different connection cleanup option which automatically cleans up the *last*
operation as you're setting up the next one.

So, to make the Rampart story very short, the problem was this: a new
ServiceClient gets created to deal with SecureConversation interactions (see
STSClient.getServiceClient()).  This SC shares the same ConfigurationContext
with the outer (i.e. user) SC, so it shares an MHCM and a connection pool.
The problem is that, since the STS operations happen inside a user-level
operation, the record of the "last operation" gets overwritten, and as a
result my automatic cleanup mechanism can't catch both!  So we lose one
connection each time we go through the STS process, and once the pool is
exhausted that causes a hard lock.
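
The leak can be reproduced with a toy model.  The sketch below is plain Java
and only an approximation under stated assumptions: SharedContext, Operation
and invoke() are hypothetical names, the pool is a Semaphore, and the single
"last" field stands in for the record of the last operation.  The nested
call plays the role of the STS request issued in the middle of a user-level
operation.

```java
import java.util.concurrent.Semaphore;

class LastOperationLeakSketch {
    /** Hypothetical shared state: one pool and a single "last operation" slot. */
    static final class SharedContext {
        final Semaphore pool;
        Operation last; // only ONE operation is remembered
        SharedContext(int poolSize) { pool = new Semaphore(poolSize); }
    }

    static final class Operation {
        final String name;
        boolean holdsConnection = true;
        Operation(String name) { this.name = name; }
    }

    /** Runs one operation; returns false if no connection was available. */
    static boolean invoke(SharedContext ctx, String name, Runnable nested) {
        // Deferred cleanup: release what the remembered previous operation holds.
        if (ctx.last != null && ctx.last.holdsConnection) {
            ctx.last.holdsConnection = false;
            ctx.pool.release();
        }
        if (nested != null) {
            nested.run(); // e.g. a token request made mid-operation
        }
        if (!ctx.pool.tryAcquire()) {
            return false; // a blocking pool would hang here instead
        }
        ctx.last = new Operation(name); // overwrites the nested call's record
        return true;
    }

    /** Counts successful user-level calls before the pool runs dry (capped). */
    static int callsUntilStarved(int poolSize, boolean withNestedCall, int maxCalls) {
        SharedContext ctx = new SharedContext(poolSize);
        for (int i = 0; i < maxCalls; i++) {
            Runnable nested = withNestedCall ? () -> invoke(ctx, "sts", null) : null;
            if (!invoke(ctx, "user", nested)) {
                return i;
            }
        }
        return maxCalls;
    }

    public static void main(String[] args) {
        System.out.println(callsUntilStarved(2, false, 100)); // 100: never starves
        System.out.println(callsUntilStarved(2, true, 100));  // 1: second call starves
    }
}
```

In this model each user-level call with a nested call leaks exactly one
connection, so a pool of size N survives N-1 calls.  Releasing the nested
operation's connection as soon as it completes closes the leak.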

SOLUTION
--------

I did two things to fix this, both of which I think should be reflected in
the released code.  First, in Rampart, I added a call to
setCallTransportCleanup(true) in STSClient - this means that the STS
operations will be forced to build the complete Axiom tree (see above), but
solves the connection starvation issue.  Second, in Axis2, I added a default
30-second timeout while waiting for new connections - this doesn't change the
functionality at all, but it does mean that we can no longer get into
situations where the system just locks up forever.  With that change, we'll
now at least get an Exception if there's a starvation issue, which can then
be debugged.
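
The effect of a bounded wait can be sketched in plain Java as follows.  This
is an illustration only, not the Axis2 change itself: the Semaphore stands in
for the connection pool, acquire() is a hypothetical helper, and the timeout
is shortened to 100 ms so the example finishes quickly.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedAcquireSketch {
    /** Waits for a connection, but gives up with an exception after the timeout. */
    static void acquire(Semaphore pool, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        if (!pool.tryAcquire(timeout, unit)) {
            throw new TimeoutException(
                    "no connection available after " + timeout + " " + unit);
        }
    }

    public static void main(String[] args) throws Exception {
        Semaphore pool = new Semaphore(1);
        acquire(pool, 100, TimeUnit.MILLISECONDS); // takes the only connection
        try {
            // Nothing is ever released, so this gives up after about 100 ms
            // instead of blocking forever.
            acquire(pool, 100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("starvation detected: " + e.getMessage());
        }
    }
}
```

The timeout does not prevent the leak; it only turns a silent hang into a
failure with a stack trace pointing at the caller that could not get a
connection.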

Nandana/all, can you check what I did in Rampart and let me know if you
foresee any problems with it?  I'm going to respin Axis2 1.5.1 with this and
one other fix, and we should respin Rampart 1.5 as well.

Thoughts/comments?

Thanks,
--Glen
