Objective and Scope
Overview
The broker is prone to heap exhaustion, leading to OutOfMemory errors. The server's main use of memory is storing messages on queues; producer flow control is needed to manage that area, but it is out of scope for this work.
The other major source of memory consumption is the network buffers used to hold unprocessed AMQP frames once they have been read from the socket. These are currently able to grow uninhibited, and there is no means available to control them. Further, the transport layer is poorly implemented and difficult to work with. Improving encapsulation is an explicit goal of this work.
Problem Statement
When the broker is unable to process frames as quickly as they are being sent, these buffers begin to fill, and the broker has no way to limit their growth. For the broker to manage its memory usage effectively, it needs at least to be able to place an upper bound on the size of its network buffers. It also has no way to know how large those buffers currently are.
Current Architecture
- The current MINA networking uses unbounded buffers.
- We replace over a dozen MINA classes, none of which have any unit test coverage. We failed to get our patches upstream and haven't attempted since then.
- Existing unit test coverage is minimal (approx 30%)
- Improving unit test coverage is difficult due to poor encapsulation
- Poor encapsulation has led to tight coupling of MINA to the server
- The current behaviour of send() leaves the potential for message loss when not using transactions, and violates the JMS spec. Persistent messages held in either the client's or the server's buffers before being written to disk can be lost.
- MINA's internal state is currently a black box, leaving no way to determine how much memory is being used by an individual client connection.
- The way that we use MINA is suboptimal for our purposes but is difficult to change due to the tight coupling
- Supporting alternative transport layers is impossible due to tight coupling of MINA (OSI layer 4) with the AMQP handlers (OSI layer 7).
Exclusions / Assumptions
- No AMQP semantics are involved. The aim of this work is purely to limit the size of the network buffers between the client producing AMQP frames and the broker processing them. It does not involve any protocol-specific work. In OSI terms, this work is aimed at layer 4, not layer 7.
- Higher level information should be determined by the broker itself. No policy will be applied beyond blocking reads if the buffer is full.
- Buffers are sized uniformly across all connections
- Buffer sizes are fixed at startup and do not change
- Standard TCP flow control is the only mechanism used to signal to the client that it should cease sending data (see the sketch after this list)
- It is better for the client to block further writes to the socket than to allow memory consumption to grow unimpeded
- The broker should not block
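As an illustration of these assumptions, the sketch below shows a receiver that performs no further socket reads once a fixed-capacity frame buffer is full; the kernel receive buffer then fills, the TCP window closes, and the peer's writes eventually block. That is standard TCP flow control, with no AMQP-level signalling. All names here are invented for illustration; this is not existing Qpid code.

    import java.io.InputStream;
    import java.net.Socket;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class BoundedReceiver implements Runnable
    {
        private final Socket socket;
        // Fixed capacity, set once at startup, uniform across connections.
        private final BlockingQueue<byte[]> frames =
                new ArrayBlockingQueue<byte[]>(64);

        BoundedReceiver(Socket socket)
        {
            this.socket = socket;
        }

        public void run()
        {
            try
            {
                InputStream in = socket.getInputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1)
                {
                    byte[] copy = new byte[n];
                    System.arraycopy(chunk, 0, copy, 0, n);
                    // put() waits while the buffer is full, so no further
                    // socket reads happen; the kernel buffer then fills and
                    // TCP flow control throttles the sender.
                    frames.put(copy);
                }
            }
            catch (Exception e)
            {
                // Socket closed or thread interrupted: tear down the connection.
            }
        }
    }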
High-Level Technical Architecture
Functional Requirements
- Buffer size control
- TCP options: nodelay, keepalive, window size, etc.
- SSL: link-level encryption; do we want to consider things like certificate validation here or at a higher level? Consult with RHS
- (Log when tcp flow control kicks in) - RG
- (Information available about current memory usage available through JMX interface) - RG out of scope
- (Dynamic removal of buffer bounds?) (fundamentally not possible with TransportIO) - RG
- Signal on idle (requires timer support)
- Need to be notified when the socket has been closed
- The broker needs to know that the transport layer is full and that further writes will block: "I took your frame, but don't send any more just now until I clear this Future" (see the sketch after this list)
- Non-TCP transports such as InVM and InfiniBand
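The "transport layer is full" requirement suggests a contract along the following lines. This is only a sketch with invented names, using CompletableFuture as a stand-in for whatever notification type the transport layer ends up with:

    import java.nio.ByteBuffer;
    import java.util.concurrent.CompletableFuture;

    interface FlowControlledSender
    {
        /**
         * Always accepts the frame ("I took your frame"), but the returned
         * Future tells the caller whether it may continue: it is already
         * complete while there is room, and completes later, once the
         * buffer drains, if the transport has just become full.
         */
        CompletableFuture<Void> send(ByteBuffer frame);
    }

A caller that honours the contract simply waits on the returned Future before offering the next frame.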
Non-Functional Requirements
- Startup loading of transport plugins
- User can select a specific transport to use
- Peer A running transport A can talk to Peer B running transport B
- Inactive connections do not require threads (broker only; the client can probably live with a thread per connection)
- The semantics of org.apache.qpid.BasicMessageProducer.send() need to change: it may now block if there is not enough free space to write the entire message out. The change to this method's semantics needs to be considered in light of the stated JMS semantics and the change to support acknowledgement of publishes in AMQP 0-10 and higher. A sketch of the possible blocking behaviour follows this list.
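A minimal sketch of what the blocking write path could look like, assuming a byte-counting semaphore guarding a fixed-size buffer. All names are invented; this is not the current BasicMessageProducer code:

    import java.nio.ByteBuffer;
    import java.util.concurrent.Semaphore;

    class BlockingWriteBuffer
    {
        private final Semaphore freeBytes;

        BlockingWriteBuffer(int capacityBytes)
        {
            freeBytes = new Semaphore(capacityBytes);
        }

        // Called from send(): waits until the whole encoded message fits,
        // instead of letting the buffer grow without bound.
        void write(ByteBuffer encodedMessage) throws InterruptedException
        {
            freeBytes.acquire(encodedMessage.remaining());
            enqueueForSocketWrite(encodedMessage);
        }

        // Called by the network writer once bytes have reached the socket.
        void onBytesWritten(int count)
        {
            freeBytes.release(count);
        }

        private void enqueueForSocketWrite(ByteBuffer buffer)
        {
            // Hand off to the transport; omitted here.
        }
    }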
Architecture Design
Common should have an interface which all transport plugins can implement and which the server and client can use. The interface would include a means to set the standard socket options and to limit its total memory usage.
Overview of Design
Common will hold a transport layer interface to which the existing MINA transport will be ported. We will also port the 0-10 client's o.a.q.transport.network.io package to that interface. This interface should be quite simple.
Methods to send, receive, flush, close, open and listen, plus a method to set TCP options, are likely to be sufficient. These would operate on a QpidByteBuffer (essentially the MINA ByteBuffer) to avoid having to fix our use of expanding buffers at the same time.
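As a rough illustration of the shape this interface might take (all names are placeholders, not a committed API):

    // Stand-in for the buffer type discussed above (essentially the
    // imported MINA ByteBuffer).
    class QpidByteBuffer
    {
    }

    interface NetworkTransport
    {
        void open(String host, int port, TransportConfig config);  // client side
        void listen(int port, TransportConfig config);             // broker side
        void send(QpidByteBuffer frame);
        QpidByteBuffer receive();
        void flush();
        void close();
    }

    // Standard socket options plus the new upper bound on buffer memory.
    class TransportConfig
    {
        boolean tcpNoDelay;
        boolean keepAlive;
        int socketBufferSize;      // influences the TCP window
        int maxBufferBytes;        // fixed at startup, uniform per connection
    }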
The server and client both use common for their network layer, and will need to be updated to use the new interface. They will need to pass through the configured socket options.
When processing incoming data, one frame at a time will be processed, and each frame's processing will be completed before the next one is read. No other data structures will be used to hold unprocessed frames.
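In sketch form, reusing the hypothetical NetworkTransport above, the read loop would be no more than:

    class FrameProcessor
    {
        private final NetworkTransport transport;
        private final FrameHandler handler;

        FrameProcessor(NetworkTransport transport, FrameHandler handler)
        {
            this.transport = transport;
            this.handler = handler;
        }

        void processIncoming()
        {
            QpidByteBuffer frame;
            while ((frame = transport.receive()) != null)
            {
                // Finish this frame entirely before reading the next; if we
                // fall behind, the bounded buffer and ultimately TCP flow
                // control push back on the sender.
                handler.handle(frame);
            }
        }
    }

    interface FrameHandler
    {
        void handle(QpidByteBuffer frame);
    }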
The server will need to be substantively modified to push the MINA specific parts into the appropriate plugin.
Breakdown of Component Parts
- Common
- The interface for Common needs to be designed, implemented and tested.
- Timer support for heartbeats
- "I'm full, hold on" support
- Server
- Existing code needs to be refactored to remove dependence on MINA
- Existing code ported to use new common interface
- (Management functionality added to JMX interface - UI changes?) - RG
- Configuration to select a transport (one possible loading approach is sketched after this breakdown)
- Client
- Existing code needs to be refactored to remove dependence on MINA
- Existing code ported to use new common interface
- Changes to send() semantics need to be considered and documented
- Configuration to select a transport
- Tests
- Representative workload tests need to be developed and put into perftests.
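For the transport selection and plugin loading items above, one plausible (purely illustrative) approach is java.util.ServiceLoader, with each plugin shipping a factory discovered at startup:

    import java.util.ServiceLoader;

    // Implementations are listed in META-INF/services and discovered at
    // startup; the user-selected name picks one of them.
    interface TransportFactory
    {
        String name();                   // e.g. "mina" or "io" (illustrative)
        NetworkTransport newTransport();
    }

    class TransportSelector
    {
        static NetworkTransport select(String requested)
        {
            for (TransportFactory factory : ServiceLoader.load(TransportFactory.class))
            {
                if (factory.name().equals(requested))
                {
                    return factory.newTransport();
                }
            }
            throw new IllegalArgumentException("Unknown transport: " + requested);
        }
    }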
Testing
Testing under load and handling of error conditions (unexpected disconnection etc.) will need to be carried out. New load tests which accurately simulate application workloads need to be developed so that we can provide accurate configuration guidance. This testing needs to be carried out on Windows, Linux and Solaris, in all permutations of client and server, as the OS-level network behaviour of these platforms differs significantly.
New unit tests will need to be written to cover the transport plugins and the new interfaces. Existing test coverage in this area is minimal.
Impact
There is a potential effect upon performance; we will need to measure this once the work has been implemented to quantify what effect, if any, it has had.
Compatibility / Migration Implications
- Older clients connected to a new broker may suffer OOM when TCP flow control kicks in. This seems preferable to the broker suffering OOM.
- Clients which upgrade their library may experience a change in the behaviour of the send() method, since it may now block if the client's network buffer is full. This needs to be appropriately communicated.
Risks
- MINA is quite deeply embedded in the server and will require some work to excise it fully. This is somewhat mitigated by the decision to import mina.ByteBuffer and continue using that.
- Differences in the behaviour of the transport layer may expose other bugs in the broker which were previously hidden.
- Inadequate test coverage, in particular the lack of representative application workloads in the performance test suite.