On existing broker failover - can you point me to where that behavior is
documented?  Because neither I nor anyone on the four teams I work with
has come across the functionality you describe.  I've never seen a client
fail over to another broker, only code that attempts to reconnect.  The
basic features we need:
- externally adjustable retry / timeout on connections, to handle the
differences between LAN, WAN, and satellite internet (roughly the kind of
knobs sketched just below this list).
- updating the broker list: how do you do this?  I've never seen it...
- to prevent network splits, how are recovered brokers monitored?  When a
failed broker recovers, do clients switch back?  How often / how
aggressively is this checked?
- how is the application notified of broker failure, connection failover,
and recovery?
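
To make the first item concrete, this is roughly what we'd want to be able
to tune from a per-site config file - the option names here are only
illustrative, since I haven't found any of this documented:

    #include <qpid/messaging/Connection.h>

    using qpid::messaging::Connection;

    // Hypothetical per-site tuning, read from config rather than hard-coded.
    // A satellite site needs much longer intervals/timeouts than a LAN site.
    Connection connection(
        "broker-a:5672",
        "{reconnect: true,"
        " reconnect_interval_min: 1,"   // seconds between retry attempts
        " reconnect_interval_max: 60,"  // back off this far on bad links
        " reconnect_timeout: 300,"      // give up (and tell the app) after this
        " heartbeat: 10}");             // how quickly dead links are noticed
    connection.open();
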
Finally, we ended up with LOTS of application complexity in SOA code
because broker failure / recovery meant connection, sender, and receiver
objects had to be recreated.  This was compounded by Connection being a
different type of Boost object than senders and receivers.  We ended up
building a pile of code that we were using everywhere to handle this, and
eventually collapsed it into a layer that just exposes send(queue name,
message) and listen(queue name, callback) with all of that functionality
underneath, plus a callback for status messages (broker failure, connection
failover, recovery).
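
Stripped down, the interface of that layer looks roughly like this (a
simplified sketch of its shape, not the actual code):

    #include <functional>
    #include <string>

    // Status events surfaced to the application instead of raw Qpid
    // exceptions and error returns.
    enum class BusStatus { BrokerFailed, FailedOver, Recovered };

    // Facade that owns all Qpid connection/session/sender/receiver objects
    // and rebuilds them itself when a broker dies, so applications never do.
    class MessageBus {
    public:
        void send(const std::string& queue, const std::string& message);
        void listen(const std::string& queue,
                    std::function<void(const std::string&)> onMessage);
        void onStatus(std::function<void(BusStatus)> callback);
    };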

On connection failure detection, I'll dig up my notes and send them to the
list; hopefully this has been cleaned up since 0.14.

And is there anything you can think of for dynamically load balancing across brokers?

Greatly appreciate the feedback and input...

Kerry


On Fri, Jun 14, 2013 at 4:09 AM, Gordon Sim <g...@redhat.com> wrote:

> On 06/13/2013 02:37 PM, Kerry Bonin wrote:
>
>> I'm the system architect for a large program that has just shipped a major
>> product, and Qpid is one of the foundations of our infrastructure. I
>> thought I'd share a few things, and ask a few questions about my next
>> step...
>>
>
> Thanks for taking the time Kerry, it's always great to get feedback!
>
>
>>   We shipped using Qpid 0.14, and am updating now. Our product is
>> (currently) 100% Windows, although most subsystems are implemented in
>> cross-platform C++ or Python. We are essentially a video surveillance and
>> access control system, so we have high volumes of events - low millions
>> per day, in bursts up into the low hundreds per second. In addition to
>> event transport, our infrastructure is an ESB-model SOA built over Qpid.
>> We have high reliability requirements – no single points of failure, fast
>> failover and recovery, active-active services, load balancing, and
>> encryption throughout.
>>
>>   Our biggest challenges came from the large reliability gap between the
>> Windows and *nix implementations. We've contributed all of our fixes back,
>> but under high load we had issues that burnt a few man-months of senior
>> developer time. It's now running reasonably well for us.
>>
>>   Broker failover was a challenge. We ended up building a wrapper over the
>> Qpid client library to abstract all the connection, sender, receiver, etc.
>> objects, so applications didn't have to deal with tearing down and building
>> up these objects when a broker died. (Which happened often under high load
>> in early testing, and still happens often in our worst-case testbeds, where
>> we take brokers down or break connections to test reliability.) Since
>> federation didn't work on Windows and system splits were unacceptable, we
>> also had to implement failover recovery.  This also required distributing
>> and maintaining an ordered list of brokers. Again, this was a pain, but it
>> is now working. It would be nice if the client handled these things for us.
>>
>
> The client does have the ability to reconnect and re-establish the
> sessions, senders and receivers automatically. A list of brokers can also be
> provided and updated. There is additionally a utility that subscribes to
> the 'failover exchange' to get updates. That latter mechanism could be
> modified or copied for any sort of messaging-based distribution of updates.
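>
> For example (option names from memory, so please check them against the
> current messaging API docs before relying on them):
>
>     #include <qpid/messaging/Connection.h>
>     #include <qpid/messaging/FailoverUpdates.h>
>
>     using namespace qpid::messaging;
>
>     // Reconnect automatically, trying the listed brokers in turn.
>     Connection connection("broker-a:5672",
>         "{reconnect: true, reconnect_urls: 'broker-b:5672, broker-c:5672'}");
>     connection.open();
>
>     // Keep the broker list updated from the broker's failover exchange.
>     FailoverUpdates updates(connection);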
>
> Can you give a bit more detail on what is missing or doesn't work as
> required?
>
>
>    Another serious challenge – can someone (if they haven't already) clean
>> up
>> the error propagation from the client to the application? Someone who was
>> trying to recreate our broker failure detection asked me how we did it
>> reliably – I had to give them a LIST of exceptions and error returns that
>> we had to discover by trial and error to do this 100% of the time. This
>> should be simple, and it isn't...
>>
>
> Can you give a bit more detail on this? I know at one point there were
> some cases where a ConnectionException would get thrown when first failing
> to connect (rather than a TransportFailure as expected), but I believe that
> was fixed already.
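>
> The intent, at least, is that a detection loop should only need to handle
> the one exception type - something like this rough sketch (untested, and
> worth verifying against the current exception hierarchy):
>
>     #include <qpid/messaging/Connection.h>
>     #include <qpid/messaging/exceptions.h>
>     #include <iostream>
>
>     using namespace qpid::messaging;
>
>     int main() {
>         Connection connection("broker-a:5672",
>                               "{reconnect: true, reconnect_limit: 5}");
>         try {
>             connection.open();   // retries until the reconnect limit is hit
>         } catch (const TransportFailure& e) {
>             // Connection-level failures should all surface here.
>             std::cerr << "broker unreachable: " << e.what() << std::endl;
>             return 1;
>         }
>         connection.close();
>         return 0;
>     }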
>
>
>>   For our next release, we need to dramatically increase the volume of
>> messages we handle. In lieu of federation on Windows (when will that
>> work?), I'm facing having to add code to manage pools of brokers and
>> dynamically load balance queues across brokers - once a broker is
>> approaching 80% load, or dies, I need to move queues to other brokers and
>> coordinate those moves via my wrapper.  I can do this, but I'd rather not.
>> Does anyone have any other suggestions on how to handle 10M messages a
>> second on (many) Windows boxes, spread across lots of queues with dynamic
>> and unpredictably varying loads, that works with a fast failover and
>> failover recovery mechanism?
>>
>>
>> Kerry
>>
>>
>
