On Wed, 2009-11-18 at 13:39 -0500, Carl Trieloff wrote:
> Alan Conway wrote:
> > At the moment a clustered broker stalls clients while it is
> > initializing, giving or receiving an update. It's been pointed out
> > that this can result in long delays for clients connected to a broker
> > that elects to give an update to a newcomer, and it might be better
> > for the broker to disconnect clients so they can fail over to another
> > broker not busy with an update.
> >
> > There are 3 cases to consider:
> >
> > - new member joining/getting update, new client: stall or reject?
> > - established member giving update, new client: stall or reject?
> > - established member giving update, connected client: stall or disconnect?
> >
> > On the 3rd point I would note that it's possible for clients to
> > disconnect themselves if the broker is unresponsive by using
> > heartbeats, and that not all clients can fail over, so I'd lean
> > towards stall on that one, but I think rejecting new clients may make
> > sense here.
> >
> > Part of the original motivation for stalling is that it makes it easy
> > to write tests. You can start a broker and immediately start a client
> > without worrying about waiting till the broker is ready. That's a
> > nice property, but there are other ways to achieve it. Currently
> > qpidd -d returns as soon as the broker is ready to listen for TCP
> > requests, which may be before the broker has joined the cluster. We
> > could change that behavior to wait till all plugins report "ready".
> > For tests we could also grep the log output for the ready message.
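The grep-the-log approach above could look something like this in a test script. This is only a sketch: the log file path and the exact wording of the broker's "ready" log line are assumptions for illustration, not the actual qpidd output.

```shell
#!/bin/sh
# Sketch: wait for a broker to report readiness by polling its log,
# instead of relying on the broker stalling early clients.
# ASSUMPTIONS: log path and the word "ready" in the log line are
# illustrative, not the real qpidd message text.
LOG="${1:-/tmp/qpidd.log}"

wait_for_ready() {
    log="$1"
    timeout="${2:-30}"          # seconds to wait before giving up
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # Succeed as soon as a "ready" line shows up in the log.
        grep -q "ready" "$log" 2>/dev/null && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Demonstration against a stand-in log file:
echo "2009-11-18 notice Broker (pid=1234) ready" > "$LOG"
wait_for_ready "$LOG" 5 && echo "broker ready"
```

A test harness would call `wait_for_ready` between starting the broker and starting the first client; if the broker's daemonize mode were changed to block until all plugins report ready, this polling step would become unnecessary.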
I think it would be better not to stall existing clients at all in the
"established member giving update" case. In the "new member" case it would
be better not to be available for connections at all until up to date.

It's arguable what the best behaviour is when an established member giving
an update gets a new client. However, I would note that the low-level code
isn't capable of stopping accepting connections and then starting again
once it has started accepting connections, so new clients would have to
connect and then be disconnected with an exception.

I would also suggest that considering the likely number of cluster members
is important here - I'd expect very few installations to run more than 4
machines in a cluster, and 2 is probably the norm. So if a single broker
goes down and restarts, it's going to be brought up to date by the only
other cluster member. In this case, if that member stalls, no more work
can get done until the rejoining member is up to date. I guess this sort
of case can be dealt with in the current scheme by having multiple cluster
members on a single piece of hardware.

Andrew

> > Thoughts appreciated!
>
> I would dis-allow connections to the new broker until it is synced. I
> would not bump any active connections, but rather leave that to
> heartbeat.
>
> One other idea would be to add an option to the cluster config which
> could specify the preferred nodes to update from, and it would try this
> list first. I.e. in a 4 node cluster, all updates are made from node 4
> (preferred) if there, and then from an app point of view I connect to
> nodes 1-3, for example. This way updates have no effect on my clients,
> and if I care about being stalled I set this option. If the preferred
> node(s) are not running it would just pick one as it does today.
>
> Carl.
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
---------------------------------------------------------------------
