On Sun, Apr 18, 2010 at 5:54 AM, Martin Sustrik <sust...@250bpm.com> wrote:
> Brian Candler wrote:
>> On Sun, Apr 18, 2010 at 02:28:04AM -0500, Joe Holloway wrote:
>>> I guess this is an age old problem of fault tolerant service
>>> discovery. The service registry can't know whether or not a service
>>> is reachable until a client has tried to connect.
>>
>> So maybe the solution is for the socket to have a list of endpoints to try,
>> instead of a single one? It could just walk around the list until one
>> connects. If you randomize this list first, then you get load-balancing
>> too.
>>
>> The API docs don't make it clear what happens if you call zmq_connect
>> multiple times on the same socket.
>
> Yes, you're pretty much right.
>
> You can connect your client to multiple instances of a service:
>
> zmq_connect (c, "tcp://svr001.example.com:5555");
> zmq_connect (c, "tcp://svr002.example.com:5555");
> zmq_connect (c, "tcp://svr003.example.com:5555");
>
> What happens is that the requests are load balanced among the servers.

OK, that part is starting to make more sense now.

I'm playing around with putting my service definitions in SRV records
(actually DNS-SD, but same concept). The RFC [1] spells out a load
balancing algorithm for client programs based on the weight and priority
of each record. It's not round robin, but similar. The idea is that the
person registering the service knows the attributes of the hardware it's
running on and can hint to the client that node X has more capacity than
node Y, although both offer the same service.

Just thinking out loud, but is there a way to provide those hints to 0MQ,
or should we just trust that 0MQ is better at dynamic load balancing than
we could hope to be using deployment-time hints?

> When one of them fails, up to ZMQ_HWM requests are queued for it, and once
> the limit (HWM) is reached no more requests will be sent to that
> server. Once it gets back online, queued requests will be sent to it and
> load-balancing starts to dispatch new requests to it automatically.

The part that confuses me is that in a traditional RPC scenario you may
be calling services that have side effects or that aren't idempotent.
You'd want the message to be handled by the first reachable service
provider and no others. Unless I misunderstand what you're saying, it
seems counter-intuitive that RPC-style messages would queue up rather
than fail over to alternate instances, or fail fast when there are no
reachable instances.

I probably need to go back and read the whitepapers again now that I'm
becoming more familiar with the concepts.

Thanks

--
[1] http://tools.ietf.org/html/rfc2782
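
For concreteness, here is a rough sketch in C of the weighted selection step as I read it in RFC 2782 [1]. It is purely client-side and says nothing about how 0MQ balances requests internally. The struct and the srv_* function names are made up for illustration and are not part of any library, and the caller is assumed to supply the random roll.

```c
#include <assert.h>
#include <stddef.h>

/* One SRV record. Field and function names here are illustrative only. */
struct srv_record {
    const char *target;
    unsigned short port;
    unsigned short priority; /* lower value is tried first */
    unsigned short weight;   /* relative share within one priority */
};

/* Lowest priority value present in recs[0..n). */
unsigned short srv_lowest_priority(const struct srv_record *recs, size_t n)
{
    assert(n > 0);
    unsigned short best = recs[0].priority;
    for (size_t i = 1; i < n; i++)
        if (recs[i].priority < best)
            best = recs[i].priority;
    return best;
}

/* Sum of the weights of the records in the lowest-priority group. */
unsigned long srv_weight_sum(const struct srv_record *recs, size_t n)
{
    unsigned short prio = srv_lowest_priority(recs, n);
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (recs[i].priority == prio)
            sum += recs[i].weight;
    return sum;
}

/*
 * Pick one record from the lowest-priority group and return its index.
 * `roll` must be a uniform random integer in [0, srv_weight_sum()];
 * it is passed in so the choice is reproducible.
 */
size_t srv_pick(const struct srv_record *recs, size_t n, unsigned long roll)
{
    unsigned short prio = srv_lowest_priority(recs, n);
    unsigned long running = 0;

    assert(roll <= srv_weight_sum(recs, n));

    /* Zero-weight records go first in the order with running sum 0,
       so only a roll of 0 can select one of them. */
    if (roll == 0)
        for (size_t i = 0; i < n; i++)
            if (recs[i].priority == prio && recs[i].weight == 0)
                return i;

    /* Otherwise take the first record whose running sum reaches the roll. */
    for (size_t i = 0; i < n; i++) {
        if (recs[i].priority != prio || recs[i].weight == 0)
            continue;
        running += recs[i].weight;
        if (running >= roll)
            return i;
    }
    return 0; /* not reached when the contract on `roll` holds */
}
```

With svr001 at weight 60 and svr002 at weight 40 in the same priority group (made-up weights), rolls 0 to 60 pick svr001 and rolls 61 to 100 pick svr002. If the chosen target turns out to be unreachable, the RFC has the client drop it from the list and repeat the pick, moving to the next priority group only when the current one is exhausted.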