I've got an app which works well. Now I need to make it fault tolerant. I have a polling loop where I listen for new workers and maintain a list of them. Before I use a worker, I need to verify that it is still there. I can perform a simple handshake right before I assign the worker a job, but I need to detect properly and quickly when this handshake fails. I can't have a 30 second timeout or an infinite wait in the event of a fault.

If I am calling zmq_send and zmq_recv on what I expect to be a really quick back and forth, can I set a 2 second timeout so I can detect any of the following: the connection has dropped, the network is broken, the worker won't reply right now, or the worker isn't there anymore? Then, after the quick handshake succeeds, I would set the timeout back to large or infinite, perform my normal zmq_send, and resume polling.

Idea: set a socket option to not block, try the send, and retry for 2 seconds if it doesn't complete right away. Because of the infrastructure of zmq, this would only queue up the message. But if I then did the same thing on the following zmq_recv call, I guess that would be an effective timeout. As long as I can properly kill that connection, so that if the worker comes back it will detect that it needs to reconnect, this may work. I don't mind busy looping in the event of a fault (which should be exceedingly rare). I just need a way to verify and recover from faults without halting production for more than two seconds.

Any opinions? This seems to be a weakness of zmq: it is geared entirely toward the normal case, and we have to jump through major hoops to handle problems.

For example, it would help my cause to be able to get the number of outstanding messages on a socket. This would let me detect a problem without blindly sending more messages that won't go out and only make the backlog worse. Then, when my state for a worker says that it's idle, I could send a message and busy loop for up to two seconds waiting for 'Outstanding messages' to reach zero, which could be more graceful than calling zmq_recv. This could also be a great feature for load balancing in other circumstances.

scott
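
P.S. To make the retry idea concrete, here is a rough sketch in C of the "keep trying for 2 seconds" loop. It deliberately contains no zmq calls. The names try_once_fn, retry_until_deadline and fake_worker are made up for illustration, and the callback stands in for whatever single non-blocking send or receive attempt would be made on the real socket. Rather than spinning flat out, it sleeps 1 ms between attempts, which is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one non-blocking attempt at an operation.
 * Returns true when the operation completed, false for "not yet, try again".
 * In a real program this would wrap a single non-blocking socket call. */
typedef bool (*try_once_fn)(void *ctx);

/* Milliseconds on the monotonic clock, so wall-clock changes don't matter. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

/* Call try_once until it succeeds or timeout_ms has elapsed.
 * Returns true on success, false on timeout.
 * Always makes at least one attempt. */
bool retry_until_deadline(try_once_fn try_once, void *ctx, long timeout_ms)
{
    assert(try_once != NULL);
    const long long deadline = now_ms() + timeout_ms;
    const struct timespec pause = { 0, 1000000L }; /* 1 ms between attempts */

    for (;;) {
        if (try_once(ctx))
            return true;
        if (now_ms() >= deadline)
            return false;
        nanosleep(&pause, NULL);
    }
}

/* Stand-in for a worker (illustration only): it "replies" once a given
 * number of attempts has been made. */
struct fake_worker {
    int attempts_until_reply;
    int attempts_made;
};

bool fake_try(void *ctx)
{
    struct fake_worker *w = ctx;
    w->attempts_made++;
    return w->attempts_made >= w->attempts_until_reply;
}
```

For the handshake I would call this once for the send and once for the receive, for example retry_until_deadline(fake_try, &worker, 2000), and treat a false return as "worker is gone". It says nothing about how to tear down the connection afterwards, which is the part I'm still unsure about.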
_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev

