Hi,

continuing with the replication analysis, here is a description of the different use cases we have when it comes to detecting whether the provider is connected or disconnected.
I have identified 6 use cases:

1) The consumer has been stopped because the server it runs on has been shut down

This is a pretty obvious use case: either the server has been stopped properly, and we can cleanly close the connection to the providers, or the server has been brutally killed, in which case the socket might stay up for a while, depending on the underlying OS, but will eventually be closed. Nothing special to do there.

2) The admin stopped a consumer

This is an interesting use case, but in 2.0 we don't handle it. An admin might want to shut down a consumer, or restart it, because the configuration has changed. In 2.0, we won't support dynamic configuration, so that ends with a server restart. Cf. use case 1.

3) The provider has cleanly disconnected

The consumer will receive a disconnection notice for the associated consumers, which will stop processing the incoming data (as we won't get any more) and switch to a connection polling thread. We will try to connect back every N seconds, until the provider is back.

4) The connection is closed because we haven't received any message for longer than the socket inactivity delay

We will receive a disconnection notification, and the consumer will exit and try to reconnect after a delay. This is very similar to (3). We can do better: have a separate thread that polls the various providers periodically, keeping the socket open.

5) The provider has brutally disconnected

We won't be informed of such a disconnection. The RefreshOnly replication will be able to detect it, because it periodically tries to contact the remote peer, but the Refresh&Persist replication is just waiting for incoming messages, which it won't receive any more. If we have the thread described in (4), we can detect such a case.

6) We got an exception during the replication

This is a special case, as we are not supposed to get any exception there. But still, shit happens.
I suggest we stop the consumer, disconnect it, and then try to reconnect.

I think those 6 use cases cover all the possibilities, and the proposed solutions are OK, but feel free to comment!

-- 
Regards, Cordialement,
Emmanuel Lécharny
www.iktek.com
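
PS: to make the polling thread idea from (4) and (5) a bit more concrete, here is a rough sketch in Java. This is not the actual server code: ProviderMonitor, Probe and Listener are names I made up for the illustration, and the way a provider is actually contacted is left behind the hypothetical Probe hook. The only point is to show one thread checking every provider periodically and reporting state changes, with an exception during the check handled like a lost connection, as suggested in (6).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only: every name here is hypothetical. */
class ProviderMonitor {
    /** Hypothetical hook: returns true if the provider answered the check. */
    interface Probe {
        boolean isAlive(String providerId);
    }

    /** Hypothetical hook: notified when a provider changes state. */
    interface Listener {
        void onDisconnected(String providerId);
        void onReconnected(String providerId);
    }

    // Last known state of each provider (true = connected).
    private final Map<String, Boolean> connected = new ConcurrentHashMap<>();
    private final Probe probe;
    private final Listener listener;
    private ScheduledExecutorService scheduler;

    ProviderMonitor(Probe probe, Listener listener) {
        this.probe = probe;
        this.listener = listener;
    }

    /** Registers a provider, assumed to be connected at registration time. */
    void addProvider(String providerId) {
        connected.put(providerId, Boolean.TRUE);
    }

    boolean isConnected(String providerId) {
        return Boolean.TRUE.equals(connected.get(providerId));
    }

    /** One polling pass over all the registered providers. */
    void checkOnce() {
        for (Map.Entry<String, Boolean> entry : connected.entrySet()) {
            String id = entry.getKey();
            boolean wasAlive = entry.getValue();
            boolean alive;
            try {
                alive = probe.isAlive(id);
            } catch (RuntimeException e) {
                // Use case 6: an unexpected exception is treated as a lost connection.
                alive = false;
            }
            if (wasAlive && !alive) {
                connected.put(id, Boolean.FALSE);
                listener.onDisconnected(id);
            } else if (!wasAlive && alive) {
                connected.put(id, Boolean.TRUE);
                listener.onReconnected(id);
            }
        }
    }

    /** Starts the polling thread: one pass every periodSeconds ("every N seconds"). */
    synchronized void start(long periodSeconds) {
        if (scheduler != null) {
            return; // already started
        }
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallowed on purpose: an uncaught exception would cancel the schedule.
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    /** Stops the polling thread, if it was started. */
    synchronized void stop() {
        if (scheduler != null) {
            scheduler.shutdownNow();
            scheduler = null;
        }
    }
}
```

With something like this, the Refresh&Persist consumers would no longer have to notice a dead provider by themselves: onDisconnected would be the place to stop the consumer and disconnect it, and onReconnected the place to restart it. What the check actually sends to the remote peer is still an open question.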