Hi Flavio, thanks for the reply.
These are the config parameters:

tickTime=2000
initLimit=10
syncLimit=5

Data size is:

-rw-r--r-- 1 xoom xoom 85K Oct  3 10:16 snapshot.170001badb
-rw-r--r-- 1 xoom xoom 65M Oct  3 10:16 log.170000da7f
-rw-r--r-- 1 xoom xoom 88K Oct 13 23:30 snapshot.170002bfcd
-rw-r--r-- 1 xoom xoom 65M Oct 13 23:30 log.170001badd
-rw-r--r-- 1 xoom xoom 65M Oct 14 11:36 log.170002bfcf

These are all defaults and I have never deleted any log files. Do you still
think this could be due to syncLimit or the snapshot being large? (A rough
sketch of what I would change is at the bottom of this mail, below the
quoted thread.)

On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <[email protected]> wrote:

>
>
> On 14 Oct 2015, at 17:22, elastic search <[email protected]> wrote:
> >
> > The link between the AWS and the local DataCenter was down for around 2
> > minutes.
> > I have been running ping continuously from the DataCenter to the AWS and
> > that wasn't responding for a few minutes.
> >
> > Are you saying since we see Send Notification messages in the logs, that
> > would mean the Observers are able to connect to the ZK.
> > Only that the ZK server leader is not able to respond back?
>
> Yeah, the logs indicate that the observer is being able to connect ("There
> is a connection already for server..." in the logs) but isn't able to make
> progress, which implies that it isn't receiving responses.
>
> >
> > This is what I see from the Server logs
> > 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
> > 2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
> >
>
> These lines belong to different parts of the code path. The first actually
> indicates that the observer found the leader, but could not keep going. For
> this particular time period, it could be that the observer is trying to
> sync up with the leader, but the value of syncLimit is too low and the
> leader is abandoning the observer. One reason for that to happen is the
> snapshot being large (or your available bandwidth being low).
>
> The second line is from the part of the code that does connections for
> leader election. That one also says that the connection broke.
>
> -Flavio
>
> > On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <[email protected]> wrote:
> >
> >> Can you tell why the server wasn't responding to the notifications from
> >> the observer? The log file is from the observer and it sounds like it is
> >> being able to send messages out, but it isn't clear why the server isn't
> >> responding.
> >>
> >> -Flavio
> >>
> >>> On 14 Oct 2015, at 01:51, elastic search <[email protected]> wrote:
> >>>
> >>> Hello Experts
> >>>
> >>> We have 2 Observers running in AWS connecting over to a local ZK Ensemble
> >>> in our own DataCenter.
> >>>
> >>> There have been instances where we see network drops for a minute between
> >>> the networks.
> >>> However the Observers take around 15 minutes to recover even if the
> >>> network outage is for a minute.
> >>>
> >>> From the logs
> >>> java.net.SocketTimeoutException: Read timed out
> >>> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
> >>> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
> >>> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
> >>> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
> >>> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
> >>> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
> >>> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
> >>> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
> >>> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>> 2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
> >>>
> >>> And then it finally exits the QuorumCnxManager run loop with the following
> >>> message
> >>> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
> >>>
> >>> How can we ensure the observer does not go out of service for such a long
> >>> duration?
> >>>
> >>> Attached the full logs
> >>>
> >>> Please help
> >>> Thanks
> >>>
> >>> <zookeeper.log.zip>
> >>
> >>
> >
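
P.S. In case syncLimit does turn out to be the culprit, this is roughly the
change I have in mind; please correct me if I am misreading the semantics.
It is only a sketch with illustrative, untested values (30 and 10 below are
guesses, not recommendations). With tickTime=2000, the current syncLimit=5
gives the leader only 10 seconds of slack for the observer on the WAN link:

# zoo.cfg sketch -- values are illustrative only
tickTime=2000
# initLimit: ticks an observer/follower may take to connect to the leader
# and complete the initial sync (30 * 2000 ms = 60 s here)
initLimit=30
# syncLimit: ticks an observer/follower may lag behind the leader before
# the leader gives up on it (10 * 2000 ms = 20 s here)
syncLimit=10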
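
P.P.S. For anyone else following the thread: a quick way to watch for the
observer coming back into service (just a suggestion; "observer-host" below
is a placeholder for our AWS observer) is to poll the "srvr" four-letter
command from the DataCenter side:

echo srvr | nc observer-host 2181

While the observer is stuck in the election/sync loop, the command either
fails to connect or reports that the instance is not currently serving
requests; once it has rejoined, the output includes a "Mode: observer" line
again. Running it in a loop with a timestamp gives a rough measure of how
long the observer was actually out.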
