We face a similar issue with remote observers connecting to a 3-node cluster. Using netstat we can see traffic accumulating in the TCP Send-Q, and we believe this is a TCP timeout issue.
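Our working theory, for what it's worth: data sitting in Send-Q means the remote peer is not acknowledging it, so the kernel just keeps retransmitting until it hits its retry limit. If I remember right, the Linux default (net.ipv4.tcp_retries2 = 15) only aborts such a connection after roughly 15-30 minutes, which would line up with the ~15-minute recovery described in the thread below. Until the OS gives up, a blocking read on that socket never fails, so the dead peer still looks "connected" to the application.

Below is a minimal, self-contained sketch (plain JDK, nothing to do with ZooKeeper's own code; the address and port are just placeholders standing in for an election endpoint) of what an explicit application-level read timeout buys you: the read fails after a bounded wait instead of hanging until the kernel finally tears the connection down.

    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class ReadTimeoutProbe {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket();
            // Bound the connect as well, so an unreachable peer fails fast.
            s.connect(new InetSocketAddress("10.10.4.46", 3888), 5000);
            // Give up on a read after 10 s instead of waiting for the OS to abort.
            s.setSoTimeout(10000);
            try (InputStream in = s.getInputStream()) {
                int b = in.read();            // blocks at most 10 s
                System.out.println("read byte: " + b);
            } catch (SocketTimeoutException e) {
                System.out.println("peer silent for 10 s, treating the connection as dead");
            } finally {
                s.close();
            }
        }
    }

I have also added a couple of back-of-envelope notes at the bottom of this mail, below the quoted thread.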
Bob

On 10/14/15, 2:54 PM, "elastic search" <[email protected]> wrote:

>When it outputs the message "There is a connection already for server", is it really a valid connection, or is it just using the existing in-memory map to decide whether a connection already exists?
>If that is the case, it should empty this map and initiate a connection when it is LOOKING for a leader after a connection drop. Right now it waits 15 minutes before it does that and re-initiates a connection.
>
>    if (senderWorkerMap.get(sid) == null) {
>    } else {
>        LOG.debug("There is a connection already for server " + sid);
>
>If the ping to the server is not working, how is it able to send notifications? The logs say "Sending notifications", but the ping is not working at that time.
>
>I am not sure how to decrease the 15-minute wait period before the Observer re-initiates a connection, or alternatively how to make this work.
>
>On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <[email protected]> wrote:
>
>> The snapshot is really small, so it is unlikely to be it, unless the connection is really crappy.
>>
>> -Flavio
>>
>> > On 14 Oct 2015, at 19:38, elastic search <[email protected]> wrote:
>> >
>> > Hi Flavio
>> >
>> > Thanks for the reply.
>> >
>> > These are the config parameters:
>> > tickTime=2000
>> > initLimit=10
>> > syncLimit=5
>> >
>> > Data size is:
>> > -rw-r--r-- 1 xoom xoom 85K Oct 3 10:16 snapshot.170001badb
>> > -rw-r--r-- 1 xoom xoom 65M Oct 3 10:16 log.170000da7f
>> > -rw-r--r-- 1 xoom xoom 88K Oct 13 23:30 snapshot.170002bfcd
>> > -rw-r--r-- 1 xoom xoom 65M Oct 13 23:30 log.170001badd
>> > -rw-r--r-- 1 xoom xoom 65M Oct 14 11:36 log.170002bfcf
>> >
>> > These are all defaults, and I have never deleted any log files.
>> > Do you still think this could be due to syncLimit or the snapshot being large?
>> >
>> > On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <[email protected]> wrote:
>> >
>> >>
>> >>> On 14 Oct 2015, at 17:22, elastic search <[email protected]> wrote:
>> >>>
>> >>> The link between AWS and the local DataCenter was down for around 2 minutes.
>> >>> I have been running ping continuously from the DataCenter to AWS, and it wasn't responding for a few minutes.
>> >>>
>> >>> Are you saying that since we see Send Notification messages in the logs, the Observers are able to connect to ZK, and it is only that the ZK leader is not able to respond back?
>> >>
>> >> Yeah, the logs indicate that the observer is able to connect ("There is a connection already for server..." in the logs) but isn't able to make progress, which implies that it isn't receiving responses.
>> >>
>> >>> This is what I see in the server logs:
>> >>> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
>> >>> 2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
>> >>
>> >> These lines belong to different parts of the code path. The first actually indicates that the observer found the leader, but could not keep going. For this particular time period, it could be that the observer is trying to sync up with the leader, but the value of syncLimit is too low and the leader is abandoning the observer.
>> >> One reason for that to happen is the snapshot being large (or your available bandwidth being low).
>> >>
>> >> The second line is from the part of the code that does connections for leader election. That one also says that the connection broke.
>> >>
>> >> -Flavio
>> >>
>> >>> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <[email protected]> wrote:
>> >>>
>> >>>> Can you tell why the server wasn't responding to the notifications from the observer? The log file is from the observer, and it sounds like it is able to send messages out, but it isn't clear why the server isn't responding.
>> >>>>
>> >>>> -Flavio
>> >>>>
>> >>>>> On 14 Oct 2015, at 01:51, elastic search <[email protected]> wrote:
>> >>>>>
>> >>>>> Hello Experts,
>> >>>>>
>> >>>>> We have 2 Observers running in AWS connecting over to a local ZK ensemble in our own DataCenter.
>> >>>>>
>> >>>>> There have been instances where we see the network drop between the two networks for a minute. However, the Observers take around 15 minutes to recover, even if the network outage lasts only a minute.
>> >>>>>
>> >>>>> From the logs:
>> >>>>> java.net.SocketTimeoutException: Read timed out
>> >>>>> 2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
>> >>>>> 2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
>> >>>>> 2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
>> >>>>> 2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
>> >>>>> 2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
>> >>>>> 2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
>> >>>>> 2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
>> >>>>> 2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
>> >>>>> 2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>> >>>>> (the same "Notification time out: 60000" line then repeats once a minute, from 22:28:45,936 through 22:41:45,949)
>> >>>>>
>> >>>>> And then it finally exits the QuorumCnxManager run loop with the following message:
>> >>>>> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
>> >>>>>
>> >>>>> How can we ensure the observer does not go out of service for such a long duration?
>> >>>>>
>> >>>>> Attached the full logs.
>> >>>>>
>> >>>>> Please help
>> >>>>> Thanks
>> >>>>>
>> >>>>> <zookeeper.log.zip>
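A back-of-envelope note on the settings quoted above: with tickTime=2000, syncLimit=5 gives the leader a window of 5 x 2000 ms = 10 s for a connected learner to stay caught up, and initLimit=10 gives 10 x 2000 ms = 20 s for the initial connect-and-sync. With snapshots under 100 KB those windows look generous, which is consistent with Flavio's point that snapshot size is unlikely to be the problem unless the link itself is effectively dead for longer than that.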
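A back-of-envelope note on the observer log above: the notification timeout visibly starts at 400 ms, doubles on every round that produces no answer, and is capped at 60000 ms. The little program below is not ZooKeeper code; it just adds up the waits the log itself shows, and it makes clear how a couple of dozen silent rounds already account for roughly the 15 minutes the observer spends looking for a leader (22:26:03 to 22:41:45 in the log).

    public class NotificationBackoffMath {
        public static void main(String[] args) {
            // Start at 400 ms, double after every silent round, cap at 60 s:
            // the same progression that shows up in the quoted observer log.
            int timeoutMs = 400;
            final int capMs = 60000;
            long totalMs = 0;
            for (int round = 1; round <= 22; round++) {
                System.out.printf("round %2d: notification time out: %d%n", round, timeoutMs);
                totalMs += timeoutMs;
                timeoutMs = Math.min(timeoutMs * 2, capMs);
            }
            // 8 doubling rounds (~102 s) + 14 capped rounds (840 s) is about 942 s, ~15.7 min
            System.out.printf("elapsed after 22 silent rounds: ~%d s (~%.1f min)%n",
                    totalMs / 1000, totalMs / 60000.0);
        }
    }

So as long as the old connection is still sitting in the sender map and nothing on the other end answers, the observer seems bound to sit in this loop for something on the order of 15 minutes before the "Connection broken" event clears the way for a fresh connection, which would match both the original poster's logs and what we see with our Send-Q buildup.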
