Thanks Bob and Flavio for your replies. I am considering opening a JIRA ticket for this issue, unless somebody sees that this is not an Observer issue.
On Wed, Oct 14, 2015 at 3:20 PM, Bob Sheehan <[email protected]> wrote:

> To be more specific: after leader election, the old observer connections to the previous leader are stale (the send-q accumulates), but the TCP timeout after 15 minutes terminates these old connections and allows a fresh TCP connection to the new leader.
>
> Bob

On 10/14/15, 3:17 PM, "Bob Sheehan" <[email protected]> wrote:

> We face a similar issue with remote observers connecting to a 3-node cluster. Using netstat we see send-q TCP traffic accumulating. We believe this is a TCP timeout issue.
>
> Bob
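If it is the kernel's retransmission timeout at work, the 15-minute figure matches the Linux default of net.ipv4.tcp_retries2 = 15, which keeps retransmitting for about 924 seconds (roughly 15 minutes) before dropping a connection whose send-q is never acknowledged. A hedged, Linux-specific sketch of shortening that window; note it affects every TCP connection on the host, so it would need careful testing:

    # Linux only: lowers the retransmission give-up threshold system-wide.
    # 5 retries gives up after seconds rather than ~15 minutes.
    sysctl -w net.ipv4.tcp_retries2=5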
On 10/14/15, 2:54 PM, "elastic search" <[email protected]> wrote:

> When it outputs the message "There is a connection already for server", is that really a valid connection, or is it just using the existing in-memory map to decide whether a connection already exists? If it is the latter, it should empty this map and initiate a connection as soon as it is LOOKING for a leader after a connection drop. Right now it waits 15 minutes before it does that and re-initiates a connection.
>
>     if (senderWorkerMap.get(sid) == null) {
>         // ... opens a new connection to sid ...
>     } else {
>         LOG.debug("There is a connection already for server " + sid);
>     }
>
> If the ping to the server is not working, how is it able to send notifications? The logs say Sending notifications, but the ping is not working at that time.
>
> I am not sure how to decrease the 15-minute wait period before the Observer re-initiates a connection, or alternatively how to make this work.
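A minimal, self-contained sketch of the eviction behavior being asked about here; the class and method names are hypothetical, not ZooKeeper's actual QuorumCnxManager API:

    import java.io.IOException;
    import java.net.Socket;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch of a per-server connection cache that can be
    // flushed when leader election restarts, so the next notification
    // opens a fresh TCP connection instead of queueing on a stale one.
    public class ConnectionCache {
        private final ConcurrentMap<Long, Socket> connections =
                new ConcurrentHashMap<>();

        // Reuse a cached connection if one looks usable; otherwise dial anew.
        public Socket connect(long sid, String host, int port) throws IOException {
            Socket existing = connections.get(sid);
            if (existing != null && !existing.isClosed()) {
                return existing; // "There is a connection already for server sid"
            }
            Socket fresh = new Socket(host, port);
            connections.put(sid, fresh);
            return fresh;
        }

        // Called when the peer enters LOOKING: close and drop every cached
        // connection so a stale socket cannot be picked up again.
        public void evictAll() {
            for (Long sid : connections.keySet()) {
                Socket s = connections.remove(sid);
                if (s != null) {
                    try {
                        s.close();
                    } catch (IOException ignored) {
                        // best effort: the socket may already be dead
                    }
                }
            }
        }
    }

Whether it is safe for ZooKeeper itself to close the cached connections when a peer re-enters LOOKING is presumably the question a JIRA ticket would need to settle.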
On Wed, Oct 14, 2015 at 12:50 PM, Flavio Junqueira <[email protected]> wrote:

> The snapshot is really small, so it is unlikely to be it, unless the connection is really crappy.
>
> -Flavio

On 14 Oct 2015, at 19:38, elastic search <[email protected]> wrote:

> Hi Flavio
>
> Thanks for the reply.
>
> These are the config parameters:
>
>     tickTime=2000
>     initLimit=10
>     syncLimit=5
>
> Data size is:
>
>     -rw-r--r-- 1 xoom xoom 85K Oct  3 10:16 snapshot.170001badb
>     -rw-r--r-- 1 xoom xoom 65M Oct  3 10:16 log.170000da7f
>     -rw-r--r-- 1 xoom xoom 88K Oct 13 23:30 snapshot.170002bfcd
>     -rw-r--r-- 1 xoom xoom 65M Oct 13 23:30 log.170001badd
>     -rw-r--r-- 1 xoom xoom 65M Oct 14 11:36 log.170002bfcf
>
> These are all defaults, and I have never deleted any log files. Do you still think this could be due to syncLimit or to the snapshot being large?

On Wed, Oct 14, 2015 at 9:44 AM, Flavio Junqueira <[email protected]> wrote:

> On 14 Oct 2015, at 17:22, elastic search <[email protected]> wrote:
>
>> The link between AWS and the local DataCenter was down for around 2 minutes. I have been running ping continuously from the DataCenter to AWS, and that wasn't responding for a few minutes.
>>
>> Are you saying that, since we see Send Notification messages in the logs, the Observers are able to connect to the ZK, only that the ZK server leader is not able to respond back?
>
> Yeah, the logs indicate that the observer is able to connect ("There is a connection already for server..." in the logs) but isn't able to make progress, which implies that it isn't receiving responses.
>
>> This is what I see in the server logs:
>>
>>     2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
>>     2015-10-09 16:19:28,160 [myid:3] - WARN [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id 5, my id = 3, error =
>
> These lines belong to different parts of the code path. The first actually indicates that the observer found the leader but could not keep going. For this particular time period, it could be that the observer is trying to sync up with the leader, but the value of syncLimit is too low and the leader is abandoning the observer. One reason for that to happen is the snapshot being large (or your available bandwidth being low).
>
> The second line is from the part of the code that handles connections for leader election. That one also says that the connection broke.
>
> -Flavio
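For reference: syncLimit is measured in ticks, so the settings quoted above give a learner only 5 * 2000 ms = 10 s to sync with the leader before being abandoned. A hedged sketch of adding headroom; the value 30 is illustrative, not a tested recommendation:

    tickTime=2000
    initLimit=10
    # syncLimit is in ticks: 30 * 2000 ms = 60 s of sync headroom
    # instead of the current 10 s (illustrative value only)
    syncLimit=30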
On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <[email protected]> wrote:

> Can you tell why the server wasn't responding to the notifications from the observer? The log file is from the observer, and it sounds like the observer is able to send messages out, but it isn't clear why the server isn't responding.
>
> -Flavio

On 14 Oct 2015, at 01:51, elastic search <[email protected]> wrote:

> Hello Experts
>
> We have 2 Observers running in AWS connecting to a local ZK ensemble in our own DataCenter. There have been instances where we see a network drop of a minute between the networks. However, the Observers take around 15 minutes to recover even if the network outage lasts only a minute.
>
> From the logs:
>
>     java.net.SocketTimeoutException: Read timed out
>     2015-10-13 22:26:03,927 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400
>     2015-10-13 22:26:04,328 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 800
>     2015-10-13 22:26:05,129 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 1600
>     2015-10-13 22:26:06,730 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 3200
>     2015-10-13 22:26:09,931 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 6400
>     2015-10-13 22:26:16,332 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 12800
>     2015-10-13 22:26:29,133 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
>     2015-10-13 22:26:54,734 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 51200
>     2015-10-13 22:27:45,935 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:28:45,936 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:29:45,937 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:30:45,938 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:31:45,939 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:32:45,940 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:33:45,941 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:34:45,942 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:35:45,943 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:36:45,944 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:37:45,945 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:38:45,946 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:39:45,947 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:40:45,948 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>     2015-10-13 22:41:45,949 [myid:4] - INFO [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 60000
>
> And then it finally exits the QuorumCnxManager run loop with the following message:
>
>     WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection broken for id 2
>
> How can we ensure the observer does not go out of service for such a long duration?
>
> Attached the full logs.
>
> Please help
> Thanks
>
> <zookeeper.log.zip>
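For reference, a minimal runnable sketch of the back-off pattern visible in the log above: the notification timeout doubles on each failed round and is capped at 60000 ms, which is why the observer ends up retrying only once a minute. The names and the loop bound are illustrative, not ZooKeeper's actual code:

    // Prints the same 400 -> 800 -> ... -> 51200 -> 60000 series as the log.
    public class BackoffDemo {
        public static void main(String[] args) {
            int timeout = 400;      // first value printed in the log
            final int cap = 60000;  // ceiling visible from 22:27:45 onwards
            for (int round = 0; round < 12; round++) {
                System.out.println("Notification time out: " + timeout);
                // Double the wait on each failed round, but never past the cap.
                timeout = Math.min(timeout * 2, cap);
            }
        }
    }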
