> On 14 Oct 2015, at 17:22, elastic search <[email protected]> wrote:
>
> The link between the AWS and the local DataCenter was down for around 2
> minutes.
> I have been running ping continuously from the DataCenter to the AWS and
> that wasn't responding for a few minutes.
>
> Are you saying that, since we see Send Notification messages in the logs,
> the Observers are able to connect to ZK, and only the ZK leader is unable
> to respond back?
Yeah, the logs indicate that the observer is able to connect ("There is a
connection already for server..." in the logs) but is unable to make progress,
which implies that it isn't receiving responses.
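The retry cadence in the quoted log below (400, 800, 1600, ..., 51200, then a
steady 60000) is the election notification backoff: the timeout doubles after
every round that gets no response, up to a cap. A minimal sketch of that
behavior; the 400 ms start and 60000 ms cap are read off the log (in the
ZooKeeper source the cap corresponds to maxNotificationInterval):

```python
from itertools import islice

def notification_timeouts(initial_ms=400, cap_ms=60000):
    """Yield successive notification timeouts: double after each
    round with no response, capped at cap_ms (values as in the log)."""
    t = initial_ms
    while True:
        yield t
        t = min(2 * t, cap_ms)

print(list(islice(notification_timeouts(), 10)))
# [400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 60000, 60000]
```

Once the cap is reached the observer only retries once a minute, which is
consistent with the roughly 15-minute recovery reported below.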
>
> This is what i see from the Server logs
> 2015-10-09 16:02:45,780 [myid:3] - ERROR [LearnerHandler-/10.10.4.46:38161
> :LearnerHandler@633] - Unexpected exception causing shutdown while sock
> still open
> 2015-10-09 16:19:28,160 [myid:3] - WARN
> [RecvWorker:5:QuorumCnxManager$RecvWorker@780] - Connection broken for id
> 5, my id = 3, error =
>
These lines belong to different parts of the code path. The first actually
indicates that the observer found the leader, but could not keep going. For
this particular time period, it could be that the observer is trying to sync up
with the leader, but the value of syncLimit is too low and the leader is
abandoning the observer. One reason for that to happen is the snapshot being
large (or your available bandwidth being low).
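For reference, syncLimit (and initLimit, which bounds the initial
connect-and-sync) are set in zoo.cfg and are expressed in ticks; the numbers
below are illustrative, not a recommendation:

```
# zoo.cfg (illustrative values)
tickTime=2000    # length of one tick, in ms
initLimit=10     # ticks a learner has to connect and sync (here 20 s)
syncLimit=5      # ticks a learner may fall behind the leader (here 10 s)
# Over a WAN link with a large snapshot, raising syncLimit (and possibly
# initLimit) gives the observer more time before the leader abandons it.
```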
The second line comes from the part of the code that manages connections for
leader election (QuorumCnxManager). It likewise reports that the connection broke.
-Flavio
> On Wed, Oct 14, 2015 at 1:28 AM, Flavio Junqueira <[email protected]> wrote:
>
>> Can you tell why the server wasn't responding to the notifications from
>> the observer? The log file is from the observer, and it sounds like the
>> observer is able to send messages out, but it isn't clear why the server
>> isn't responding.
>>
>> -Flavio
>>
>>> On 14 Oct 2015, at 01:51, elastic search <[email protected]> wrote:
>>>
>>>
>>> Hello Experts
>>>
>>> We have 2 Observers running in AWS connecting to the local ZK ensemble
>>> in our own DataCenter.
>>>
>>> There have been instances where we see the network drop for a minute
>>> between the two sites. However, the Observers take around 15 minutes to
>>> recover even when the outage lasts only a minute.
>>>
>>> From the logs
>>> java.net.SocketTimeoutException: Read timed out
>>> 2015-10-13 22:26:03,927 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 400
>>> 2015-10-13 22:26:04,328 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 800
>>> 2015-10-13 22:26:05,129 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 1600
>>> 2015-10-13 22:26:06,730 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 3200
>>> 2015-10-13 22:26:09,931 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 6400
>>> 2015-10-13 22:26:16,332 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 12800
>>> 2015-10-13 22:26:29,133 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 25600
>>> 2015-10-13 22:26:54,734 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 51200
>>> 2015-10-13 22:27:45,935 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 60000
>>> [... the same "Notification time out: 60000" entry repeats once per
>>> minute through 22:40:45 ...]
>>> 2015-10-13 22:41:45,949 [myid:4] - INFO
>> [QuorumPeer[myid=4]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] -
>> Notification time out: 60000
>>>
>>> And then it finally exits the QuorumCnxManager run loop with the
>>> following message:
>>> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@780] - Connection
>> broken for id 2
>>>
>>> How can we ensure the observer is not out of service for such a long
>>> duration?
>>>
>>> Attached the full logs
>>>
>>> Please help
>>> Thanks
>>>
>>> <zookeeper.log.zip>
>>
>>