Thanks, Flavio. Do you know why node 2 could not receive ACKs from the other two nodes?
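Instead of checking by hand with telnet, a small script run from node 2 can try a TCP connection to each peer's quorum and election ports. This is a minimal Python sketch. It assumes the conventional `server.N=host:2888:3888` layout used in this thread, and the host names are hypothetical placeholders. A completed handshake only shows that the port is reachable. It does not rule out later segments being dropped, which is what the missing [ACK]s in the tcpdump capture quoted below suggest.

```python
import socket

# Hypothetical placeholders: replace with the peer hosts from your zoo.cfg.
PEERS = ["node1.example", "node3.example"]
# Assumes the conventional server.N=host:2888:3888 layout:
# 2888 = quorum port (only the current leader listens on it),
# 3888 = leader-election port (every server listens on it).
PORTS = [2888, 3888]


def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        # create_connection resolves the name and connects; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable hosts, DNS failures and timeouts
        # are all OSError subclasses in Python 3.
        return False


def check_peers(peers, ports, timeout: float = 3.0) -> dict:
    """Map every (host, port) pair to whether its TCP handshake succeeded."""
    return {(host, port): check_port(host, port, timeout)
            for host in peers for port in ports}
```

`check_peers(PEERS, PORTS)` returns a dict that maps each (host, port) pair to True or False. A False on 2888 for a peer that is a follower is expected, because only the leader accepts connections on the quorum port.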
What is the workaround in a scenario like this, where one node of a 3-node cluster is not responding?
** If we do a rolling restart, there is a possibility of downtime.
** Add 2 more nodes to the configs and do a rolling restart?
** Can you think of any way to fix node 2 so that it rejoins the cluster?

I would appreciate your reply.

On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <f...@apache.org> wrote:

> Good to hear you've been able to sort it out.
>
> -Flavio
>
> On 12 Apr 2016, at 03:02, s influxdb <elastic....@gmail.com> wrote:
>
> > Created a parallel, independent ZooKeeper cluster on the same set of
> > machines with different ports, and that worked. This indicates the
> > port was the issue.
> >
> > On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <elastic....@gmail.com> wrote:
> >
> >> Rebooting the server didn't help.
> >>
> >> On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <elastic....@gmail.com> wrote:
> >>
> >>> I ran tcpdump on all three nodes.
> >>> It looks like for every [PSH, ACK] there is a missing [ACK] from
> >>> the other nodes to this 2nd node on port 3888.
> >>>
> >>> On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <elastic....@gmail.com> wrote:
> >>>
> >>>> Thanks, Flavio, for your quick replies.
> >>>> The ZooKeeper version is 3.4.6.
> >>>>
> >>>> On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <f...@apache.org> wrote:
> >>>>
> >>>>> You need to determine why it is not receiving notification messages.
> >>>>> From the information you've given, it doesn't look like a ZooKeeper
> >>>>> code issue.
> >>>>>
> >>>>> BTW, which version are you using?
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On 7 Apr 2016 21:20, "s influxdb" <elastic....@gmail.com> wrote:
> >>>>>
> >>>>>> Nothing in the iptables firewall.
> >>>>>>
> >>>>>> What options do I have to reconnect this node to the cluster?
> >>>>>>
> >>>>>> On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <elastic....@gmail.com> wrote:
> >>>>>>
> >>>>>>> telnet works on 2888 and 3888 to the other nodes. Now I see
> >>>>>>> "java.net.SocketTimeoutException: connect timed out" messages in
> >>>>>>> the logs for node 2.
> >>>>>>>
> >>>>>>> On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <f...@apache.org> wrote:
> >>>>>>>
> >>>>>>>> I only see notifications from the node to itself. It says that it is
> >>>>>>>> connected to 1, but it doesn't seem to be receiving the notification
> >>>>>>>> from 1. It also doesn't seem to be receiving the connection request
> >>>>>>>> from 3.
> >>>>>>>>
> >>>>>>>> The last time I saw something like this, it was due to iptables
> >>>>>>>> rules, but if it was working before and no configuration has changed,
> >>>>>>>> then I don't know what it could be.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On 07 Apr 2016, at 05:43, s influxdb <elastic....@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> This is the pastie:
> >>>>>>>>> http://pastie.org/10788301
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <elastic....@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> One of our nodes hit java.lang.OutOfMemoryError: unable to
> >>>>>>>>>> create new native thread, and then became unresponsive.
> >>>>>>>>>>
> >>>>>>>>>> We tried to add the node back to the cluster, but with no luck.
> >>>>>>>>>>
> >>>>>>>>>> It doesn't seem to "Receive any notification" messages from the
> >>>>>>>>>> other nodes. It keeps "Sending notifications" in a loop.
> >>>>>>>>>>
> >>>>>>>>>> Please see the attached logs of the node that is out of rotation.
> >>>>>>>>>>
> >>>>>>>>>> Any input appreciated.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
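On the rolling-restart concern raised at the top of this message: a ZooKeeper ensemble only serves requests while a strict majority of its voting servers is up. The Python sketch below captures that arithmetic. The function names are illustrative and not part of any ZooKeeper API.

```python
def quorum_size(ensemble: int) -> int:
    """Smallest strict majority of an ensemble of voting servers."""
    return ensemble // 2 + 1


def tolerated_failures(ensemble: int) -> int:
    """How many voting servers can be down while a quorum still exists."""
    return ensemble - quorum_size(ensemble)


def restart_keeps_quorum(ensemble: int, already_down: int) -> bool:
    """True if taking one more server down (e.g. to restart it) still leaves a quorum."""
    return ensemble - already_down - 1 >= quorum_size(ensemble)
```

With node 2 already out of a 3-server ensemble, `restart_keeps_quorum(3, 1)` is False, which is why restarting either healthy node means downtime. With 5 servers and one down, `restart_keeps_quorum(5, 1)` is True. The sketch models only the steady state. ZooKeeper 3.4.x has static membership, so during a rolling restart that adds servers, some servers run with the old 3-server config while others already use the new 5-server one. The sketch says nothing about whether that transition is safe.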