Re: node 2 not rejoining cluster

Flavio Junqueira Thu, 14 Apr 2016 02:54:46 -0700

Other than some kind of funky packet filtering rule, I'm not sure why you'd not 
be receiving the ACKs.
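
As a rough first check for filtering, here is a minimal Python sketch that 
tries a TCP connect to each peer on the two ports mentioned in this thread 
(2888 and 3888). The function names and the example hostnames are assumptions 
made for illustration, not part of ZooKeeper. A successful connect only shows 
that the handshake completes; it does not rule out later packets being dropped, 
so a tcpdump capture remains the more telling test.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_peers(peers, ports=(2888, 3888), timeout=3.0):
    """Map each (host, port) pair to True/False reachability.

    The default ports are the quorum and election ports named in this
    thread; adjust them to match your own zoo.cfg.
    """
    return {(h, p): can_connect(h, p, timeout) for h in peers for p in ports}
```

Running check_peers(["zk1.example.com", "zk3.example.com"]) from node 2 
(hypothetical hostnames) would show which peer/port pairs accept a connection.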


I think that reconfiguring isn't the right way of addressing the problem. If 
you have some underlying issue, configuration or even bad hardware, then adding 
more nodes will not fix it. Even worse, it might lurk there for some time 
and come back to bite you later.

If you do lose a machine (e.g., permanent failure, decommission), then it does 
make sense to reconfigure the ensemble.
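
Before concluding that a machine is really lost, it can help to ask each server 
whether it is serving. Below is a minimal Python sketch that sends the "srvr" 
four-letter command and reads the "Mode:" line from the reply. It assumes the 
default client port 2181, which is not stated anywhere in this thread, and the 
helper names are made up for illustration. A server that has not joined the 
quorum typically replies without a "Mode:" line, in which case parse_mode 
returns None.

```python
import socket


def four_letter_word(host, port=2181, command="srvr", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply.

    Port 2181 is an assumed default client port; use your clientPort value.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Return the value of the 'Mode:' line (e.g. leader, follower), or None."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling parse_mode(four_letter_word(host)) for each server in the ensemble 
gives a quick view of which ones are acting as leader or follower.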
 
-Flavio

  
> On 14 Apr 2016, at 01:12, s influxdb <elastic....@gmail.com> wrote:
> 
> Thanks Flavio. 
> 
> Would you know why node 2 could not receive ACKs from the other 2 nodes?
> 
> What is the workaround in scenarios like this, where 1 node in a 3-node 
> cluster is not responding?
> ** If we do a rolling restart, there is a possibility of downtime.
> ** Add 2 more nodes to the configs and do a rolling restart.
> ** Could you think of any way to fix node 2 so that it rejoins the cluster?
> 
> Would appreciate your reply.
> 
> 
> 
> On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <f...@apache.org 
> <mailto:f...@apache.org>> wrote:
> Good to hear you've been able to sort it out.
> 
> -Flavio
> 
> > On 12 Apr 2016, at 03:02, s influxdb <elastic....@gmail.com 
> > <mailto:elastic....@gmail.com>> wrote:
> >
> > Created a parallel, independent ZooKeeper cluster on the same set of
> > machines with different ports, and that worked. This indicates the port was
> > the issue.
> >
> > On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <elastic....@gmail.com 
> > <mailto:elastic....@gmail.com>> wrote:
> >
> >> Rebooting the server didn't help.
> >>
> >> On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <elastic....@gmail.com 
> >> <mailto:elastic....@gmail.com>> wrote:
> >>
> >>> I ran tcpdump on all the three nodes.
> >>> It looks like for every [PSH, ACK] there is a missing [ACK] from
> >>> the other nodes to this 2nd node on port 3888.
> >>>
> >>>
> >>> On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <elastic....@gmail.com 
> >>> <mailto:elastic....@gmail.com>> wrote:
> >>>
> >>>> Thanks Flavio for your quick replies.
> >>>> The zookeeper version is 3.4.6
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <f...@apache.org 
> >>>> <mailto:f...@apache.org>>
> >>>> wrote:
> >>>>
> >>>>> You need to determine why it is not receiving notification messages.
> >>>>> From the information you've given, it doesn't look like a ZooKeeper
> >>>>> code issue.
> >>>>>
> >>>>> BTW, which version are you using?
> >>>>>
> >>>>> -Flavio
> >>>>> On 7 Apr 2016 21:20, "s influxdb" <elastic....@gmail.com 
> >>>>> <mailto:elastic....@gmail.com>> wrote:
> >>>>>
> >>>>>> Nothing on the iptables firewall.
> >>>>>>
> >>>>>> What options do I have to reconnect this node to the cluster?
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <elastic....@gmail.com 
> >>>>>> <mailto:elastic....@gmail.com>> wrote:
> >>>>>>
> >>>>>>> telnet works on 2888 and 3888 to the other nodes. Now I see
> >>>>>>> java.net.SocketTimeoutException: connect timed out messages in the
> >>>>>>> logs for node 2.
> >>>>>>>
> >>>>>>> On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <f...@apache.org 
> >>>>>>> <mailto:f...@apache.org>> wrote:
> >>>>>>>
> >>>>>>>> I only see notifications from the node to itself. It says that it is
> >>>>>>>> connected to 1, but it doesn't seem to be receiving the notification
> >>>>>>>> from 1. It also doesn't seem to be receiving the connection request
> >>>>>>>> from 3.
> >>>>>>>>
> >>>>>>>> The last time I saw something like this, it was due to iptables
> >>>>>>>> rules, but if it was working before and no configuration has changed,
> >>>>>>>> then I don't know what it could be.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>>> On 07 Apr 2016, at 05:43, s influxdb <elastic....@gmail.com 
> >>>>>>>>> <mailto:elastic....@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> this is the pastie
> >>>>>>>>> http://pastie.org/10788301 <http://pastie.org/10788301>
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <elastic....@gmail.com
> >>>>>>>>> <mailto:elastic....@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>>> One of our nodes hit an OOM (java.lang.OutOfMemoryError: unable to
> >>>>>>>>>> create new native thread) and then became unresponsive.
> >>>>>>>>>>
> >>>>>>>>>> We tried to add the node back to the cluster but with no luck.
> >>>>>>>>>>
> >>>>>>>>>> It doesn't seem to receive any notification messages from the other
> >>>>>>>>>> nodes. It keeps "Sending notifications" in a loop.
> >>>>>>>>>>
> >>>>>>>>>> Please see attached the logs of the node that is out of rotation.
> >>>>>>>>>>
> >>>>>>>>>> Any inputs appreciated.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> 
>
