Re: Ensemble fails when one node loses connectivity

Steph van Schalkwyk Thu, 01 Mar 2018 20:15:40 -0800

Hi Jim

You set it in the java.env file in /opt/zookeeper/conf.


JVMFLAGS=" -Xmx4g -Djute.maxbuffer=2147483648"

The example above sets jute.maxbuffer to 2GB, so please change the size to
suit your data :) The -Xmx4g heap setting is from a ZK node that was running
on an 8GB VM.
And yes, make sure that you apply the same setting on all the servers.
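
For what it's worth, here is a rough sketch of a java.env that derives the byte count from a size in MB, so the number is easier to get right. The names MAXBUFFER_MB and MAXBUFFER_BYTES and the 4 MB figure are just placeholders of mine, not anything ZooKeeper defines. I believe the property is parsed as a signed 32-bit integer, so the sketch keeps the value at or below 2147483647; treat that as an assumption and check it against your ZooKeeper version.

```shell
# java.env sketch (sourced by the ZooKeeper start scripts).
# MAXBUFFER_MB is a hypothetical helper variable; pick a size that fits your data.
MAXBUFFER_MB=4
MAXBUFFER_BYTES=$((MAXBUFFER_MB * 1024 * 1024))

# Assumption: jute.maxbuffer is read as a signed 32-bit integer,
# so cap the value at 2147483647.
if [ "$MAXBUFFER_BYTES" -gt 2147483647 ]; then
  echo "jute.maxbuffer too large, capping: $MAXBUFFER_BYTES" >&2
  MAXBUFFER_BYTES=2147483647
fi

# Same heap setting as in the example above; adjust to your VM.
JVMFLAGS="-Xmx4g -Djute.maxbuffer=${MAXBUFFER_BYTES}"
export JVMFLAGS
```

Keeping the size in one variable also makes it easier to copy the same file to every server in the ensemble.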

Here is one reference to it:
https://community.cloudera.com/t5/Storage-Random-Access-HDFS/zookeeper-error-Unexpected-exception-causing-shutdown-while-sock/td-p/30914

If you need more debug information, you can set the logging level as well:
-Dzookeeper.log.threshold=INFO

For example:
JVMFLAGS=" -Xmx4g -Djute.maxbuffer=2147483648 -Dzookeeper.log.threshold=DEBUG"

Good luck! I hope this works.
Steph



On Thu, Mar 1, 2018 at 8:59 PM, Jim Keeney <[email protected]> wrote:

> Steph -
>
> Read about the maxbuffer and am pretty sure that this might explain the
> behavior we are seeing, since it occurs when there has been a significant
> reboot of all the servers. We have over 2 MB of config files for all of our
> indexes, and if all the Solr nodes are syncing their configs at once it
> seems like that might overflow the buffer.
>
> Newbie question: where would I set the -Djute.maxbuffer? Should I update
> the zkServer.sh file so this is applied every time ZooKeeper is started or
> restarted?
>
> Also, I noted the caution and will make sure that all of the nodes are set
> to the same value. Saw some discussion about having to change the zkCli
> settings to be larger than that of the server. Is that true?
>
> Thanks in advance.
>
> Jim K.
>
> On Thu, Mar 1, 2018 at 9:13 PM, Jim Keeney <[email protected]> wrote:
>
> > Thanks. Yes, I have about 2MB stored in the configuration folders. I will
> > increase the jute.maxbuffer and see if that helps.
> >
> > Jim K.
> >
> > On Thu, Mar 1, 2018 at 8:58 PM, Steph van Schalkwyk <
> > [email protected]> wrote:
> >
> >> Does the log say anything about timing out on init?
> >> Your initLimit is already pretty big, but then we don't know anything
> >> about your setup.
> >> Are you storing more than 1MB in a znode? Then increase jute.maxbuffer
> >> (in java.env as a -Djute.maxbuffer=xxxxxx).
> >> I've recently run into that with Fusion 3.1.
> >> Post more details, if you would.
> >> Good luck.
> >> Steph
> >>
> >>
> >> On Thu, Mar 1, 2018 at 7:43 PM, Jim Keeney <[email protected]> wrote:
> >>
> >> > I'm using ZooKeeper with Solr to create a cluster and I have come across
> >> > what seems like unexpected behavior. The cluster is set up on AWS using
> >> > OpsWorks. I am using a 3-node ZooKeeper ensemble. The ZooKeeper config
> >> > on all three nodes is:
> >> >
> >> > clientPort=2181
> >> >
> >> > dataDir=/var/opt/zookeeper/data
> >> >
> >> > tickTime=2000
> >> >
> >> > autopurge.purgeInterval=24
> >> >
> >> > initLimit=100
> >> >
> >> > syncLimit=5
> >> >
> >> > server.1=172.31.86.130:2888:3888
> >> >
> >> > server.2=172.31.16.234:2888:3888
> >> >
> >> > server.3=172.31.73.122:2888:3888
> >> >
> >> >
> >> > Here is the issue:
> >> >
> >> > If one node in the ensemble fails or is shut down, the ensemble carries
> >> > on. However, when the node is restarted, its attempts to connect to the
> >> > other members of the cluster are rejected. The only way that I have found
> >> > to restore the ensemble is to restart all of the nodes within a short
> >> > time span of each other.
> >> >
> >> > If I do that, they are able to discover each other, carry out a proper
> >> > leader election, and restore order.
> >> >
> >> > Once they are restored everything is fine, but if one of the nodes goes
> >> > down we are faced with the same problem.
> >> >
> >> > How do I ensure that if a node goes down, it can restart and rejoin the
> >> > ensemble without having to manually restart all the other nodes?
> >> >
> >> > Any help appreciated.
> >> >
> >> > Thanks.
> >> >
> >> > Jim K.
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Jim Keeney
> >> > President, FitterWeb
> >> > E: [email protected]
> >> > M: 703-568-5887
> >> >
> >> > *FitterWeb Consulting*
> >> > *Are you lean and agile enough? *
> >> >
> >>
> >
>
