[ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847273#action_12847273 ]
Lukasz Osipiuk commented on ZOOKEEPER-713:
------------------------------------------

We had a timeout of 5 secs when these logs were written. I have already increased it to 15 secs. We will see if that is enough.

> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
>                 Key: ZOOKEEPER-713
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: node1-version-2.tgz-aa, node1-version-2.tgz-ab,
>                      node1-zookeeper.log.gz, node2-version-2.tgz-aa,
>                      node2-version-2.tgz-ab, node2-version-2.tgz-ac,
>                      node2-zookeeper.log.gz, node3-version-2.tgz-aa,
>                      node3-version-2.tgz-ab, node3-version-2.tgz-ac,
>                      node3-zookeeper.log.gz, zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question. As I am
> attaching large files, I am posting it here rather than on the mailing list.
> Today we had a major failure in our production environment. Machines in the
> zookeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole zookeeper cluster but the cluster got stuck in
> the leader election phase.
> Calling the stat command on any machine in the cluster resulted in a
> 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which disturbed
> me a bit.
> We did not manage to make the cluster work again with its data. We deleted
> all version-2 directories on all nodes and then the cluster started up
> without problems.
> Is it possible that the snapshot/log data got corrupted in a way which made
> the cluster unable to start?
> Fortunately we could rebuild the data we store in zookeeper, as we use it
> only for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes, and
> the server logs.
> The source problem occurred some time before 15:00. The first cluster
> restart happened at 15:03.
> At some point later we experimented with deleting the version-2 directory,
> so I would not look at the subsequent restarts because they can be
> misleading due to our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there.
> As I now look into the logs I see a read timeout during the initialization
> phase after 20 secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there
> any downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
> PS. Due to the attachment size limit I used split. To untar, use
> cat nodeX-version-2.tgz-* |tar -xz

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
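
The 20-second read timeout mentioned in the issue description follows from the two quoted settings: initLimit is counted in ticks, so the limit in milliseconds is initLimit * tickTime. The Python sketch below only illustrates that arithmetic. The helper names (parse_zoo_cfg, init_timeout_ms) are hypothetical, and the sample text contains just the two values quoted in the issue, not the contents of the attached zoo.cfg.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg-style text, skipping comments and blanks."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def init_timeout_ms(cfg):
    """Return the initialization limit in milliseconds: initLimit ticks * tickTime ms."""
    return int(cfg["initLimit"]) * int(cfg["tickTime"])


# Only the two values quoted in the issue; the real zoo.cfg is an attachment
# and is not reproduced here.
SAMPLE_CFG = """\
# values quoted in ZOOKEEPER-713
tickTime=2000
initLimit=10
"""

if __name__ == "__main__":
    cfg = parse_zoo_cfg(SAMPLE_CFG)
    print(init_timeout_ms(cfg) / 1000, "seconds")
```

Because the limit is a product, raising either value lengthens it. Other tick-based limits also scale with tickTime, so under this reading raising initLimit is the narrower change.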