[
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------
Attachment: node3-version-2.tgz-ab
> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
> Key: ZOOKEEPER-713
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.2.2
> Environment: debian lenny; ia64; xen virtualization
> Reporter: Lukasz Osipiuk
> Attachments: node1-version-2.tgz-aa, node1-version-2.tgz-ab,
> node1-zookeeper.log.gz, node2-version-2.tgz-aa, node2-version-2.tgz-ab,
> node2-version-2.tgz-ac, node2-zookeeper.log.gz, node3-version-2.tgz-aa,
> node3-version-2.tgz-ab, node3-zookeeper.log.gz, zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question. Since I am
> attaching large files, I am posting it here rather than on the mailing list.
> Today we had a major failure in our production environment. The machines in
> our ZooKeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole ZooKeeper cluster, but it got stuck in the
> leader election phase.
> Calling the stat command on any machine in the cluster returned a
> 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which worried me
> a bit.
> We did not manage to get the cluster working again with the existing data. We
> deleted all version-2 directories on all nodes, and then the cluster started
> up without problems.
> Is it possible that the snapshot/log data got corrupted in a way that made
> the cluster unable to start?
> Fortunately we could rebuild the data we store in ZooKeeper, as we use it only
> for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes, along
> with the server logs.
> The original problem occurred some time before 15:00. The first cluster
> restart happened at 15:03.
> At some point later we experimented with deleting the version-2 directory, so
> I would disregard the subsequent restarts, as they can be misleading due to
> our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there.
> Now that I look into the logs, I see a read timeout during the initialization
> phase after 20 seconds (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there any
> downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
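
The 20-second figure in the description follows from initLimit being counted in ticks, so the initialization timeout is initLimit multiplied by tickTime. A minimal sketch of that arithmetic is below. The helper names `parse_cfg` and `init_timeout_seconds` are hypothetical, and the sample text contains only the two values quoted in the report, not the attached zoo.cfg.

```python
# Sketch: derive the follower initialization timeout from zoo.cfg-style text.
# Assumption: the sample holds only the two settings quoted in the report;
# the real attached zoo.cfg is not reproduced here.

def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def init_timeout_seconds(cfg):
    """initLimit is a number of ticks; tickTime is the tick length in ms."""
    return int(cfg["tickTime"]) * int(cfg["initLimit"]) / 1000.0


SAMPLE = "tickTime=2000\ninitLimit=10\n"
print(init_timeout_seconds(parse_cfg(SAMPLE)))  # 20.0
```

Raising initLimit lengthens only this initialization window, whereas tickTime is also the base unit for other timing values such as heartbeats and session timeout bounds, so changing it has wider effects.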