[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Lukasz Osipiuk (JIRA) Thu, 18 Mar 2010 09:08:52 -0700

     [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node3-version-2.tgz-ab

> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
>                 Key: ZOOKEEPER-713
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: node1-version-2.tgz-aa, node1-version-2.tgz-ab, 
> node1-zookeeper.log.gz, node2-version-2.tgz-aa, node2-version-2.tgz-ab, 
> node2-version-2.tgz-ac, node2-zookeeper.log.gz, node3-version-2.tgz-aa, 
> node3-version-2.tgz-ab, node3-zookeeper.log.gz, zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question - but as I am 
> attaching large files I am posting it here rather than on mailinglist.
> Today we had major failure in our production environment. Machines in 
> zookeeper cluster gone wild and all clients got disconnected.
> We tried to restart whole zookeeper cluster but cluster got stuck in leader 
> election phase.
> Calling stat command on any machine in the cluster resulted in 
> 'ZooKeeperServer not running' message
> In one of logs I noticed 'Invalid snapshot'  message which disturbed me a bit.
> We did not manage to make cluster work again with data. We deleted all 
> version-2 directories on all nodes and then cluster started up without 
> problems.
> Is it possible that snapshot/log data got corrupted in a way which made 
> cluster unable to start?
> Fortunately we could rebuild data we store in zookeeper as we use it only for 
> locks and most of nodes is ephemeral.
> I am attaching contents of version-2 directory from all nodes and server logs.
> Source problem occurred some time before 15. First cluster restart happened 
> at 15:03.
> At some point later we experimented with deleting version-2 directory so I 
> would not look at following restart because they can be misleading due to our 
> actions.
> I am also attaching zoo.cfg. Maybe something is wrong at this place. 
> As I know look into logs i see read timeout during initialization phase after 
> 20secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or other. which one? Are there any 
> downsides of increasing tickTime.
> Best regards, Łukasz Osipiuk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Reply via email to