I ran into a very similar scenario: disk filling up --> zookeeper crash -->
topologies disappearing. The lesson learnt was to isolate zookeeper from other
processes, which should have been done from the start.
Here is the code that cleans up topologies:
https://github.com/apache/storm/blob/v0.10.0/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L810
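If it helps to see the effect of that check outside of nimbus, below is a rough
Python sketch of the same idea (not Storm's actual code): compare the topology
ids that have state in ZooKeeper against the code dirs nimbus keeps on disk,
and flag anything present in one place but not the other. It assumes the
default /storm ZooKeeper root, a storm.local.dir of /var/storm, and the kazoo
client (pip install kazoo); adjust the hosts and paths for your setup.

# Hedged diagnostic sketch (not Storm's actual cleanup code): compare topology
# ids that have state in ZooKeeper with the code dirs nimbus keeps locally.
# Assumptions: ZK root "/storm", storm.local.dir "/var/storm", kazoo installed.
import os
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"           # adjust to your ensemble
ZK_STORM_ROOT = "/storm"                          # storm.zookeeper.root
NIMBUS_STORMDIST = "/var/storm/nimbus/stormdist"  # {storm.local.dir}/nimbus/stormdist

def main():
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    try:
        # Topologies that ZooKeeper believes are active.
        active_in_zk = set(zk.get_children(ZK_STORM_ROOT + "/storms"))
    finally:
        zk.stop()

    # Topologies whose code (stormjar.jar etc.) is present on the nimbus host.
    code_on_disk = set(os.listdir(NIMBUS_STORMDIST)) if os.path.isdir(NIMBUS_STORMDIST) else set()

    # State in ZK but no code on disk is what the linked cleanup treats as corrupt.
    for topo_id in sorted(active_in_zk - code_on_disk):
        print("state in ZK but no code on disk (nimbus would clean this up):", topo_id)
    for topo_id in sorted(code_on_disk - active_in_zk):
        print("code on disk but no state in ZK:", topo_id)

if __name__ == "__main__":
    main()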



On Thu, Apr 14, 2016 at 10:42 PM, John Bush <[email protected]> wrote:

> So we had a zookeeper outage the other day that somehow ended up causing
> Storm to delete all its topologies.  I'm looking to see if this is
> something anyone else has experienced, and whether or not a Storm upgrade
> might address some of my concerns.
>
> Here is what I've figured out so far:
>
> Storm 0.10 - two worker nodes, one runs nimbus
> Kafka 0.8.2.1 - 3 nodes
> Zookeeper 3.4.5 - 3 nodes
>
> Zookeeper and kafka clusters crashed; Storm jobs went into a whirlwind of
> failures, leaving turds in /tmp and filling up the disk.
> Woke up in the morning to find all topology jars missing, nowhere to be found.
> Looked at the storm data in zookeeper; looks like everything is missing there.
> Tried to republish a job; nimbus picks it up, starts it, then decides the job
> shouldn't be there and kills it.
> Cleaned out zookeeper data - no change
> Cleaned out localstate data - no change
> Shut down storm node2, cleaned out localstate on node1 and the zookeeper data
> Restarted storm node1
> Success!
>
> So I think the localstate also got corrupted. I'm not sure which exactly got
> corrupted first, but it appears Storm started trusting the wrong source of
> truth and decided all the jobs shouldn't be there.
>
> So, has anyone else ever run into this? Thoughts?
>
>
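For reference, the "clean out the zookeeper data / localstate" steps above
amount to roughly the sketch below. It is only a hedged example, not an
official procedure: it assumes the default /storm ZooKeeper root, a
storm.local.dir of /var/storm, and that all Storm daemons are stopped before
running it. Back up anything you care about first.

# Hedged sketch of the recovery steps above: wipe Storm's ZooKeeper subtree
# and its local state, with all Storm daemons stopped. Hosts and paths are
# assumptions (storm.zookeeper.root = /storm, storm.local.dir = /var/storm).
import shutil
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"
ZK_STORM_ROOT = "/storm"
STORM_LOCAL_DIR = "/var/storm"

def wipe_storm_state():
    # Remove Storm's znodes (topologies, assignments, worker heartbeats, errors).
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    try:
        zk.delete(ZK_STORM_ROOT, recursive=True)
    except NoNodeError:
        pass  # already gone
    finally:
        zk.stop()

    # Remove nimbus/supervisor local state so it cannot disagree with ZooKeeper.
    shutil.rmtree(STORM_LOCAL_DIR, ignore_errors=True)

if __name__ == "__main__":
    wipe_storm_state()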


-- 
Regards,
Abhishek Agarwal
