So we had a ZooKeeper outage the other day that somehow ended up causing Storm to delete all of its topologies. I'm looking to see if this is something anyone else has experienced, and whether a Storm upgrade might address some of my concerns.
Here is what I've figured out so far.

Setup:
- Storm 0.10 - two worker nodes, one of which runs Nimbus
- Kafka 0.8.2.1 - 3 nodes
- ZooKeeper 3.4.5 - 3 nodes

What happened:
- The ZooKeeper and Kafka clusters crashed. The Storm jobs went into a whirlwind of failures, leaving turds in /tmp that filled up the disk.
- Woke up in the morning to find all the topology jars missing, nowhere to be found.
- Looked at the Storm data in ZooKeeper; it looks like everything is missing there too (see the sketch after this list).
- Tried to republish a job: Nimbus picks it up and starts it, then decides the job shouldn't be there and kills it.
- Cleaned out the ZooKeeper data - no change.
- Cleaned out the localstate data - no change.
- Shut down Storm on node2, cleaned out the localstate on node1 plus the ZooKeeper data, and restarted Storm on node1 - success!

So I think the localstate also got corrupted. I'm not sure which one got corrupted first, but it appears Storm started trusting the wrong source of truth and decided all the jobs shouldn't be there.

Has anyone else ever run into this? Thoughts?
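
In case it helps anyone compare notes, here is a minimal sketch of how the Storm state in ZooKeeper can be inspected with the plain ZooKeeper Java client. It assumes the default storm.zookeeper.root of "/storm" and uses a made-up zk1:2181 connect string, so adjust both for your cluster; the child znode names in the comment are what I'd expect on 0.10, not a definitive list.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class StormZkInspect {
        public static void main(String[] args) throws Exception {
            // zk1:2181 is a placeholder - point this at one of your ZooKeeper nodes.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, new Watcher() {
                public void process(WatchedEvent event) {
                    // no-op watcher; we only do one-off reads
                }
            });

            // Storm keeps its cluster state under storm.zookeeper.root (default "/storm").
            // Children normally include things like storms, assignments, supervisors,
            // workerbeats, errors - if these are empty or missing, Nimbus has nothing to run.
            String root = "/storm";
            for (String child : zk.getChildren(root, false)) {
                System.out.println(root + "/" + child + " -> "
                        + zk.getChildren(root + "/" + child, false));
            }
            zk.close();
        }
    }

For what it's worth, the "localstate" I keep mentioning is the on-disk state under storm.local.dir (the nimbus and supervisor subdirectories), which is the local counterpart to the state above.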
