I could go either way. I think that if ZK is up, then Kafka's going to go crazy trying to figure out who's the master of what, but maybe I'm not thinking the problem through clearly.
That does raise the question: it seems like it'd be good to have something written down somewhere to say how one should do a whole-cluster shutdown if one has to, and how one should recover from a surprise whole-cluster shutdown (e.g., someone hits the emergency-power-off button) should one happen. In any long-lived Kafka cluster, that's going to happen *eventually*; even if the answer is "you're doomed, start over", at least that's documented and it can be worked into the plan. If one has an emergency power-off, or needs to shut it all down and bring it all back up, what *should* the order of operations be, then?

In my particular situation, I think that what would have happened is that once we brought the rebuilt hosts back into the cluster, they'd have recreated the relevant partitions -- with no data, of course -- and negotiated who's the leader, and other than the data loss we might be OK. I'm not *as* clear, though, on what happens to the offsets for those partitions in that scenario. Would Kafka fish up the last offset for those partitions from ZK and start there? Or would the offsets for those partitions be reset to zero? If they're reset to 0, I could see much client wackiness, as clients say things like "message 1000? pah! I don't need that, my offset is 635213513516!", leading to desperation moves like changing group IDs or poking Zookeeper in the eye from zookeeper-client. What's supposed to happen there?

Finally, one thing that *is* clear is that messing around with the topics while things are in this sort of deranged state leads to tears. We tried to do some things like delete _schemas before the broken hosts were repaved and brought back online, and after that point nothing I could do seemed to restore _schemas to functioning. The deletion didn't seem to happen. The partition data in ZK ended up completely missing. None of the brokers seemed to want to forget the metadata for that topic, because they'd all decided they weren't the leader.
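To make the offset question concrete, here's a toy model of what I *think* a consumer does when its committed offset falls outside the broker's valid range. This is my sketch, not Kafka's actual code: the function name is made up, and the auto.offset.reset values used are the 0.8-era "smallest"/"largest".

```python
# Toy model (NOT Kafka's implementation) of how a consumer might resolve
# a committed offset that is out of range after a partition was rebuilt
# from scratch, based on the auto.offset.reset policy.
def resolve_fetch_offset(stored, log_start, log_end, reset_policy="largest"):
    """Return the offset a consumer would actually fetch from.

    If the offset committed in ZK is still within the log's valid range,
    use it; otherwise fall back per auto.offset.reset:
    'smallest' -> start of log, 'largest' -> end of log,
    anything else -> give up (out-of-range error surfaces to the client).
    """
    if log_start <= stored <= log_end:
        return stored
    if reset_policy == "smallest":
        return log_start
    if reset_policy == "largest":
        return log_end
    raise ValueError("offset out of range and no reset policy set")

# A rebuilt partition restarts at offset 0; suppose it has accumulated
# 1000 new messages.  A consumer whose committed offset is the old
# 635213513516 is now far out of range:
print(resolve_fetch_offset(635213513516, 0, 1000, "smallest"))  # 0 (reprocess)
print(resolve_fetch_offset(635213513516, 0, 1000, "largest"))   # 1000 (skip)
```

Either way the client's behavior is surprising: it silently rewinds and reprocesses, or silently skips to the end, which matches the "wackiness" I'd expect.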
Attempts at getting them to redo leader election didn't seem to make a difference. Restarting the brokers (doing a rolling restart, with 10 minutes in between in case things needed to do replication -- which they shouldn't have, since we'd cut off the inbound data feeds!) just ended up with the fun of https://issues.apache.org/jira/browse/KAFKA-1554.

Stopping all the brokers at once, deleting the /admin/delete_topics/_schemas and /brokers/topics/_schemas keys from ZK, deleting any 10485760-byte index files just in case, deleting the directories for _schemas-0 everywhere, and starting everything again seems to have resulted in a completely unstable, unusable cluster, with the same error as KAFKA-1554, but with index files that aren't the usual 10485760-byte junk size.

I figure we'll pave it and start over, but I think it'd be useful (not just to me) to have a better idea of the failure states here and how to recover from them.

    -Steve

On Fri, Aug 07, 2015 at 08:36:28PM +0000, Daniel Compton wrote:
> I would have thought you'd want ZK up before Kafka started, but I don't
> have any strong data to back that up.
> On Sat, 8 Aug 2015 at 7:59 AM Steve Miller <st...@idrathernotsay.com> wrote:
>
> > So... we had an extensive recabling exercise, during which we had to
> > shut down and derack and rerack a whole Kafka cluster. Then when we
> > brought it back up, we discovered the hard way that two hosts had their
> > "rebuild on reboot" flag set in Cobbler.
> >
> > Everything on those hosts is gone as a result, of course. And a total
> > of four partitions had their primary and their replica on the two hosts
> > that were nuked.
> >
> > This isn't the end of the world, in some sense: it's annoying, but
> > that's why we did this now, before we brought the cluster into "real"
> > production rather than being in a pre-production state.
> > The data is all transient anyway (well, except for _schemas, of course,
> > which in accordance with Murphy's law was one of the topics affected,
> > but we have that mirrored elsewhere).
> >
> > Still, if there's an obvious way to recover from this, I couldn't find
> > it googling around for a while.
> >
> > What's the recommended approach here? Do we need to delete these
> > topics and start over? Do we need to delete *everything* and start over?
> >
> > (Also, other than "don't do that!", what's the recommended way to deal
> > with the situation where you need to take a whole cluster down all at
> > once? Any order of operations related to how you shut down all the Kafka
> > nodes, especially WRT how you shut down Zookeeper? We deliberately brought
> > Kafka up first *without* ZK, then brought up ZK, so that the brokers
> > wouldn't go nuts with leader election and the like, which seemed to make
> > sense, FWIW.)
> >
> >     -Steve
> >
> --
> --
> Daniel
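P.S. For anyone who finds this in the archives: the manual cleanup I described above boils down to roughly the following. This is a sketch of what we did, not a recommendation (it left us with the instability described). The zookeeper-client steps are shown as comments since they need a live ensemble; the file-side steps are demonstrated against a scratch directory, since the real log.dirs location varies per install.

```shell
# ZK side (in zookeeper-client, with ALL brokers stopped first):
#   rmr /admin/delete_topics/_schemas
#   rmr /brokers/topics/_schemas

# File side, demonstrated against a scratch directory standing in for a
# broker's log.dirs (the /tmp path here is just for illustration):
LOGDIR=/tmp/kafka-logs-demo
mkdir -p "$LOGDIR/_schemas-0"

# Stand-in for one of the 10485760-byte junk-sized index files we saw:
dd if=/dev/zero of="$LOGDIR/_schemas-0/00000000000000000000.index" \
   bs=1048576 count=10 2>/dev/null

# Delete any index file of exactly the junk size, just in case:
find "$LOGDIR" -type f -name '*.index' -size 10485760c -delete

# Delete the _schemas-0 partition directories everywhere:
rm -rf "$LOGDIR/_schemas-0"

# ...then restart the brokers.  In our case this still ended in
# KAFKA-1554-style errors, so treat with suspicion.
```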