Not sure what happened, but the issue went away once I revived the broker id on a new host.
But it does seem that host D's ISR leadership could not be cleared until another member of the ISR came back. Somehow D was stale and remained stuck (and clients therefore kept trying to connect to it).

Jason

On Mon, Nov 17, 2014 at 2:06 PM, Jason Rosenberg <j...@squareup.com> wrote:

> We have had 2 nodes in a 4 node cluster die this weekend, sadly.
> Fortunately there was no critical data on these machines yet.
>
> The cluster is running 0.8.1.1, and using replication factor of 2 for 2
> topics, each with 20 partitions.
>
> For sake of discussion, assume that nodes A and B are still up, and C and
> D are now down.
>
> As expected, partitions that had one replica on a good host (A or B) and
> one on a bad node (C or D), had their ISR shrink to just 1 node (A or B).
>
> Roughly 1/6 of the partitions had their 2 replicas on the 2 bad nodes, C
> and D. For these, I was expecting the ISR to show up as empty, and the
> partition unavailable.
>
> However, that's not what I'm seeing. When running TopicCommand
> --describe, I see that the ISR still shows 1 replica, on node D (D was the
> second node to go down).
>
> And, producers are still periodically trying to produce to node D (but
> failing and retrying to one of the good nodes).
>
> So, it seems the cluster's metadata is still thinking that node D is up
> and serving the partitions that were only replicated on C and D. However,
> for partitions that were on A and D, or B and D, D is not shown as being in
> the ISR.
>
> Is this correct? Should the cluster continue showing the last node to
> have been alive for a partition as still in the ISR?
>
> Jason
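
For anyone checking the "roughly 1/6" figure in the quoted message, here is a minimal sketch of the arithmetic. It assumes replica pairs are spread evenly over all possible pairs of the 4 nodes, which is a simplification for illustration and not a description of how Kafka actually assigns replicas. The variable names are made up for this example.

```python
from fractions import Fraction
from itertools import combinations

# Cluster described in the quoted message: 4 nodes, C and D are down.
nodes = ["A", "B", "C", "D"]
down = {"C", "D"}
replication_factor = 2
total_partitions = 2 * 20  # 2 topics, 20 partitions each

# Every possible replica set of size 2 (assumed equally likely here).
pairs = list(combinations(nodes, replication_factor))

# Replica sets that sit entirely on the failed nodes.
lost = [p for p in pairs if set(p) <= down]

fraction_lost = Fraction(len(lost), len(pairs))
expected_lost = float(fraction_lost * total_partitions)

print("possible replica pairs:", len(pairs))
print("pairs entirely on down nodes:", lost)
print("fraction of partitions with no live replica:", fraction_lost)
print("expected count out of", total_partitions, "partitions:", expected_lost)
```

Under that assumption only the pair (C, D) out of 6 possible pairs is fully on the failed nodes, so about 6 or 7 of the 40 partitions would have no live replica.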