Not sure what happened, but the issue went away once I revived the broker id
on a new host.

But it does seem that host D's ISR leadership could not be cleared until
another member of the ISR came back. Somehow D was stale and remained stuck
(and clients therefore kept trying to connect to it).
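
To make the observed pattern concrete, here is a toy model in Python. It is
only a sketch of what the --describe output looked like, not Kafka's actual
controller logic: the function name shrink_isr and the rule "never remove the
last remaining ISR member" are assumptions chosen to reproduce what I saw.

```python
def shrink_isr(isr, dead_broker):
    """Toy rule: drop a dead broker from the ISR unless it is the last member."""
    remaining = [b for b in isr if b != dead_broker]
    # Assumption (not confirmed Kafka behaviour): the ISR is never emptied,
    # so the last member stays listed even after it dies.
    return remaining if remaining else list(isr)


# Two example partitions, keyed by their replica assignment.
partitions = {"A,D": ["A", "D"], "C,D": ["C", "D"]}

# Broker C dies first, then broker D.
for dead in ["C", "D"]:
    partitions = {name: shrink_isr(isr, dead) for name, isr in partitions.items()}

print(partitions)  # {'A,D': ['A'], 'C,D': ['D']}
```

Under that assumed rule, the partition replicated on A and D drops D from its
ISR, while the partition replicated only on C and D keeps showing D, which
matches the pattern reported in the original message.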

Jason

On Mon, Nov 17, 2014 at 2:06 PM, Jason Rosenberg <j...@squareup.com> wrote:

> We have had 2 nodes in a 4 node cluster die this weekend, sadly.
> Fortunately there was no critical data on these machines yet.
>
> The cluster is running 0.8.1.1, and using replication factor of 2 for 2
> topics, each with 20 partitions.
>
> For sake of discussion, assume that nodes A and B are still up, and C and
> D are now down.
>
> As expected, partitions that had one replica on a good host (A or B) and
> one on a bad node (C or D), had their ISR shrink to just 1 node (A or B).
>
> Roughly 1/6 of the partitions had their 2 replicas on the 2 bad nodes, C
> and D.  For these, I was expecting the ISR to show up as empty, and the
> partition unavailable.
>
> However, that's not what I'm seeing.  When running TopicCommand
> --describe, I see that the ISR still shows 1 replica, on node D (D was the
> second node to go down).
>
> And, producers are still periodically trying to produce to node D (but
> failing and retrying to one of the good nodes).
>
> So, it seems the cluster's metadata is still thinking that node D is up
> and serving the partitions that were only replicated on C and D.   However,
> for partitions that were on A and D, or B and D, D is not shown as being in
> the ISR.
>
> Is this correct?  Should the cluster continue showing the last node to
> have been alive for a partition as still in the ISR?
>
> Jason
>
>
>
