Thanks for the response; we found the issue. We run in AWS, and inexplicably,
the new instance we launched to replace the dead one had exceedingly low
network bandwidth to exactly one of the remaining brokers, resulting in
timeouts. After rolling the dice again things are replicating normally.
One possibility is that broker 0 has exhausted it's available file
descriptors. If this is the case it will be able to maintain it's existing
connections, giving off the appearance that it is operating normally while
refusing new ones.
I don't recall the exact exception message but something along
We had a hardware failure on broker 1 of a 3 broker cluster over the weekend.
The broker was replaced, and when the replacement broker came up it started to
replicate partitions from the other 2 brokers as you'd expect. But while broker
1 (the replacement) was able to fetch properly from broker