Is that node the only bootstrap broker provided? If the Connect worker was pinned to *only* that broker, it wouldn't have any chance of recovering correct cluster information from the healthy brokers.
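For reference, a minimal sketch of the worker setting in question (host names here are hypothetical): listing several brokers in bootstrap.servers in connect-distributed.properties means initial discovery doesn't hinge on any single node being healthy.

    # connect-distributed.properties (hypothetical host names)
    bootstrap.servers=broker1.example.com:9092,broker2.example.com:9092,broker3.example.com:9092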
It sounds like there was a separate problem as well (the broker should have figured out it was in a bad state wrt ZK), but normally we would expect Connect to detect the issue, mark the coordinator dead (as it did), and then refresh metadata to figure out which broker it should be talking to now. There are some known edge cases around how initial cluster discovery works which *might* be able to get you stuck in this situation.

-Ewen

On Tue, Mar 21, 2017 at 10:43 PM, Anish Mashankar <an...@systeminsights.com> wrote:
> Hello everyone,
> We are running a 5-broker Kafka v0.10.0.0 cluster on AWS. The Connect
> API is also on v0.10.0.0.
> We observed that the distributed Kafka connector went into an infinite
> loop of the log message
>
> (Re-)joining group connect-connect-elasticsearch-indexer.
>
> After a little more digging, there was another infinite loop of the
> pair of log messages
>
> *Discovered coordinator 1.2.3.4:9092 (id: ____ rack: null) for group x*
>
> *Marking the coordinator 1.2.3.4:9092 (id: ____ rack: null) dead for
> group x*
>
> Restarting Kafka Connect did not help.
>
> Looking at zookeeper, we realized that broker 1.2.3.4 had left the
> zookeeper cluster. This had happened due to a timeout when interacting
> with zookeeper. That broker was also the controller. Controller
> failover happened and the applications were fine, but a few days later
> we started facing the issue described above. To add to the surprise,
> the kafka daemon was still running on the host but was never trying to
> contact zookeeper. Hence, a zombie broker.
>
> Also, a Connect cluster spread across multiple hosts was not working;
> however, a single connector worked.
>
> After replacing the EC2 instance for broker 1.2.3.4, the Kafka Connect
> cluster started working fine (same broker ID).
>
> I am not sure if this is a bug. Kafka Connect shouldn't keep trying the
> same broker if it is not able to establish a connection. We use Consul
> for service discovery. As the broker was running and port 9092 was
> active, Consul reported it as healthy and pointed DNS queries to that
> broker whenever the connector tried to connect. We have disabled DNS
> caching in the java options, so Kafka Connect should have tried to
> connect to some other host each time it queried, but instead it only
> tried broker 1.2.3.4.
>
> Does Kafka Connect internally cache DNS? Is there a debugging step I am
> missing here?
> --
> Anish Samir Mashankar
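On the DNS caching question in the quoted message: the JVM-level cache is normally what the "java options" control, via the networkaddress.cache.ttl security property or the older sun.net.inetaddr.ttl system property. A minimal sketch of those settings (values are illustrative, not a recommendation for this cluster):

    # java.security (or set at runtime with java.security.Security.setProperty)
    networkaddress.cache.ttl=0

    # equivalent legacy system property on the JVM command line
    -Dsun.net.inetaddr.ttl=0

I can't say for sure whether that's what bit you here, though.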