[ https://issues.apache.org/jira/browse/KAFKA-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764311#comment-16764311 ]
Ismael Juma edited comment on KAFKA-7845 at 2/10/19 6:07 AM:
-------------------------------------------------------------

Thanks for the report. Is this a duplicate of KAFKA-7890 or KAFKA-7755?


was (Author: ijuma):
    Thanks for the report. Is this a duplicate of [KAFKA-7890|https://github.com/apache/kafka/commit/8576667957967cc5d815cc37a2bfb62b6e320069] or KAFKA-7755?

> Kafka clients do not re-resolve ips when a broker is replaced.
> --------------------------------------------------------------
>
>                 Key: KAFKA-7845
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7845
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0
>            Reporter: Jennifer Thompson
>            Priority: Major
>
> When one of our Kafka brokers dies and a new one replaces it (via an AWS
> ASG), the clients that publish to Kafka still try to publish to the old
> brokers.
> We see errors like
> {code:java}
> 2019-01-18 20:16:16 WARN NetworkClient:721 - [Producer clientId=producer-1] Connection to node 2 (/10.130.98.111:9092) could not be established. Broker may not be available.
> 2019-01-18 20:19:09 WARN Sender:596 - [Producer clientId=producer-1] Got error produce response with correlation id 3414 on topic-partition aa.pga-2, retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION
> 2019-01-18 20:19:09 WARN Sender:641 - [Producer clientId=producer-1] Received invalid metadata error in produce request on partition aa.pga-2 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
> 2019-01-18 20:21:19 WARN NetworkClient:721 - [Producer clientId=producer-1] Connection to node 2 (/10.130.98.111:9092) could not be established. Broker may not be available.
> 2019-01-18 20:21:50 ERROR ProducerBatch:233 - Error executing user-provided callback on message for topic-partition 'aa.test-liz-0'{code}
> and
> {code:java}
> [2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1} Failed to flush, timed out while waiting for producer to flush outstanding 27 messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
> [2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1} Failed to commit offsets (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
> {code}
> The IP address referenced is for the broker that died. We have Kafka Manager
> running as well, and that picks up the new broker.
> We already set
> {code:java}
> networkaddress.cache.ttl=60{code}
> in
> {code:java}
> jre/lib/security/java.security{code}
> Our Java version is "Java(TM) SE Runtime Environment (build 1.8.0_192-b12)"
> This started happening after we upgraded to 2.1. When we had Kafka 1.1, brokers
> could fail over without a problem.
> One thing that might be considered unusual about our deployment is that we
> reuse the same broker id and EBS volume for the new broker, so that
> partitions do not have to be reassigned.
> In kafka-connect, the logs look like
> {code}
> [2019-01-28 22:11:02,364] WARN [Consumer clientId=consumer-1, groupId=connect-cluster] Connection to node 3 (/10.130.153.120:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2019-01-28 22:11:02,365] INFO [Consumer clientId=consumer-1, groupId=connect-cluster] Error sending fetch request (sessionId=201133590, epoch=INITIAL) to node 3: org.apache.kafka.common.errors.DisconnectException. (org.apache.kafka.clients.FetchSessionHandler)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
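
The report mentions setting networkaddress.cache.ttl=60 in jre/lib/security/java.security. As a rough way to check what the JVM's own resolver sees, independent of any Kafka client, the sketch below sets the same security property programmatically and prints the addresses a hostname currently resolves to. The class name DnsCacheTtlCheck and its helper methods are hypothetical and not part of Kafka; the default host "localhost" is a placeholder. It only exercises the JVM DNS cache and says nothing about whether the Kafka client re-resolves addresses on reconnect.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical diagnostic helper, not part of Kafka.
public class DnsCacheTtlCheck {

    // Sets the JVM-wide positive DNS cache TTL (in seconds) and returns the
    // value now stored in the security properties. This should run before the
    // first hostname lookup, because the JDK may read the policy only once.
    static String configureTtl(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    // Returns every address the JVM resolver currently reports for the host.
    static InetAddress[] resolveAll(String host) throws UnknownHostException {
        return InetAddress.getAllByName(host);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("networkaddress.cache.ttl=" + configureTtl(60));

        // Placeholder host; pass a broker hostname as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetAddress address : resolveAll(host)) {
            System.out.println(host + " -> " + address.getHostAddress());
        }
    }
}
```

Running this with a broker hostname before and after the broker is replaced would show whether the JVM itself returns the new address once the TTL has expired.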