[jira] [Comment Edited] (KAFKA-7845) Kafka clients do not re-resolve ips when a broker is replaced.

2019-02-09 Thread Ismael Juma (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764311#comment-16764311 ]

Ismael Juma edited comment on KAFKA-7845 at 2/10/19 6:07 AM:
-

Thanks for the report. Is this a duplicate of KAFKA-7890 or KAFKA-7755?


was (Author: ijuma):
Thanks for the report. Is this a duplicate of 
[KAFKA-7890|https://github.com/apache/kafka/commit/8576667957967cc5d815cc37a2bfb62b6e320069]
 or KAFKA-7755?

> Kafka clients do not re-resolve ips when a broker is replaced.
> --
>
> Key: KAFKA-7845
> URL: https://issues.apache.org/jira/browse/KAFKA-7845
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 2.1.0
> Reporter: Jennifer Thompson
> Priority: Major
>
> When one of our Kafka brokers dies and a new one replaces it (via an AWS 
> ASG), the clients that publish to Kafka still try to publish to the old 
> broker.
> We see errors like 
> {code:java}
> 2019-01-18 20:16:16 WARN NetworkClient:721 - [Producer clientId=producer-1] 
> Connection to node 2 (/10.130.98.111:9092) could not be established. Broker 
> may not be available.
> 2019-01-18 20:19:09 WARN Sender:596 - [Producer clientId=producer-1] Got 
> error produce response with correlation id 3414 on topic-partition aa.pga-2, 
> retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION
> 2019-01-18 20:19:09 WARN Sender:641 - [Producer clientId=producer-1] Received 
> invalid metadata error in produce request on partition aa.pga-2 due to 
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
> not the leader for that topic-partition.. Going to request metadata update now
> 2019-01-18 20:21:19 WARN NetworkClient:721 - [Producer clientId=producer-1] 
> Connection to node 2 (/10.130.98.111:9092) could not be established. Broker 
> may not be available.
> 2019-01-18 20:21:50 ERROR ProducerBatch:233 - Error executing user-provided 
> callback on message for topic-partition 'aa.test-liz-0'{code}
> and
> {code:java}
> [2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1} 
> Failed to flush, timed out while waiting for producer to flush outstanding 27 
> messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
> [2019-01-18 20:28:47,732] ERROR WorkerSourceTask{id=rabbit-vpc-2-kafka-1} 
> Failed to commit offsets 
> (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
> {code}
> The IP address referenced is that of the broker that died. We have Kafka 
> Manager running as well, and that picks up the new broker.
> We already set
> {code:java}
> networkaddress.cache.ttl=60{code}
> in
> {code:java}
> jre/lib/security/java.security{code}
> Our Java version is "Java(TM) SE Runtime Environment (build 1.8.0_192-b12)".
> This started happening after we upgraded to 2.1. When we had Kafka 1.1, 
> brokers could fail over without a problem.
> One thing that might be considered unusual about our deployment is that we 
> reuse the same broker id and EBS volume for the new broker, so that 
> partitions do not have to be reassigned.
> In Kafka Connect, the logs look like
> {code}
> [2019-01-28 22:11:02,364] WARN [Consumer clientId=consumer-1, 
> groupId=connect-cluster] Connection to node 3 (/10.130.153.120:9092) could 
> not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2019-01-28 22:11:02,365] INFO [Consumer clientId=consumer-1, 
> groupId=connect-cluster] Error sending fetch request (sessionId=201133590, 
> epoch=INITIAL) to node 3: org.apache.kafka.common.errors.DisconnectException. 
> (org.apache.kafka.clients.FetchSessionHandler)
> {code}
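
For reference, the JVM DNS cache setting mentioned in the description can also be applied programmatically rather than by editing jre/lib/security/java.security. The following is a minimal sketch, not something taken from the report: the class name DnsCacheTtlSketch and the helper applyTtl are hypothetical, and it assumes the property is set before the JVM performs its first hostname lookup, since the cache policy is typically read once at initialization.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper class, used only for illustration.
final class DnsCacheTtlSketch {

    // Sets the JVM-wide TTL (in seconds) for successful DNS lookups and
    // returns the value that is now stored in the security properties.
    static String applyTtl(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
        return Security.getProperty("networkaddress.cache.ttl");
    }

    public static void main(String[] args) throws UnknownHostException {
        // Assumption: this runs before any other code resolves a hostname.
        System.out.println("networkaddress.cache.ttl=" + applyTtl(60));

        // Optionally resolve a hostname passed on the command line to see
        // which addresses the JVM currently returns for it.
        if (args.length > 0) {
            for (InetAddress addr : InetAddress.getAllByName(args[0])) {
                System.out.println(args[0] + " -> " + addr.getHostAddress());
            }
        }
    }
}
```

Note that this setting only controls how long the JVM caches a resolved address. It does not by itself make a client re-resolve a hostname it has already connected to, which is the behavior this issue is about.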



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7845) Kafka clients do not re-resolve ips when a broker is replaced.

2019-02-11 Thread Jennifer Thompson (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765577#comment-16765577
 ] 

Jennifer Thompson edited comment on KAFKA-7845 at 2/12/19 1:07 AM:
---

Quite probably. I tried using the 2.1.1-rc1 snapshot jars in the clients, but 
that didn't fix the problem. I also tried different configs like 
"client.dns.lookup" but didn't get anywhere.

I suspect that the clients are still getting stale data from the other Kafka 
brokers. I can't use 2.1.1 for our brokers because we are using the Confluent 
distribution, which seems to require extra classes in the kafka-clients jar. I 
will try again when Confluent 5.1.1 is available.
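
For readers unfamiliar with the setting, a minimal sketch of what a "client.dns.lookup" configuration could look like is shown below. The class name, the helper clientProps, and the bootstrap hostname are hypothetical and not taken from this report; "use_all_dns_ips" is one of the values the client accepts for this key. As stated above, changing this setting did not resolve the problem in this deployment.

```java
import java.util.Properties;

// Hypothetical helper class, used only for illustration.
final class ClientDnsLookupSketch {

    // Builds client properties that ask the client to try every IP address
    // returned by DNS for a hostname, instead of only the first one.
    static Properties clientProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("client.dns.lookup", "use_all_dns_ips");
        return props;
    }

    public static void main(String[] args) {
        // The hostname below is a placeholder, not a real broker address.
        Properties props = clientProps("broker-1.example.internal:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object would be passed to the producer or consumer constructor together with the usual serializer settings.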


was (Author: jentho):
Quite probably. I tried using the 2.1.1-rc1 snapshot jars in the clients, but 
that didn't fix the problem. I also tried different configs like 
"client.dns.lookup" but didn't get anywhere.

I suspect that they are still getting stale data from the other kafka brokers. 
I can't our brokers with 2.1.1 because we are using the confluent distribution, 
which seems to require extra classes in the kafka-clients jar. I will try again 
when confluent 5.1.1 is available.



