Rui Abreu created KAFKA-9531:
--------------------------------

             Summary: java.net.UnknownHostException loop on VM rolling update using CNAME
                 Key: KAFKA-9531
                 URL: https://issues.apache.org/jira/browse/KAFKA-9531
             Project: Kafka
          Issue Type: Bug
          Components: clients, controller, producer 
    Affects Versions: 2.4.0
            Reporter: Rui Abreu


Hello,

My cluster setup is based on VMs behind a DNS CNAME.

Example: node.internal is a CNAME to either nodeA.internal or nodeB.internal.
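
For context, here is a minimal sketch (not taken from the report) of how such a consumer is typically wired up: the bootstrap address is the CNAME, so the broker it actually resolves to changes on every rolling update. The topic name, group id, and serializers below are illustrative placeholders; only node.internal:9092 comes from the setup above.

{code:java}
// Minimal sketch of a consumer bootstrapping through the CNAME.
// Topic, group id and serializers are assumptions, not from the report.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CnameConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // node.internal is the CNAME; the node it points to changes on every rolling update.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "node.internal:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer.group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Note: the client also has a DNS-related setting, "client.dns.lookup" (2.4.0),
        // which is NOT mentioned in this report and is not a confirmed workaround here.

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic"));
            consumer.poll(Duration.ofSeconds(1));
        }
    }
}
{code}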

Since kafka-client 1.2.1, it has been observed that Kafka clients sometimes get stuck in a loop with the exception below. Example after nodeB.internal is replaced with nodeA.internal:

 
{code:java}
2020-02-10T12:11:28.181Z o.a.k.c.NetworkClient [WARN]    - [Consumer clientId=consumer-6, groupId=consumer.group] Error connecting to node nodeB.internal:9092 (id: 2 rack: null)
java.net.UnknownHostException: nodeB.internal:9092
        at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_222]
        at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_222]
        at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_222]
        at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104) ~[stormjar.jar:?]
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403) ~[stormjar.jar:?]
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363) ~[stormjar.jar:?]
        at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151) ~[stormjar.jar:?]
        at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:943) ~[stormjar.jar:?]
        at org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:68) ~[stormjar.jar:?]
        at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1114) ~[stormjar.jar:?]
        at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1005) ~[stormjar.jar:?]
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:537) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:161) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:366) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220) ~[stormjar.jar:?]
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.pollKafkaBroker(KafkaSpout.java:365) ~[stormjar.jar:?]
        at org.apache.storm.kafka.spout.KafkaSpout.nextTuple(KafkaSpout.java:294) ~[stormjar.jar:?]
        at org.apache.storm.daemon.executor$fn__10715$fn__10730$fn__10761.invoke(executor.clj:649) ~[storm-core-1.1.3.jar:1.1.3]
        at org.apache.storm.util$async_loop$fn__553.invoke(util.clj:484) ~[storm-core-1.1.3.jar:1.1.3]
        at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.7.0.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
{code}
 

The time the client spends in this loop is arbitrary, but it effectively stops consuming while the loop lasts.
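
For illustration only (this is not the client's code), the failing step in the trace above is a plain JDK name lookup: ClientUtils.resolve ends up in InetAddress.getAllByName, which throws UnknownHostException once nodeB.internal no longer resolves. A standalone sketch of that failure mode, with the hostname, retry count, and sleep interval as assumptions:

{code:java}
// Standalone illustration of the failing step in the stack trace: a retired
// hostname that keeps being re-resolved fails with UnknownHostException on
// every attempt, matching the repeated WARN lines in the log above.
import java.net.InetAddress;
import java.net.UnknownHostException;

public class StaleHostnameResolution {
    public static void main(String[] args) throws InterruptedException {
        String staleHost = "nodeB.internal"; // retired node; only nodeA.internal still resolves

        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                InetAddress[] addresses = InetAddress.getAllByName(staleHost);
                System.out.println("Resolved " + staleHost + " to " + addresses.length + " address(es)");
                break;
            } catch (UnknownHostException e) {
                // The client's reconnect path hits this same exception and retries later.
                System.out.println("Attempt " + attempt + " failed: " + e);
                Thread.sleep(1000); // stand-in for the client's reconnect backoff
            }
        }
    }
}
{code}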

This error contrasts with instances where the client is able to recover on its 
own after a few seconds:


{code:java}
2020-02-08T01:15:37.390Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer clientId=consumer-7, groupId=consumer-group] Group coordinator nodeA.internal:9092 (id: 2147483645 rack: null) is unavailable or invalid, will attempt rediscovery

2020-02-08T01:15:37.885Z o.a.k.c.c.i.AbstractCoordinator [INFO] - [Consumer clientId=consumer-7, groupId=consumer-group] Discovered group coordinator nodeB.internal:9092 (id: 2147483646 rack: null)

2020-02-08T01:15:37.885Z o.a.k.c.ClusterConnectionStates [INFO] - [Consumer clientId=consumer-7, groupId=consumer-group] Hostname for node 2147483646 changed from nodeA.internal to nodeB.internal
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
