[ https://issues.apache.org/jira/browse/KAFKA-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abhi updated KAFKA-8008: ------------------------ Description: Hi, I upgrade to Kafka v2.1.1 recently and seeing the below exceptions in all the servers. The kafka-network-thread-1-ListenerName are all consuming full cpu cycles. Lots of TCP connections are in CLOSE_WAIT state. My broker setup is using kerberos authentication with -Dsun.security.jgss.native=true. I am not sure how to handle this? Will increasing the kafka-network thread count help if it is possible? Does this seem like a bug? I am happy to help in anyway I can as this issue blocking our production usage and would like to get it resolved as early as possible. *server.log snippet from one of the servers: *[2019-02-27 00:00:02,948] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Built full fetch (sessionId=1488865423, epoch=INITIAL) for node 2 with 3 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Initiating connection to node mwkafka-prod-02.nyc.foo.com:9092 (id: 2 rack: null) using address mwkafka-prod-02.nyc.foo.com/10.219.247.26 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:02,949] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-02.nyc.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 2 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:02,949] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Completed connection to node 2. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:03,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node 5 with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:03,317] INFO [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Error sending fetch request (sessionId=397037945, epoch=INITIAL) to node 5: java.net.SocketTimeoutException: Failed to connect within 30000 ms. (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:03,317] WARN [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=10000, minBytes=1, maxBytes=10485760, fetchData={reddyvel-159-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-331-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-5-64-0=(fetchOffset=8936, logStartOffset=8936, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-tp9-78-0=(fetchOffset=247943, logStartOffset=247943, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-tp3-58-0=(fetchOffset=264495, logStartOffset=264495, maxBytes=1048576, currentLeaderEpoch=Optional[19]), fps.trsy.fe_prvt-0=(fetchOffset=24, logStartOffset=8, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-7-0=(fetchOffset=3173199, logStartOffset=3173199, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-298-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.guas.peeq.fe_marb_us-0=(fetchOffset=2, logStartOffset=2, maxBytes=1048576, currentLeaderEpoch=Optional[6]), reddyvel-108-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-988-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-111-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-409-0=(fetchOffset=3173194, logStartOffset=3173194, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-104-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.priveq.reins-0=(fetchOffset=12, logStartOffset=6, maxBytes=1048576, currentLeaderEpoch=Optional[5]), reddyvel-353-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-tp10-63-0=(fetchOffset=220652, logStartOffset=220652, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-5-86-0=(fetchOffset=8935, logStartOffset=8935, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-878-0=(fetchOffset=3173187, logStartOffset=3173187, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-621-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.agg.uopt.opt-0=(fetchOffset=28297, logStartOffset=28297, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-661-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.guas.jpeq.fe_marb-0=(fetchOffset=532, logStartOffset=10, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-607-0=(fetchOffset=3173191, logStartOffset=3173191, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.seed.ornt.desim_ornt-0=(fetchOffset=4060, logStartOffset=2433, maxBytes=1048576, currentLeaderEpoch=Optional[5]), reddyvel-962-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.agg.ornt.desim_ornt-0=(fetchOffset=1177, logStartOffset=1177, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-tp6-71-0=(fetchOffset=256309, logStartOffset=256309, maxBytes=1048576, currentLeaderEpoch=Optional[19]), fps.trsy.macro-0=(fetchOffset=324, logStartOffset=106, maxBytes=1048576, currentLeaderEpoch=Optional[3]), fps.agg.dist.treas-0=(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-newtp-8-111-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-94-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-544-0=(fetchOffset=3173192, logStartOffset=3173192, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-2-72-0=(fetchOffset=8679, logStartOffset=8679, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.guas.useq.eq_stat_arb_useq-0=(fetchOffset=6071629, logStartOffset=4470145, maxBytes=1048576, currentLeaderEpoch=Optional[4]), reddyvel-newtp-10-75-0=(fetchOffset=9768, logStartOffset=9768, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-tp8-51-0=(fetchOffset=249873, logStartOffset=249873, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-8-122-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.seed.trsy.pe_china-0=(fetchOffset=24, logStartOffset=8, maxBytes=1048576, currentLeaderEpoch=Optional[3]), fps.seed.trsy.jcas-0=(fetchOffset=93, logStartOffset=31, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-tp8-99-0=(fetchOffset=249871, logStartOffset=249871, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-643-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-577-0=(fetchOffset=3173191, logStartOffset=3173191, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-tp6-97-0=(fetchOffset=256307, logStartOffset=256307, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-6-72-0=(fetchOffset=7652, logStartOffset=7652, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-959-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-4-70-0=(fetchOffset=8828, logStartOffset=8828, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.seed.trsy.opt-0=(fetchOffset=114, logStartOffset=38, maxBytes=1048576, currentLeaderEpoch=Optional[4])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=397037945, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:95) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:192) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:274) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131) at scala.Option.foreach(Option.scala:257) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node 5 with 45 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:04,008] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:04,008] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Completed connection to node 5. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:04,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 with 48 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:05,318] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,318] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:05,319] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Completed connection to node 5. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:05,477] DEBUG Set SASL server state to AUTHENTICATE (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG [SocketServer brokerId=1] Connection with /10.236.30.31 disconnected (org.apache.kafka.common.network.Selector) java.io.EOFException at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:96) at org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:237) at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:132) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532) at org.apache.kafka.common.network.Selector.poll(Selector.java:467) at kafka.network.Processor.poll(SocketServer.scala:689) at kafka.network.Processor.run(SocketServer.scala:594) at java.base/java.lang.Thread.run(Thread.java:834) [2019-02-27 00:00:05,477] DEBUG Handling Kafka request SASL_HANDSHAKE (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG Using SASL mechanism 'GSSAPI' provided by client (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG Creating SaslServer for kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com with mechanism GSSAPI (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:06,056] INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=2] Error sending fetch request (sessionId=373847113, epoch=INITIAL) to node 4: java.net.SocketTimeoutException: Failed to connect within 30000 ms. (org.apache.kafka.clients.FetchSessionHandler) *Lsof output: *kafka...@mwkafka-prod-01.nyc[toa]:/local/kafka/logs> lsof -P -p 103485 | grep TCP | grep CLOSE java 103485 kafkagod 635u IPv4 86522305 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:46014 (CLOSE_WAIT) java 103485 kafkagod 639u IPv4 86519040 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45926 (CLOSE_WAIT) java 103485 kafkagod 642u IPv4 86519057 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47424 (CLOSE_WAIT) java 103485 kafkagod 643u IPv4 86519058 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47428 (CLOSE_WAIT) java 103485 kafkagod 683u IPv4 86509505 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57856 (CLOSE_WAIT) java 103485 kafkagod 684u IPv4 86522910 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57894 (CLOSE_WAIT) java 103485 kafkagod 688u IPv4 86522176 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45966 (CLOSE_WAIT) java 103485 kafkagod 690u IPv4 86522306 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35326 (CLOSE_WAIT) java 103485 kafkagod 695u IPv4 86522192 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45968 (CLOSE_WAIT) java 103485 kafkagod 696u IPv4 86509516 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.dr.foo.com:39676 (CLOSE_WAIT) java 103485 kafkagod 697u IPv4 86522307 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35328 (CLOSE_WAIT) java 103485 kafkagod 705u IPv4 86519026 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57838 (CLOSE_WAIT) java 103485 kafkagod 726u IPv4 86509517 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35258 (CLOSE_WAIT) *contoller logs below show that no broker is able connect to each other: * [2019-02-27 03:39:16,135] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.dr.foo.com:9092 (id: 4 rack: dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:20,915] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.tbd.foo.com:9092 (id: 6 rack: tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:25,839] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.dr.foo.com:9092 (id: 3 rack: dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:31,371] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:40,440] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.nyc.foo.com:9092 (id: 1 rack: nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:42,149] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.nyc.foo.com:9092 (id: 2 rack: nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) *Client consumer logs: consumer is not able to connect to the brokers* [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node -6 (org.apache.kafka.common.network.Selector) [2019-02-27 03:37:37,587] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] Completed connection to node -6. Fetching API versions. (org.apache.kafka.clients.NetworkClient) [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 (id: -6 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:08,618] DEBUG Set SASL client state to INITIAL (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:08,619] DEBUG Set SASL client state to INTERMEDIATE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:29,288] DEBUG Set SASL client state to SEND_HANDSHAKE_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:29,288] DEBUG Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -4 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -3 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-01.nyc.deshaw.com:9092 (id: -1 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:33,050] DEBUG Set SASL client state to SEND_HANDSHAKE_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:33,051] DEBUG Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:34,683] DEBUG Set SASL client state to CLIENT_COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:41,887] DEBUG Set SASL client state to CLIENT_COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 (id: -6 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:31,269] DEBUG Set SASL client state to COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:40:31,269] DEBUG [Consumer clientId=test_con, groupId=chow] Initiating API versions fetch from node -1. (org.apache.kafka.clients.NetworkClient) [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -4 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:42,438] DEBUG Set SASL client state to COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) was: Hi, I upgrade to Kafka v2.1.1 in hope that issue https://issues.apache.org/jira/browse/KAFKA-7925 will be fixed. However, I am still seeing the similar issue in my kafka cluster. I am seeing the same exceptions in all the servers. The kafka-network-thread-1-ListenerName are all consuming full cpu cycles. Lots of TCP connections are in CLOSE_WAIT state. My broker setup is using kerberos authentication with -Dsun.security.jgss.native=true. I am not sure how to handle this? Will increasing the kafka-network thread count help if it is possible? Does this seem like a bug? I am happy to help in anyway I can as this issue blocking our production usage and would like to get it resolved as early as possible. *server.log snippet from one of the servers: *[2019-02-27 00:00:02,948] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Built full fetch (sessionId=1488865423, epoch=INITIAL) for node 2 with 3 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Initiating connection to node mwkafka-prod-02.nyc.foo.com:9092 (id: 2 rack: null) using address mwkafka-prod-02.nyc.foo.com/10.219.247.26 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:02,949] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-02.nyc.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 2 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:02,949] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=3] Completed connection to node 2. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:03,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node 5 with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:03,317] INFO [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Error sending fetch request (sessionId=397037945, epoch=INITIAL) to node 5: java.net.SocketTimeoutException: Failed to connect within 30000 ms. (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:03,317] WARN [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=10000, minBytes=1, maxBytes=10485760, fetchData={reddyvel-159-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-331-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-5-64-0=(fetchOffset=8936, logStartOffset=8936, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-tp9-78-0=(fetchOffset=247943, logStartOffset=247943, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-tp3-58-0=(fetchOffset=264495, logStartOffset=264495, maxBytes=1048576, currentLeaderEpoch=Optional[19]), fps.trsy.fe_prvt-0=(fetchOffset=24, logStartOffset=8, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-7-0=(fetchOffset=3173199, logStartOffset=3173199, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-298-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.guas.peeq.fe_marb_us-0=(fetchOffset=2, logStartOffset=2, maxBytes=1048576, currentLeaderEpoch=Optional[6]), reddyvel-108-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-988-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-111-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-409-0=(fetchOffset=3173194, logStartOffset=3173194, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-104-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.priveq.reins-0=(fetchOffset=12, logStartOffset=6, maxBytes=1048576, currentLeaderEpoch=Optional[5]), reddyvel-353-0=(fetchOffset=3173197, logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-tp10-63-0=(fetchOffset=220652, logStartOffset=220652, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-5-86-0=(fetchOffset=8935, logStartOffset=8935, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-878-0=(fetchOffset=3173187, logStartOffset=3173187, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-621-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.agg.uopt.opt-0=(fetchOffset=28297, logStartOffset=28297, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-661-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.guas.jpeq.fe_marb-0=(fetchOffset=532, logStartOffset=10, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-607-0=(fetchOffset=3173191, logStartOffset=3173191, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.seed.ornt.desim_ornt-0=(fetchOffset=4060, logStartOffset=2433, maxBytes=1048576, currentLeaderEpoch=Optional[5]), reddyvel-962-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), fps.agg.ornt.desim_ornt-0=(fetchOffset=1177, logStartOffset=1177, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-tp6-71-0=(fetchOffset=256309, logStartOffset=256309, maxBytes=1048576, currentLeaderEpoch=Optional[19]), fps.trsy.macro-0=(fetchOffset=324, logStartOffset=106, maxBytes=1048576, currentLeaderEpoch=Optional[3]), fps.agg.dist.treas-0=(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[8]), reddyvel-newtp-8-111-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-94-0=(fetchOffset=3173198, logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-544-0=(fetchOffset=3173192, logStartOffset=3173192, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-2-72-0=(fetchOffset=8679, logStartOffset=8679, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.guas.useq.eq_stat_arb_useq-0=(fetchOffset=6071629, logStartOffset=4470145, maxBytes=1048576, currentLeaderEpoch=Optional[4]), reddyvel-newtp-10-75-0=(fetchOffset=9768, logStartOffset=9768, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-tp8-51-0=(fetchOffset=249873, logStartOffset=249873, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-8-122-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.seed.trsy.pe_china-0=(fetchOffset=24, logStartOffset=8, maxBytes=1048576, currentLeaderEpoch=Optional[3]), fps.seed.trsy.jcas-0=(fetchOffset=93, logStartOffset=31, maxBytes=1048576, currentLeaderEpoch=Optional[3]), reddyvel-tp8-99-0=(fetchOffset=249871, logStartOffset=249871, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-643-0=(fetchOffset=3173190, logStartOffset=3173190, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-577-0=(fetchOffset=3173191, logStartOffset=3173191, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-tp6-97-0=(fetchOffset=256307, logStartOffset=256307, maxBytes=1048576, currentLeaderEpoch=Optional[19]), reddyvel-newtp-6-72-0=(fetchOffset=7652, logStartOffset=7652, maxBytes=1048576, currentLeaderEpoch=Optional[18]), reddyvel-959-0=(fetchOffset=3173185, logStartOffset=3173185, maxBytes=1048576, currentLeaderEpoch=Optional[23]), reddyvel-newtp-4-70-0=(fetchOffset=8828, logStartOffset=8828, maxBytes=1048576, currentLeaderEpoch=Optional[18]), fps.seed.trsy.opt-0=(fetchOffset=114, logStartOffset=38, maxBytes=1048576, currentLeaderEpoch=Optional[4])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=397037945, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:95) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:192) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:274) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131) at scala.Option.foreach(Option.scala:257) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node 5 with 45 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:04,008] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:04,008] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=0] Completed connection to node 5. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:04,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 with 48 partition(s). (org.apache.kafka.clients.FetchSessionHandler) [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:05,318] DEBUG Set SASL client state to SEND_APIVERSIONS_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,318] DEBUG Creating SaslClient: client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) [2019-02-27 00:00:05,319] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, fetcherId=1] Completed connection to node 5. Ready. (org.apache.kafka.clients.NetworkClient) [2019-02-27 00:00:05,477] DEBUG Set SASL server state to AUTHENTICATE (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG [SocketServer brokerId=1] Connection with /10.236.30.31 disconnected (org.apache.kafka.common.network.Selector) java.io.EOFException at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:96) at org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:237) at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:132) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532) at org.apache.kafka.common.network.Selector.poll(Selector.java:467) at kafka.network.Processor.poll(SocketServer.scala:689) at kafka.network.Processor.run(SocketServer.scala:594) at java.base/java.lang.Thread.run(Thread.java:834) [2019-02-27 00:00:05,477] DEBUG Handling Kafka request SASL_HANDSHAKE (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG Using SASL mechanism 'GSSAPI' provided by client (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:05,477] DEBUG Creating SaslServer for kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com with mechanism GSSAPI (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) [2019-02-27 00:00:06,056] INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=2] Error sending fetch request (sessionId=373847113, epoch=INITIAL) to node 4: java.net.SocketTimeoutException: Failed to connect within 30000 ms. (org.apache.kafka.clients.FetchSessionHandler) *Lsof output: *kafka...@mwkafka-prod-01.nyc[toa]:/local/kafka/logs> lsof -P -p 103485 | grep TCP | grep CLOSE java 103485 kafkagod 635u IPv4 86522305 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:46014 (CLOSE_WAIT) java 103485 kafkagod 639u IPv4 86519040 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45926 (CLOSE_WAIT) java 103485 kafkagod 642u IPv4 86519057 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47424 (CLOSE_WAIT) java 103485 kafkagod 643u IPv4 86519058 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47428 (CLOSE_WAIT) java 103485 kafkagod 683u IPv4 86509505 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57856 (CLOSE_WAIT) java 103485 kafkagod 684u IPv4 86522910 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57894 (CLOSE_WAIT) java 103485 kafkagod 688u IPv4 86522176 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45966 (CLOSE_WAIT) java 103485 kafkagod 690u IPv4 86522306 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35326 (CLOSE_WAIT) java 103485 kafkagod 695u IPv4 86522192 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45968 (CLOSE_WAIT) java 103485 kafkagod 696u IPv4 86509516 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.dr.foo.com:39676 (CLOSE_WAIT) java 103485 kafkagod 697u IPv4 86522307 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35328 (CLOSE_WAIT) java 103485 kafkagod 705u IPv4 86519026 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57838 (CLOSE_WAIT) java 103485 kafkagod 726u IPv4 86509517 0t0 TCP mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35258 (CLOSE_WAIT) *contoller logs below show that no broker is able connect to each other: * [2019-02-27 03:39:16,135] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.dr.foo.com:9092 (id: 4 rack: dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:20,915] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.tbd.foo.com:9092 (id: 6 rack: tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:25,839] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.dr.foo.com:9092 (id: 3 rack: dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:31,371] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:40,440] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-01.nyc.foo.com:9092 (id: 1 rack: nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [2019-02-27 03:39:42,149] WARN [RequestSendThread controllerId=1] Controller 1's connection to broker mwkafka-prod-02.nyc.foo.com:9092 (id: 2 rack: nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) java.net.SocketTimeoutException: Failed to connect within 30000 ms at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) *Client consumer logs: consumer is not able to connect to the brokers* [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node -6 (org.apache.kafka.common.network.Selector) [2019-02-27 03:37:37,587] DEBUG Set SASL client state to RECEIVE_APIVERSIONS_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] Completed connection to node -6. Fetching API versions. (org.apache.kafka.clients.NetworkClient) [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 (id: -6 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:08,618] DEBUG Set SASL client state to INITIAL (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:08,619] DEBUG Set SASL client state to INTERMEDIATE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:29,288] DEBUG Set SASL client state to SEND_HANDSHAKE_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:29,288] DEBUG Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -4 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -3 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-01.nyc.deshaw.com:9092 (id: -1 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:39:33,050] DEBUG Set SASL client state to SEND_HANDSHAKE_REQUEST (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:33,051] DEBUG Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:34,683] DEBUG Set SASL client state to CLIENT_COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:39:41,887] DEBUG Set SASL client state to CLIENT_COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 (id: -6 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:31,269] DEBUG Set SASL client state to COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) [2019-02-27 03:40:31,269] DEBUG [Consumer clientId=test_con, groupId=chow] Initiating API versions fetch from node -1. (org.apache.kafka.clients.NetworkClient) [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 (id: -4 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) [2019-02-27 03:40:42,438] DEBUG Set SASL client state to COMPLETE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > Clients unable to connect and replicas are not able to connect to each other > ---------------------------------------------------------------------------- > > Key: KAFKA-8008 > URL: https://issues.apache.org/jira/browse/KAFKA-8008 > Project: Kafka > Issue Type: Bug > Components: controller, core > Affects Versions: 2.1.0, 2.1.1 > Environment: Java 11 > Reporter: Abhi > Priority: Critical > > Hi, > I upgrade to Kafka v2.1.1 recently and seeing the below exceptions in all the > servers. The kafka-network-thread-1-ListenerName are all consuming full cpu > cycles. Lots of TCP connections are in CLOSE_WAIT state. > My broker setup is using kerberos authentication with > -Dsun.security.jgss.native=true. > I am not sure how to handle this? Will increasing the kafka-network thread > count help if it is possible? > Does this seem like a bug? I am happy to help in anyway I can as this issue > blocking our production usage and would like to get it resolved as early as > possible. > *server.log snippet from one of the servers: > *[2019-02-27 00:00:02,948] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=3] Built full fetch (sessionId=1488865423, epoch=INITIAL) for node > 2 with 3 partition(s). (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=3] Initiating connection to node mwkafka-prod-02.nyc.foo.com:9092 > (id: 2 rack: null) using address mwkafka-prod-02.nyc.foo.com/10.219.247.26 > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:02,949] DEBUG Set SASL client state to > SEND_APIVERSIONS_REQUEST > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:02,949] DEBUG Creating SaslClient: > client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-02.nyc.foo.com;mechs=[GSSAPI] > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=3] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, > SO_TIMEOUT = 0 to node 2 (org.apache.kafka.common.network.Selector) > [2019-02-27 00:00:02,949] DEBUG Set SASL client state to > RECEIVE_APIVERSIONS_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:02,949] DEBUG [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=3] Completed connection to node 2. Ready. > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:03,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node > 5 with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:03,317] INFO [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Error sending fetch request (sessionId=397037945, epoch=INITIAL) > to node 5: java.net.SocketTimeoutException: Failed to connect within 30000 > ms. (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:03,317] WARN [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Error in response for fetch request (type=FetchRequest, > replicaId=1, maxWait=10000, minBytes=1, maxBytes=10485760, > fetchData={reddyvel-159-0=(fetchOffset=3173198, logStartOffset=3173198, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-331-0=(fetchOffset=3173197, logStartOffset=3173197, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-newtp-5-64-0=(fetchOffset=8936, logStartOffset=8936, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > reddyvel-tp9-78-0=(fetchOffset=247943, logStartOffset=247943, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > reddyvel-tp3-58-0=(fetchOffset=264495, logStartOffset=264495, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > fps.trsy.fe_prvt-0=(fetchOffset=24, logStartOffset=8, maxBytes=1048576, > currentLeaderEpoch=Optional[3]), reddyvel-7-0=(fetchOffset=3173199, > logStartOffset=3173199, maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-298-0=(fetchOffset=3173197, logStartOffset=3173197, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.guas.peeq.fe_marb_us-0=(fetchOffset=2, logStartOffset=2, > maxBytes=1048576, currentLeaderEpoch=Optional[6]), > reddyvel-108-0=(fetchOffset=3173198, logStartOffset=3173198, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-988-0=(fetchOffset=3173185, logStartOffset=3173185, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-111-0=(fetchOffset=3173198, logStartOffset=3173198, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-409-0=(fetchOffset=3173194, logStartOffset=3173194, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-104-0=(fetchOffset=3173198, logStartOffset=3173198, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.priveq.reins-0=(fetchOffset=12, logStartOffset=6, maxBytes=1048576, > currentLeaderEpoch=Optional[5]), reddyvel-353-0=(fetchOffset=3173197, > logStartOffset=3173197, maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-tp10-63-0=(fetchOffset=220652, logStartOffset=220652, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > reddyvel-newtp-5-86-0=(fetchOffset=8935, logStartOffset=8935, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > reddyvel-878-0=(fetchOffset=3173187, logStartOffset=3173187, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-621-0=(fetchOffset=3173190, logStartOffset=3173190, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.agg.uopt.opt-0=(fetchOffset=28297, logStartOffset=28297, > maxBytes=1048576, currentLeaderEpoch=Optional[8]), > reddyvel-661-0=(fetchOffset=3173190, logStartOffset=3173190, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.guas.jpeq.fe_marb-0=(fetchOffset=532, logStartOffset=10, > maxBytes=1048576, currentLeaderEpoch=Optional[3]), > reddyvel-607-0=(fetchOffset=3173191, logStartOffset=3173191, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.seed.ornt.desim_ornt-0=(fetchOffset=4060, logStartOffset=2433, > maxBytes=1048576, currentLeaderEpoch=Optional[5]), > reddyvel-962-0=(fetchOffset=3173185, logStartOffset=3173185, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > fps.agg.ornt.desim_ornt-0=(fetchOffset=1177, logStartOffset=1177, > maxBytes=1048576, currentLeaderEpoch=Optional[8]), > reddyvel-tp6-71-0=(fetchOffset=256309, logStartOffset=256309, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > fps.trsy.macro-0=(fetchOffset=324, logStartOffset=106, maxBytes=1048576, > currentLeaderEpoch=Optional[3]), fps.agg.dist.treas-0=(fetchOffset=0, > logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[8]), > reddyvel-newtp-8-111-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, > currentLeaderEpoch=Optional[18]), reddyvel-94-0=(fetchOffset=3173198, > logStartOffset=3173198, maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-544-0=(fetchOffset=3173192, logStartOffset=3173192, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-newtp-2-72-0=(fetchOffset=8679, logStartOffset=8679, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > fps.guas.useq.eq_stat_arb_useq-0=(fetchOffset=6071629, > logStartOffset=4470145, maxBytes=1048576, currentLeaderEpoch=Optional[4]), > reddyvel-newtp-10-75-0=(fetchOffset=9768, logStartOffset=9768, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > reddyvel-tp8-51-0=(fetchOffset=249873, logStartOffset=249873, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > reddyvel-newtp-8-122-0=(fetchOffset=1, logStartOffset=1, maxBytes=1048576, > currentLeaderEpoch=Optional[18]), fps.seed.trsy.pe_china-0=(fetchOffset=24, > logStartOffset=8, maxBytes=1048576, currentLeaderEpoch=Optional[3]), > fps.seed.trsy.jcas-0=(fetchOffset=93, logStartOffset=31, maxBytes=1048576, > currentLeaderEpoch=Optional[3]), reddyvel-tp8-99-0=(fetchOffset=249871, > logStartOffset=249871, maxBytes=1048576, currentLeaderEpoch=Optional[19]), > reddyvel-643-0=(fetchOffset=3173190, logStartOffset=3173190, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-577-0=(fetchOffset=3173191, logStartOffset=3173191, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-tp6-97-0=(fetchOffset=256307, logStartOffset=256307, > maxBytes=1048576, currentLeaderEpoch=Optional[19]), > reddyvel-newtp-6-72-0=(fetchOffset=7652, logStartOffset=7652, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > reddyvel-959-0=(fetchOffset=3173185, logStartOffset=3173185, > maxBytes=1048576, currentLeaderEpoch=Optional[23]), > reddyvel-newtp-4-70-0=(fetchOffset=8828, logStartOffset=8828, > maxBytes=1048576, currentLeaderEpoch=Optional[18]), > fps.seed.trsy.opt-0=(fetchOffset=114, logStartOffset=38, maxBytes=1048576, > currentLeaderEpoch=Optional[4])}, isolationLevel=READ_UNCOMMITTED, toForget=, > metadata=(sessionId=397037945, epoch=INITIAL)) > (kafka.server.ReplicaFetcherThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:95) > at > kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:192) > at > kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:274) > at > kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132) > at > kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131) > at scala.Option.foreach(Option.scala:257) > at > kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131) > at > kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=0] Built full fetch (sessionId=2039987243, epoch=INITIAL) for node > 5 with 45 partition(s). (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:04,007] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=0] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 > (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:04,008] DEBUG Set SASL client state to > SEND_APIVERSIONS_REQUEST > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:04,008] DEBUG Creating SaslClient: > client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=0] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, > SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) > [2019-02-27 00:00:04,008] DEBUG Set SASL client state to > RECEIVE_APIVERSIONS_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:04,008] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=0] Completed connection to node 5. Ready. > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:04,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 > with 0 partition(s). (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Built full fetch (sessionId=397037945, epoch=INITIAL) for node 5 > with 48 partition(s). (org.apache.kafka.clients.FetchSessionHandler) > [2019-02-27 00:00:05,318] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Initiating connection to node mwkafka-prod-01.tbd.foo.com:9092 > (id: 5 rack: null) using address mwkafka-prod-01.tbd.foo.com/10.236.30.30 > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:05,318] DEBUG Set SASL client state to > SEND_APIVERSIONS_REQUEST > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:05,318] DEBUG Creating SaslClient: > client=kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com;service=kafka;serviceHostname=mwkafka-prod-01.tbd.foo.com;mechs=[GSSAPI] > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 166400, > SO_TIMEOUT = 0 to node 5 (org.apache.kafka.common.network.Selector) > [2019-02-27 00:00:05,319] DEBUG Set SASL client state to > RECEIVE_APIVERSIONS_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 00:00:05,319] DEBUG [ReplicaFetcher replicaId=1, leaderId=5, > fetcherId=1] Completed connection to node 5. Ready. > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 00:00:05,477] DEBUG Set SASL server state to AUTHENTICATE > (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) > [2019-02-27 00:00:05,477] DEBUG [SocketServer brokerId=1] Connection with > /10.236.30.31 disconnected (org.apache.kafka.common.network.Selector) > java.io.EOFException > at > org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:96) > at > org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:237) > at > org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:132) > at > org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532) > at org.apache.kafka.common.network.Selector.poll(Selector.java:467) > at kafka.network.Processor.poll(SocketServer.scala:689) > at kafka.network.Processor.run(SocketServer.scala:594) > at java.base/java.lang.Thread.run(Thread.java:834) > [2019-02-27 00:00:05,477] DEBUG Handling Kafka request SASL_HANDSHAKE > (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) > [2019-02-27 00:00:05,477] DEBUG Using SASL mechanism 'GSSAPI' provided by > client > (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) > [2019-02-27 00:00:05,477] DEBUG Creating SaslServer for > kafka/mwkafka-prod-01.nyc.foo....@unix.foo.com with mechanism GSSAPI > (org.apache.kafka.common.security.authenticator.SaslServerAuthenticator) > [2019-02-27 00:00:06,056] INFO [ReplicaFetcher replicaId=1, leaderId=4, > fetcherId=2] Error sending fetch request (sessionId=373847113, epoch=INITIAL) > to node 4: java.net.SocketTimeoutException: Failed to connect within 30000 > ms. (org.apache.kafka.clients.FetchSessionHandler) > *Lsof output: > *kafka...@mwkafka-prod-01.nyc[toa]:/local/kafka/logs> lsof -P -p 103485 | > grep TCP | grep CLOSE > java 103485 kafkagod 635u IPv4 86522305 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:46014 > (CLOSE_WAIT) > java 103485 kafkagod 639u IPv4 86519040 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45926 > (CLOSE_WAIT) > java 103485 kafkagod 642u IPv4 86519057 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47424 > (CLOSE_WAIT) > java 103485 kafkagod 643u IPv4 86519058 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.dr.foo.com:47428 > (CLOSE_WAIT) > java 103485 kafkagod 683u IPv4 86509505 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57856 > (CLOSE_WAIT) > java 103485 kafkagod 684u IPv4 86522910 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57894 > (CLOSE_WAIT) > java 103485 kafkagod 688u IPv4 86522176 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45966 > (CLOSE_WAIT) > java 103485 kafkagod 690u IPv4 86522306 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35326 > (CLOSE_WAIT) > java 103485 kafkagod 695u IPv4 86522192 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.nyc.foo.com:45968 > (CLOSE_WAIT) > java 103485 kafkagod 696u IPv4 86509516 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.dr.foo.com:39676 > (CLOSE_WAIT) > java 103485 kafkagod 697u IPv4 86522307 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35328 > (CLOSE_WAIT) > java 103485 kafkagod 705u IPv4 86519026 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-01.tbd.foo.com:57838 > (CLOSE_WAIT) > java 103485 kafkagod 726u IPv4 86509517 0t0 TCP > mwkafka-prod-01.nyc.foo.com:9092->mwkafka-prod-02.tbd.foo.com:35258 > (CLOSE_WAIT) > *contoller logs below show that no broker is able connect to each other: > * > [2019-02-27 03:39:16,135] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-02.dr.foo.com:9092 (id: 4 rack: > dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 03:39:20,915] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-02.tbd.foo.com:9092 (id: 6 rack: > tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 03:39:25,839] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-01.dr.foo.com:9092 (id: 3 rack: > dr.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 03:39:31,371] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-01.tbd.foo.com:9092 (id: 5 rack: > tbd.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 03:39:40,440] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-01.nyc.foo.com:9092 (id: 1 rack: > nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2019-02-27 03:39:42,149] WARN [RequestSendThread controllerId=1] Controller > 1's connection to broker mwkafka-prod-02.nyc.foo.com:9092 (id: 2 rack: > nyc.foo.com) was unsuccessful (kafka.controller.RequestSendThread) > java.net.SocketTimeoutException: Failed to connect within 30000 ms > at > kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:280) > at > kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:233) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > *Client consumer logs: consumer is not able to connect to the brokers* > [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] > Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to > node -6 (org.apache.kafka.common.network.Selector) > [2019-02-27 03:37:37,587] DEBUG Set SASL client state to > RECEIVE_APIVERSIONS_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:37:37,587] DEBUG [Consumer clientId=test_con, groupId=chow] > Completed connection to node -6. Fetching API versions. > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:38:02,488] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 > (id: -6 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:38:08,618] DEBUG Set SASL client state to INITIAL > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:38:08,619] DEBUG Set SASL client state to INTERMEDIATE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:38:29,288] DEBUG Set SASL client state to > SEND_HANDSHAKE_REQUEST > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:38:29,288] DEBUG Set SASL client state to > RECEIVE_HANDSHAKE_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:38:32,493] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 > (id: -4 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:39:02,498] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 > (id: -3 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:39:32,501] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-01.nyc.deshaw.com:9092 > (id: -1 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:39:33,050] DEBUG Set SASL client state to > SEND_HANDSHAKE_REQUEST > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:39:33,051] DEBUG Set SASL client state to > RECEIVE_HANDSHAKE_RESPONSE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:39:34,683] DEBUG Set SASL client state to CLIENT_COMPLETE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:39:41,887] DEBUG Set SASL client state to CLIENT_COMPLETE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:40:02,503] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-02.tbd.deshaw.com:9092 > (id: -6 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:40:31,269] DEBUG Set SASL client state to COMPLETE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) > [2019-02-27 03:40:31,269] DEBUG [Consumer clientId=test_con, groupId=chow] > Initiating API versions fetch from node -1. > (org.apache.kafka.clients.NetworkClient) > [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] > Coordinator discovery failed, refreshing metadata > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:40:32,507] DEBUG [Consumer clientId=test_con, groupId=chow] > Sending FindCoordinator request to broker mwkafka-prod-02.dr.deshaw.com:9092 > (id: -4 rack: null) > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > [2019-02-27 03:40:42,438] DEBUG Set SASL client state to COMPLETE > (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator) -- This message was sent by Atlassian JIRA (v7.6.3#76005)