[ 
https://issues.apache.org/jira/browse/KAFKA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Lazam updated KAFKA-7876:
---------------------------------
    Description: 
 

We have a 3-node cluster. In some scenarios one of the brokers suddenly gets disconnected from the cluster, but no underlying system issue can be found. The disconnected node does not release leadership of the partitions it holds, so clients (Spring Boot) are unable to send/receive data through the affected broker.

We noticed that it always happens on the broker that is currently the active controller (ActiveControllerCount = 1).
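For reference, here is a minimal sketch of how we confirm which broker is the active controller, assuming remote JMX is enabled on each broker (the host and port below are placeholders, not values from our actual setup). It reads the standard kafka.controller:type=KafkaController,name=ActiveControllerCount gauge, which is 1 only on the broker currently holding the controller role:
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch only: prints ActiveControllerCount for one broker over remote JMX.
// Assumes the broker process exposes a JMX port (9999 is a placeholder).
public class ActiveControllerCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "node2"; // broker host (placeholder)
        String port = args.length > 1 ? args[1] : "9999";  // JMX port (placeholder)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                    "kafka.controller:type=KafkaController,name=ActiveControllerCount");
            // Value is 1 on the active controller, 0 on the other brokers.
            System.out.println(host + " ActiveControllerCount=" + mbsc.getAttribute(gauge, "Value"));
        } finally {
            connector.close();
        }
    }
}
{code}
Running this against each of the three brokers, the broker that stops serving clients is, in our case, always the one reporting 1.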

Environment details:

Provider: AWS
Kernel: 3.10.0-693.21.1.el7.x86_64
OS: CentOS Linux release 7.5.1804 (Core)
Scala version: 2.11
Kafka version: 2.1.0

Kafka config:
{code:java}
############################# Socket Server Settings #############################

num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

############################# Log Basics #############################

num.partitions=1
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings #############################

offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2

############################# Log Retention Policy #############################
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000


############################# Group Coordinator Settings #############################
group.initial.rebalance.delay.ms=0

############################# Zookeeper #############################

zookeeper.connection.timeout.ms=6000

broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
log.dirs=/data/kafka-node
advertised.listeners=PLAINTEXT://node1:9092
{code}
Controller log from the disconnected broker (broker 2, the active controller):
{code:java}
[2019-01-26 05:03:52,512] TRACE [Controller id=2] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] DEBUG [Controller id=2] Preferred replicas by broker Map(TOPICS->MAP) (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 2 Map() (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 2 is 0.0 (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 1 Map() (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] DEBUG [Controller id=2] Topics not in preferred replica for broker 3 Map() (kafka.controller.KafkaController)
[2019-01-26 05:03:52,513] TRACE [Controller id=2] Leader imbalance ratio for broker 3 is 0.0 (kafka.controller.KafkaController)
[2019-01-26 05:08:52,513] TRACE [Controller id=2] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)

{code}
server.log from a working broker (broker 3):
{code:java}
[2019-01-26 05:02:05,564] INFO [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1637095899, epoch=21379644) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2019-01-26 05:02:05,573] WARN [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={PlayerGameRounds-8=(offset=2171960, logStartOffset=1483356, maxBytes=1048576, currentLeaderEpoch=Optional[2])}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1637095899, epoch=21379644)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 2 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
        at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
        at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
        at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2019-01-26 05:02:35,723] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59988-1550 (kafka.network.Processor)
[2019-01-26 05:02:35,731] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59986-1550 (kafka.network.Processor)
[2019-01-26 05:02:35,797] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:59494-1549 (kafka.network.Processor)
[2019-01-26 05:02:35,816] WARN Attempting to send response via channel for which there is no open connection, connection id node3:9092-node2:53268-1530 (kafka.network.Processor)
[2019-01-26 05:02:37,603] INFO [ReplicaFetcher replicaId=3, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1637095899, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)

{code}
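To show what the clients see during the incident, here is a minimal client-side probe, a sketch only: the bootstrap address points at the affected broker, and the topic name and timeouts are placeholders. It attempts a single send with short timeouts so the failure surfaces quickly instead of blocking:
{code:java}
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: one send routed through the affected broker with short timeouts.
// "test-topic" is a placeholder; in practice we target a partition led by the stuck broker.
public class BrokerProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "node2:9092"); // affected broker (placeholder)
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 5000); // fail fast on a stuck broker
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000);      // bound metadata fetches too

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"))
                    .get(15, TimeUnit.SECONDS);
            System.out.println("send succeeded");
        } catch (Exception e) {
            System.out.println("send failed: " + e); // this is the path the clients end up on
        } finally {
            producer.close();
        }
    }
}
{code}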
 

The request handler idle metrics dropped during this issue:

  !kafka-issue.png!  

 


> Broker suddenly got disconnected 
> ---------------------------------
>
>                 Key: KAFKA-7876
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7876
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, network
>    Affects Versions: 2.1.0
>            Reporter: Anthony Lazam
>            Priority: Critical
>         Attachments: kafka-issue.png
>
>


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
