Re: Kafka 10 Stability Issue

2017-01-20 Thread Jason Gustafson
Hi there,

This sounds similar to https://issues.apache.org/jira/browse/KAFKA-4477.
Have you tried 0.10.1.1?

-Jason

On Fri, Jan 20, 2017 at 5:27 PM, Hui Yang wrote:

> Hi Kafka Team,
>
> This is Hui Yang from the Expedia engineering team, and I have a question
> about an issue with Kafka 0.10.
> Our team uses Kafka as core infrastructure. We recently upgraded from
> Kafka 0.8.2.2 to Kafka 0.10.1.0 and ran into a problem after the upgrade.
>
> The issue is as follows:
> Kafka 0.10 worked well for a couple of days after the upgrade, but then we
> started to see "java.io.IOException: Connection to 3 was disconnected
> before the response was read" on every Kafka broker when it tried to
> communicate with the controller (one of the Kafka brokers acts as the
> controller and handles topic/partition assignment and state changes; in
> our case, it is broker 3).
> Even in the controller log, I found "[Controller-3-to-broker-3-send-thread],
> Controller 3 epoch 3 fails to send request,java.io.IOException: Connection
> to 3 was disconnected before the response was read", so it looks like the
> controller cannot even send requests to itself.
> After seeing those exceptions on the brokers for a while, we started to
> see timeout exceptions on the producer side: our producers were unable to
> send messages to the brokers.
>
> When I checked the JMX metrics, I found that CPU usage on the controller
> has been consistently higher than on the other brokers since we upgraded
> to Kafka 0.10 (on Kafka 0.8, all brokers had similar CPU usage), and that
> memory usage spiked specifically on the controller during the issue. I
> suspect the controller may not have had enough memory left to create new
> connections for the producers and the other brokers.
>
> One more thing to mention: we use the Kafka 0.8 protocol and message
> format on the Kafka 0.10 brokers so that we can still use 0.8 clients.
>
> Details for the exception:
> "WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@87d8e00 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 3 was disconnected before the response was read
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
> at scala.Option.foreach(Option.scala:257)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
> at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
> at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
> at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
> at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
> at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
>
> "WARN [Controller-3-to-broker-3-send-thread], Controller 3 epoch 1 fails to send request
> java.io.IOException: Connection to 2 was disconnected before the response was read
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
> at scala.Option.foreach(Option.scala:257)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
> at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
> at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
> at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
> at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
> at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
> at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
>
> In production, we build 6 

Kafka 10 Stability Issue

2017-01-20 Thread Hui Yang
Hi Kafka Team,

This is Hui Yang from the Expedia engineering team, and I have a question
about an issue with Kafka 0.10.
Our team uses Kafka as core infrastructure. We recently upgraded from Kafka
0.8.2.2 to Kafka 0.10.1.0 and ran into a problem after the upgrade.

The issue is as follows:
Kafka 0.10 worked well for a couple of days after the upgrade, but then we
started to see "java.io.IOException: Connection to 3 was disconnected before
the response was read" on every Kafka broker when it tried to communicate with
the controller (one of the Kafka brokers acts as the controller and handles
topic/partition assignment and state changes; in our case, it is broker 3).
Even in the controller log, I found "[Controller-3-to-broker-3-send-thread],
Controller 3 epoch 3 fails to send request,java.io.IOException: Connection to 3
was disconnected before the response was read", so it looks like the controller
cannot even send requests to itself.
After seeing those exceptions on the brokers for a while, we started to see
timeout exceptions on the producer side: our producers were unable to send
messages to the brokers.

When I checked the JMX metrics, I found that CPU usage on the controller has
been consistently higher than on the other brokers since we upgraded to Kafka
0.10 (on Kafka 0.8, all brokers had similar CPU usage), and that memory usage
spiked specifically on the controller during the issue. I suspect the
controller may not have had enough memory left to create new connections for
the producers and the other brokers.

One more thing to mention: we use the Kafka 0.8 protocol and message format on
the Kafka 0.10 brokers so that we can still use 0.8 clients.
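
For readers following along, here is a minimal sketch of how one might confirm which protocol and message format versions a broker is pinned to. The property names `inter.broker.protocol.version` and `log.message.format.version` are the broker settings normally used for this during a rolling upgrade; the sample values and both helper functions are assumptions for illustration, not the actual configuration from this cluster.

```python
# Hypothetical excerpt of a broker's server.properties. The values are
# assumed for illustration; the message only says "0.8 protocol and format".
SAMPLE = """
# pinned during the rolling upgrade
inter.broker.protocol.version=0.8.2
log.message.format.version=0.8.2
"""

PINNED_KEYS = ("inter.broker.protocol.version", "log.message.format.version")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def pinned_versions(props):
    """Return the two version settings, or None where a key is absent."""
    return {key: props.get(key) for key in PINNED_KEYS}


for key, value in pinned_versions(parse_properties(SAMPLE)).items():
    print(f"{key} = {value}")
```

A value of None for either key would mean the broker falls back to the default for its own release rather than the older pinned version.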

Details for the exception:
"WARN [ReplicaFetcherThread-0-3], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@87d8e00 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:253)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"

"WARN [Controller-3-to-broker-3-send-thread], Controller 3 epoch 1 fails to send request
java.io.IOException: Connection to 2 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:115)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:137)
at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:143)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:190)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)"
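
To see which broker the failed connections point at, and how often, the exception text above can be tallied across a broker's log. The following is a rough sketch: the regular expression is copied from the exception message quoted in this thread, while the function name and the sample lines are made up for illustration. It assumes each exception message sits on a single log line.

```python
import re
from collections import Counter

# Pattern taken from the exception message quoted in this thread.
DISCONNECT = re.compile(
    r"Connection to (\d+) was disconnected before the response was read"
)


def count_disconnects(lines):
    """Count disconnect errors per target broker id."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative input; in practice, pass an open log file instead of a list.
sample = [
    "java.io.IOException: Connection to 3 was disconnected before the response was read",
    "java.io.IOException: Connection to 2 was disconnected before the response was read",
    "java.io.IOException: Connection to 3 was disconnected before the response was read",
    "at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)",
]

for broker_id, count in sorted(count_disconnects(sample).items()):
    print(f"broker {broker_id}: {count} disconnects")
```

Running the same tally on each broker's log would show whether the errors cluster on the controller (broker 3 here) or are spread across the cluster.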

In production, we run 6 Kafka brokers with 3 ZooKeeper nodes on AWS, using
c3.xlarge instances.
Our JVM settings are as follows: -Xmx1G -Xms1G -server -XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+CMSScavengeBeforeRemark.
Our traffic is 500 TPS, and the average message size is 100 KB.
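
As a back-of-the-envelope check on these figures, the stated rate and message size can be turned into a byte rate. This sketch assumes that "TPS" means messages per second, that 1 KB is 1000 bytes, and that partition leaders are spread evenly across the 6 brokers; none of these is stated in the message, and replication traffic is left out because the replication factor is not given.

```python
# Figures stated in the message.
MESSAGES_PER_SECOND = 500
AVG_MESSAGE_KB = 100
BROKERS = 6


def ingress_mb_per_second(rate, size_kb):
    """Producer ingress in MB/s, assuming 1 MB = 1000 KB."""
    return rate * size_kb / 1000


total = ingress_mb_per_second(MESSAGES_PER_SECOND, AVG_MESSAGE_KB)
# Assumes partition leaders are evenly distributed over the brokers.
per_broker = total / BROKERS

print(f"cluster ingress: {total:.1f} MB/s")
print(f"per broker:      {per_broker:.1f} MB/s before replication")
```

Under those assumptions the cluster takes in about 50 MB/s, or roughly 8.3 MB/s per broker before replication, which is the load the 1 GB heap above has to handle.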

I appreciate you taking the time to give