[ https://issues.apache.org/jira/browse/KAFKA-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170824#comment-16170824 ]
Alexey Pervushin commented on KAFKA-5195:
-----------------------------------------

We have the same issue. It happened once. In server.log I see an enormous number of messages like:

{noformat}
[2017-09-02 14:02:19,074] ERROR {ReplicaFetcherThread-0-123} [ReplicaFetcherThread-0-123], Error for partition [partition_name, 0] to broker 123:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{noformat}

Slightly before that I noticed a pretty significant GC collection time spike and a lot of messages about network issues to/from this broker (probably because of the GC freeze), like:

{noformat}
[2017-09-02 14:04:53,550] WARN {main-SendThread(10.69.102.249:2181)} Client session timed out, have not heard from server in 4371ms for sessionid 0x35de55b9852d48b (org.apache.zookeeper.ClientCnxn)
...
[2017-09-02 14:05:28,739] WARN {main-SendThread(10.69.145.152:2181)} Unable to reconnect to ZooKeeper service, session 0x35de55b9852d48b has expired (org.apache.zookeeper.ClientCnxn)
{noformat}

and

{noformat}
[2017-09-02 14:02:18,655] WARN {ReplicaFetcherThread-0-124} [ReplicaFetcherThread-0-124], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@33884d9f (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 124 was disconnected before the response was read
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:114)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:112)
        at scala.Option.foreach(Option.scala:257)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:112)
        at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:108)
        at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
        at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollContinuously$extension(NetworkClientBlockingOps.scala:142)
        at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
        at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:249)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:234)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
{noformat}

We use Kafka 0.10.2.1 plus the backported patch from https://issues.apache.org/jira/browse/KAFKA-5413, with java-8-oracle-1.8.0.92 and G1 as the GC.

Broker configuration:

{noformat}
broker.id=<ID>
log.dirs=<PATH>
zookeeper.connect=...
auto.create.topics.enable=true
connections.max.idle.ms=3600000
default.replication.factor=3
delete.topic.enable=true
group.max.session.timeout.ms=300000
inter.broker.protocol.version=0.10.2.0
log.cleaner.dedupe.buffer.size=536870912
log.cleaner.enable=true
log.message.format.version=0.9.0.1
log.retention.hours=72
log.segment.bytes=268435456
message.max.bytes=1000000
min.insync.replicas=2
num.io.threads=5
offsets.retention.minutes=4320
offsets.topic.segment.bytes=104857600
replica.fetch.max.bytes=10485760
request.timeout.ms=300001
reserved.broker.max.id=2113929216
unclean.leader.election.enable=false
{noformat}
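For what it's worth, the sequence in our logs ("Client session timed out" followed by "session ... has expired") is consistent with a JVM pause longer than the ZooKeeper session timeout: the client misses heartbeats while the broker is frozen, and by the time it reconnects the server has already expired the session. Below is a minimal, illustrative sketch (not part of Kafka; class name, connect string and the 6s timeout are placeholders, not our real values) of a plain ZooKeeper client that just logs session state transitions, so a Disconnected -> Expired pair caused by a long pause is easy to spot:

{noformat}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical diagnostic helper: prints every ZooKeeper session state change.
public class SessionStateLogger implements Watcher {

    @Override
    public void process(WatchedEvent event) {
        // A GC pause longer than the negotiated session timeout typically shows
        // up here as Disconnected, then Expired once the client reconnects.
        System.out.println(System.currentTimeMillis() + " session state: " + event.getState());
    }

    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout (assumptions for the sketch).
        ZooKeeper zk = new ZooKeeper("zk-1:2181,zk-2:2181,zk-3:2181", 6000, new SessionStateLogger());
        // Keep the session alive; pausing this JVM (e.g. SIGSTOP/SIGCONT) for longer
        // than the timeout reproduces the "session expired" path seen in our logs.
        Thread.sleep(Long.MAX_VALUE);
    }
}
{noformat}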
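Since an expired session drops the broker's ephemeral registration znode under /brokers/ids (which matches what the original report below describes for broker 2), a quick way to check whether the affected broker is still registered is to read that path directly. A rough sketch with the plain ZooKeeper Java client follows; again, the class name and connect string are placeholders:

{noformat}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical one-off check: list registered broker ids and dump each registration znode.
public class BrokerRegistrationCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder connect string; use the cluster's zookeeper.connect value.
        ZooKeeper zk = new ZooKeeper("zk-1:2181,zk-2:2181,zk-3:2181", 10000, event -> { });
        try {
            // /brokers/ids has one ephemeral child per live broker; a broker whose
            // session expired disappears from this list until it re-registers.
            List<String> ids = zk.getChildren("/brokers/ids", false);
            System.out.println("Registered broker ids: " + ids);
            for (String id : ids) {
                byte[] data = zk.getData("/brokers/ids/" + id, false, null);
                System.out.println(id + " -> " + new String(data, "UTF-8"));
            }
        } finally {
            zk.close();
        }
    }
}
{noformat}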
> Endless NotLeaderForPartitionException for ReplicaFetcherThread
> ---------------------------------------------------------------
>
> Key: KAFKA-5195
> URL: https://issues.apache.org/jira/browse/KAFKA-5195
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.10.1.1
> Environment: 3 Kafka brokers on top of Kubernetes, using Docker image wurstmeister/kafka:0.10.1.1.
> Environment variables:
> KAFKA_ADVERTISED_HOST_NAME: kafka-ypimp-2
> KAFKA_ADVERTISED_PORT: 9092
> KAFKA_ZOOKEEPER_CONNECT: zookeeper-ypimp-0:2181,zookeeper-ypimp-1:2181,zookeeper-ypimp-2:2181
> KAFKA_DELETE_TOPIC_ENABLE: true
> KAFKA_BROKER_ID: 2
> JMX_PORT: 1099
> KAFKA_JMX_OPTS: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=kafka-ypimp-2.default.svc.cluster.local -Dcom.sun.management.jmxremote.rmi.port=1099
> KAFKA_LOG_RETENTION_HOURS: 96
> KAFKA_AUTO_CREATE_TOPICS_ENABLE: false
> Zookeeper version: 3.4.8.
> Number of Zk nodes: 3.
> Reporter: Andrea Gardiman
>
> One of the 3 brokers is suddenly in a bad state. It endlessly prints out the following message, for every partition:
> [2017-05-08 13:51:16,748] ERROR [ReplicaFetcherThread-0-0], Error for partition [partition_name,5] to broker 0:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
> In ZooKeeper, under /brokers/ids, I can't find the zkNode for broker 2. There are only the zkNodes 0 and 1.
> What kind of error can this be?
> Please let me know if you need some more information; I don't know how to properly debug it.
> Many thanks.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)