Re: Getting NotLeaderForPartitionException for a very long time
Thanks Manoj for the comment. When I say it is taking a long time, it is really long, i.e. more than a day! That's why I am not able to understand the behavior.

Regards,
Gopal

On Tue, Apr 28, 2020 at 2:22 PM wrote:
> Followers take some time to become the leader when the leader goes down. You
> can build retry logic around this to handle the situation.
>
> On 4/28/20, 1:08 AM, "M.Gopala Krishnan" wrote:
>
> [External]
>
> Hi,
>
> I have a 3-node Kafka cluster (replication factor: 3). Suddenly one of the
> nodes in the cluster went down, and I started seeing
> NotLeaderForPartitionException in my application logs when sending messages
> to one of the topics; for some of the other topics I am still able to post
> and consume messages.
>
> The problem lasts until all the Kafka servers are restarted; after the
> restart everything is OK.
>
> Now, my question is: why is a new leader not elected for those topics,
> instead of the same NotLeaderForPartitionException being thrown again and
> again, and how can I make leader election happen for these topics?
>
> *Exception Trace:*
>
> 2020-04-11 22:05:21,747 ERROR [pool-15-thread-297] [KafkaMessageProducer:92] Message send failed:
> java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
> at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
>
> Regards,
> Gopal
Re: Getting NotLeaderForPartitionException for a very long time
Followers take some time to become the leader when the leader goes down. You can build retry logic around this to handle the situation.

On 4/28/20, 1:08 AM, "M.Gopala Krishnan" wrote:

[External]

Hi,

I have a 3-node Kafka cluster (replication factor: 3). Suddenly one of the nodes in the cluster went down, and I started seeing NotLeaderForPartitionException in my application logs when sending messages to one of the topics; for some of the other topics I am still able to post and consume messages.

The problem lasts until all the Kafka servers are restarted; after the restart everything is OK.

Now, my question is: why is a new leader not elected for those topics, instead of the same NotLeaderForPartitionException being thrown again and again, and how can I make leader election happen for these topics?

*Exception Trace:*

2020-04-11 22:05:21,747 ERROR [pool-15-thread-297] [KafkaMessageProducer:92] Message send failed:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)

Regards,
Gopal
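A minimal sketch of the retry approach suggested above, using the modern Java producer; the broker addresses, topic name, and retry values are illustrative placeholders, not details from this thread:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker addresses; substitute your own.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // NotLeaderForPartitionException is a retriable error: on each retry
        // the producer refreshes its metadata, so it finds the new leader
        // once one has been elected.
        props.put(ProducerConfig.RETRIES_CONFIG, 10);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // All retries exhausted: decide whether to log, park, or resend.
                            System.err.println("send failed: " + exception);
                        }
                    });
        }
    }
}

Retries only cover the short window while metadata is stale; a failure lasting more than a day, as described in this thread, points at the cluster failing to elect a leader rather than at client settings.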
Getting NotLeaderForPartitionException for a very long time
Hi,

I have a 3-node Kafka cluster (replication factor: 3). Suddenly one of the nodes in the cluster went down, and I started seeing NotLeaderForPartitionException in my application logs when sending messages to one of the topics; for some of the other topics I am still able to post and consume messages.

The problem lasts until all the Kafka servers are restarted; after the restart everything is OK.

Now, my question is: why is a new leader not elected for those topics, instead of the same NotLeaderForPartitionException being thrown again and again, and how can I make leader election happen for these topics?

*Exception Trace:*

2020-04-11 22:05:21,747 ERROR [pool-15-thread-297] [KafkaMessageProducer:92] Message send failed:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)

Regards,
Gopal
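For completeness, newer clients can ask the controller to re-run preferred-leader election instead of restarting brokers. A hedged sketch with the Java admin client (electLeaders requires Kafka 2.4 or later; older clusters ship a kafka-preferred-replica-election.sh tool instead); the topic, partition, and address below are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class ElectLeaderSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the controller to re-elect the preferred leader for the
            // affected partition.
            admin.electLeaders(ElectionType.PREFERRED,
                               Collections.singleton(new TopicPartition("my-topic", 0)))
                 .partitions().get();
        }
    }
}

Note this only helps when a preferred replica is in sync; if no leader can be elected at all, the controller and broker logs are the place to look.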
UnknownTopicOrPartitionException & NotLeaderForPartitionException upon replication of topics.
Hi Team,

We are running kafka and zookeeper in a kubernetes cluster.
Number of brokers: 3
Number of partitions per topic: 3

We are creating a topic with 3 partitions, and it looks like all the partitions are up. Below is a snapshot to confirm that:

Topic:applestore PartitionCount:3 ReplicationFactor:3 Configs:
Topic: applestore Partition: 0 Leader: 1001 Replicas: 1001,1003,1002 Isr: 1001,1003,1002
Topic: applestore Partition: 1 Leader: 1002 Replicas: 1002,1001,1003 Isr: 1002,1001,1003
Topic: applestore Partition: 2 Leader: 1003 Replicas: 1003,1002,1001 Isr: 1003,1002,1001

But as soon as the topics are created, the stack traces below appear on the brokers.

Error 1:
[2018-05-28 08:00:31,875] ERROR [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=7] Error for partition applestore-2 to broker 1003:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)

Error 2:
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, fetcherId=0] Error for partition apple-6 to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

When we try producing records to each specific partition it works fine, and the log size across the replicated brokers appears to be equal, which suggests replication is happening fine. Attaching the two stack trace files.

Why are these stack traces appearing? Can we ignore them if they are spurious messages?

Best regards,
kaushik

[2018-05-28 00:39:46,195] TRACE [Broker id=1003] Cached leader info PartitionState(controllerEpoch=1, leader=1001, leaderEpoch=5, isr=[1001, 1002, 1003], zkVersion=8, replicas=[1003, 1001, 1002], offlineReplicas=[]) for partition steve-0 in response to UpdateMetadata request sent by controller 1001 epoch 1 with correlation id 6 (state.change.logger)
[2018-05-28 00:39:46,195] TRACE [Broker id=1003] Cached leader info PartitionState(controllerEpoch=1, leader=1001, leaderEpoch=3, isr=[1001, 1002, 1003], zkVersion=6, replicas=[1001, 1002, 1003], offlineReplicas=[]) for partition apple-7 in response to UpdateMetadata request sent by controller 1001 epoch 1 with correlation id 6 (state.change.logger)
[2018-05-28 00:43:20,093] TRACE [Broker id=1003] Received LeaderAndIsr request PartitionState(controllerEpoch=1, leader=1003, leaderEpoch=6, isr=1001,1002,1003, zkVersion=9, replicas=1003,1001,1002, isNew=false) correlation id 7 from controller 1001 epoch 1 for partition apple-0 (state.change.logger)
[2018-05-28 00:43:20,093] TRACE [Broker id=1003] Handling LeaderAndIsr request correlationId 7 from controller 1001 epoch 1 starting the become-leader transition for partition apple-0 (state.change.logger)
[2018-05-28 00:43:20,193] INFO [ReplicaFetcherManager on broker 1003] Removed fetcher for partitions apple-0 (kafka.server.ReplicaFetcherManager)
[2018-05-28 00:43:20,193] INFO [Partition apple-0 broker=1003] apple-0 starts at Leader Epoch 6 from offset 0. Previous Leader Epoch was: 5 (kafka.cluster.Partition)
[2018-05-28 00:43:20,193] TRACE [Broker id=1003] Stopped fetchers as part of become-leader request from controller 1001 epoch 1 with correlation id 7 for partition apple-0 (last update controller epoch 1) (state.change.logger)
[2018-05-28 00:43:20,194] TRACE [Broker id=1003] Completed LeaderAndIsr request correlationId 7 from controller 1001 epoch 1 for the become-leader transition for partition apple-0 (state.change.logger)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, fetcherId=0] Error for partition apple-6 to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, fetcherId=0] Error for partition steve-0 to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:20,993] TRACE [Broker id=1003] Cached leader info PartitionState(controllerEpoch=1, leader=1003, leaderEpoch=6, isr=[1001, 1002, 1003], zkVersion=9, replicas=[1003, 1001, 1002], offlineReplicas=[]) for partition apple-0 in response to UpdateMetadata request sent by controller 1001 epoch 1 with correlation id 8 (state.change.logger)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, fetcherId=0] Error for partition mango2-0 to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:21,268] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, fetcherId=0] Error for partition
NotLeaderForPartitionException while restarting kafka brokers
Hello,

I have a Kafka Streams application that consumes from two topics and internally aggregates, transforms, and joins data. Recently I was doing a rolling restart of the production Kafka brokers, and the following errors appeared in my Kafka Streams application:

2018-04-12 10:39:13,097 [WARN ] (1-producer) org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=-f063f0c2-e05f-443a-8ac2-e1298b1a274c-StreamThread-1-producer] Got error produce response with correlation id 2892774 on topic-partition -KSTREAM-KEY-SELECT-14-repartition-2, retrying (9 attempts left). Error: NOT_LEADER_FOR_PARTITION

After the 9 attempts the following error was logged:

2018-04-11 10:39:14,125 [ERROR] (1-producer) org.apache.kafka.streams.processor.internals.RecordCollectorImpl - task [0_2] Error sending record (key value timestamp 1523522324694) to topic -KSTREAM-KEY-SELECT-14-repartition due to {}; No more records will be sent and no more offsets will be recorded for this task. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

Followed by:

2018-04-12 10:39:14,140 [ERROR] (amThread-1) org.apache.kafka.streams.processor.internals.ProcessorStateManager - task [0_2] Failed to flush state store KSTREAM-AGGREGATE-STATE-STORE-03: org.apache.kafka.streams.errors.StreamsException: task [0_2] Abort sending since an error caught with a previous record (key value timestamp 1523522324694) to topic -KSTREAM-KEY-SELECT-14-repartition due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239) ~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163) ~[arb-arpa.jar:1.4.0.0]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

The application died, stopped processing any new messages, and had to be restarted. In production there are 5 brokers and the replication factor is 3. The version of Kafka Streams that I use is 1.0.1, and the Kafka version of the brokers is 1.0.1.

My question is whether this is expected behavior, and is there any way to deal with it?

Regards,
Mihaela Stoycheva
NotLeaderForPartitionException in 0.8.x
Hi All,

I understand that we get a NotLeaderForPartitionException when the client tries to publish a message to a server which is not the leader for a given partition. I also understand that once leadership changes, the client will learn that and start publishing messages to the right server.

I am, however, seeing that sometimes, even after leadership has changed, the client continues to send messages to the wrong server, resulting in it continuously getting the NotLeaderForPartitionException. I am not sure how to reproduce this problem because we have only seen it sporadically in certain production deployments.

When we get into this situation, one step which usually works is to restart zookeeper, kafka, and the client. Obviously this is not acceptable. Is there a workaround that I could implement which would help remediate this situation? For example, would re-creating the KafkaProducer help in this scenario?

Thanks,
-Moiz
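One possible mitigation along the lines asked about, sketched against the newer Java producer API (0.8.2+). Whether this actually clears the stale-metadata state described here is an assumption, and the address and threshold are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerRebuildSketch {
    private KafkaProducer<String, String> producer = newProducer();

    private static KafkaProducer<String, String> newProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Force a metadata refresh at least every 30 s even without errors,
        // so stale leader information cannot persist indefinitely.
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 30_000);
        return new KafkaProducer<>(props);
    }

    // Last-resort remediation: drop all cached state and re-bootstrap,
    // which is the "re-create the KafkaProducer" idea from the question.
    public synchronized void rebuild() {
        producer.close();
        producer = newProducer();
    }
}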
Re: Getting NotLeaderForPartitionException in kafka broker
Following this. 宋鑫/NemoVisky
Re: NotLeaderForPartitionException while doing repartitioning
I had this issue lately, on broker 9. Check what errors you get from the state-change log:

kafka-run-class kafka.tools.StateChangeLogMerger --logs /var/log/kafka/state-change.log --topic __consumer_offsets

If it complains about connections, it may be that this broker does not read data from zookeeper. Check zookeeper for this topic (the number of partitions). If it is there correctly, restart the broker.

On 12/21/16, 11:49 PM, "Stephane Maarek" wrote:

Hi,

I’m doing a repartitioning from brokers 4, 5, 6 to brokers 7, 8, 9. I’m getting a LOT of the following errors (for all topics):

[2016-12-22 04:47:21,957] ERROR [ReplicaFetcherThread-0-9], Error for partition [__consumer_offsets,29] to broker 9:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

1) Is that bad?
2) Why is it happening? I’m just running the normal reassign-partitions tool.

I’m getting this with kafka 0.10.1.0.

Regards,
Stephane
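To cross-check what the brokers themselves report for leadership, here is a hedged sketch using the Java admin client (AdminClient exists from 0.11, so on the 0.10.1.0 cluster in this thread the equivalent check is kafka-topics.sh --describe); the bootstrap address is a placeholder:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeLeadersSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("__consumer_offsets"))
                                         .all().get().get("__consumer_offsets");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Compare this live leader/ISR view against what the fetcher errors claim.
                System.out.printf("partition=%d leader=%s isr=%s%n",
                                  p.partition(), p.leader(), p.isr());
            }
        }
    }
}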
NotLeaderForPartitionException while doing repartitioning
Hi,

I’m doing a repartitioning from brokers 4, 5, 6 to brokers 7, 8, 9. I’m getting a LOT of the following errors (for all topics):

[2016-12-22 04:47:21,957] ERROR [ReplicaFetcherThread-0-9], Error for partition [__consumer_offsets,29] to broker 9:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

1) Is that bad?
2) Why is it happening? I’m just running the normal reassign-partitions tool.

I’m getting this with kafka 0.10.1.0.

Regards,
Stephane
Re: The connection between kafka and zookeeper is often closed by zookeeper, lead to NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Xiaoyuan,

I am not an expert in ZK, so here is what I can tell:

1. "NotLeaderForPartitionException" is not usually thrown when the ZK connection times out; it is thrown when a produce request has arrived at a broker but the broker no longer considers itself the leader for the requested partitions. So it usually indicates that the partition leaders have migrated.

2. It is generally suggested not to co-locate Kafka brokers and Zookeepers on the same machines, since both are memory- and IO-heavy applications; this can lead to long GCs, which in turn can cause ZK or Kafka broker "stalls". See this for more details: https://kafka.apache.org/documentation/#zk

3. Without the version information of the brokers and clients I cannot tell further what the issue could be.

Guozhang

On Thu, Dec 15, 2016 at 5:17 PM, Xiaoyuan Chen <253441...@qq.com> wrote:
> Any solution?
>
> -- Original Message --
> From: "Xiaoyuan Chen" <253441...@qq.com>
> Sent: Friday, December 9, 2016, 10:15 AM
> To: "users"
> Subject: The connection between kafka and zookeeper is often closed by zookeeper, lead to NotLeaderForPartitionException: This server is not the leader for that topic-partition.
>
> Hi guys,
>
> Situation:
> 3 nodes, each with 32 GB memory, 24 CPU cores, 1 TB disk.
> 3 brokers on 3 nodes, and 3 zookeepers on these 3 nodes too; all properties are default. Start the zookeeper cluster and the kafka cluster, then create a topic (3 replicas, 6 partitions), like below:
> bin/kafka-topics.sh --create --zookeeper hw168:2181 --replication-factor 3 --partitions 6 --topic test
> And run the ProducerPerformance tool shipped with kafka on two of the nodes at the same time, i.e. two producers, with a command like:
> bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic test --num-records 1 --record-size 100 --throughput -1 --producer-props bootstrap.servers=hw168:9092 buffer.memory=67108864 batch.size=65536 acks=1
>
> Problem:
> On the producer side we see a lot of NotLeaderForPartitionException:
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> (repeated many times)
>
> Tracking the process (with DEBUG logging), there is an INFO:
> INFO Client session timed out, have not heard from server in 11647ms for sessionid 0x258de4a26a4, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>
> And we found that the connection between the zkClient (held by kafka) and the zookeeper server was closed by the zookeeper server; the reason is a timeout. In detail:
> [2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1: Skipping update high watermark since Old hw 15986847 [8012779 : 1068525112] is larger than new hw 15986847 [8012779 : 1068525112] for partition [test,5]. All leo's are 16566175 [16025299 : 72477384],15986847 [8012779 : 1068525112],16103549 [16025299 : 10485500] (kafka.cluster.Partition)
> [2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 => 72475508 to 16025299.index. (kafka.log.OffsetIndex)
> [2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request key test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
> [2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1: Skipping update high watermark since Old hw 16064424 [16025299 : 5242750] is larger than new hw 16064424 [16025299 : 5242750] for partition [test,2]. All leo's are 16566175 [16025299 : 72477384],16205274 [16025299 : 24116650],16064424 [16025299 : 5242750] (kafka.cluster.Partition)
> [2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce to local log in 10821 ms (kafka.server.ReplicaManager)
> [2016-12-08 20:24:11,369] INFO Client session timed out, have not heard from server in 11647ms for sessionid 0x258de4a26a4, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>
> Note the times: there is no DEBUG between 20:24:00,547 and 20:24:11,368, which already exceeds the session timeout (6000 ms), so it causes this disconnection. We kept digging and found that it got stuck in the call:
> selector.select(waitTimeOut); — in the method doTransport(…) in class org.apache.zookeeper.ClientCnxnSocketNIO
> and that is the period during which we got no DEBUG output.
Aw: Re: NotLeaderForPartitionException
Hi,

thank you for the nice clarification. This is also described indirectly here, which I had not found before:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Partitioningandbootstrapping

The log level of this message could perhaps be reduced to WARN, given that it is quite common and usually recovers.

Kind Regards,
Sven

Sent: Tuesday, 6 December 2016 at 21:47
From: "Apurva Mehta"
To: users@kafka.apache.org
Subject: Re: NotLeaderForPartitionException

Hi Sven,

You will see this exception during leader election. When the leader for a partition moves to another broker, there is a period during which the replicas would still connect to the original leader, at which point they will raise this exception. This should be a very short period, after which they will connect to and replicate from the new leader correctly.

This is not a fatal error, and you will see it if you are bouncing brokers (since all the leaders on that broker will have to move after the bounce). You may also see it if some brokers have connectivity issues: they may be considered dead, and their partitions would be moved elsewhere.

Hope this helps,
Apurva

On Tue, Dec 6, 2016 at 10:06 AM, Sven Ludwig wrote:
> Hello,
>
> in our Kafka clusters we sometimes observe a specific ERROR log statement,
> and therefore we have doubts whether the cluster is running stable in our
> configuration. This occurs every now and then, like two or three times a
> day. It is actually the predominant ERROR log statement in our cluster.
> Example:
>
> [2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for partition [,] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
>
> We already asked Google, but we did not find sufficient answers to our
> questions, therefore I am asking on the mailing list:
>
> 1. What are the possible reasons for this particular error?
>
> 2. What are the implications of it?
>
> 3. What can be done to prevent it?
>
> Best Regards,
> Sven
Re: The connection between kafka and zookeeper is often closed by zookeeper, lead to NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Any solution?

-- Original Message --
From: "Xiaoyuan Chen" <253441...@qq.com>
Sent: Friday, December 9, 2016, 10:15 AM
To: "users"
Subject: The connection between kafka and zookeeper is often closed by zookeeper, lead to NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The connection between kafka and zookeeper is often closed by zookeeper, lead to NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Hi guys,

Situation:
3 nodes, each with 32 GB memory, 24 CPU cores, 1 TB disk.
3 brokers on 3 nodes, and 3 zookeepers on these 3 nodes too; all properties are default. Start the zookeeper cluster and the kafka cluster, then create a topic (3 replicas, 6 partitions), like below:

bin/kafka-topics.sh --create --zookeeper hw168:2181 --replication-factor 3 --partitions 6 --topic test

And run the ProducerPerformance tool shipped with kafka on two of the nodes at the same time, i.e. two producers, with a command like:

bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic test --num-records 1 --record-size 100 --throughput -1 --producer-props bootstrap.servers=hw168:9092 buffer.memory=67108864 batch.size=65536 acks=1

Problem:
On the producer side we see a lot of NotLeaderForPartitionException:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
…

Tracking the process (with DEBUG logging), there is an INFO:

INFO Client session timed out, have not heard from server in 11647ms for sessionid 0x258de4a26a4, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

And we found that the connection between the zkClient (held by kafka) and the zookeeper server was closed by the zookeeper server; the reason is a timeout. In detail:

[2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1: Skipping update high watermark since Old hw 15986847 [8012779 : 1068525112] is larger than new hw 15986847 [8012779 : 1068525112] for partition [test,5]. All leo's are 16566175 [16025299 : 72477384],15986847 [8012779 : 1068525112],16103549 [16025299 : 10485500] (kafka.cluster.Partition)
[2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 => 72475508 to 16025299.index. (kafka.log.OffsetIndex)
[2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request key test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1: Skipping update high watermark since Old hw 16064424 [16025299 : 5242750] is larger than new hw 16064424 [16025299 : 5242750] for partition [test,2]. All leo's are 16566175 [16025299 : 72477384],16205274 [16025299 : 24116650],16064424 [16025299 : 5242750] (kafka.cluster.Partition)
[2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce to local log in 10821 ms (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,369] INFO Client session timed out, have not heard from server in 11647ms for sessionid 0x258de4a26a4, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

Note the times: there is no DEBUG between 20:24:00,547 and 20:24:11,368, which already exceeds the session timeout (6000 ms), so it causes this disconnection. We kept digging and found that it got stuck in the call:

selector.select(waitTimeOut); — in the method doTransport(…) in class org.apache.zookeeper.ClientCnxnSocketNIO

and that is the time we got no DEBUG.

For more details, the call procedure (zookeeper client) is:
org.apache.zookeeper.ClientCnxn -> run() -> doTransport(..)

In the function run(), each iteration checks whether there is a timeout; if not, it runs doTransport. But doTransport took about 10 s, so on the next loop it found the timeout.

Keep going: I thought there could be a deadlock at that time, so I kept printing the jstack of kafka and zookeeper, using a shell loop like below:

while true; do echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 12165` >> zkjstack; echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 12425` >> kafkajstack; done

And when I checked the period where we got NO DEBUG in the stack logs: surprise, there is also NO LOG at that time! Why?

So I'm confused why it got stuck in that function, and why there is no DEBUG or LOG in that weird period. Please help me.

Thank you all.
Xiaoyuan
Re: NotLeaderForPartitionException
Hi Sven,

You will see this exception during leader election. When the leader for a partition moves to another broker, there is a period during which the replicas would still connect to the original leader, at which point they will raise this exception. This should be a very short period, after which they will connect to and replicate from the new leader correctly.

This is not a fatal error, and you will see it if you are bouncing brokers (since all the leaders on that broker will have to move after the bounce). You may also see it if some brokers have connectivity issues: they may be considered dead, and their partitions would be moved elsewhere.

Hope this helps,
Apurva

On Tue, Dec 6, 2016 at 10:06 AM, Sven Ludwig wrote:
> Hello,
>
> in our Kafka clusters we sometimes observe a specific ERROR log statement,
> and therefore we have doubts whether the cluster is running stable in our
> configuration. This occurs every now and then, like two or three times a
> day. It is actually the predominant ERROR log statement in our cluster.
> Example:
>
> [2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for partition [,] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
>
> We already asked Google, but we did not find sufficient answers to our
> questions, therefore I am asking on the mailing list:
>
> 1. What are the possible reasons for this particular error?
>
> 2. What are the implications of it?
>
> 3. What can be done to prevent it?
>
> Best Regards,
> Sven
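Since the exception is retriable by design, client code can distinguish it from fatal errors in the producer's send callback. A minimal sketch; the handling policy (just logging) is an illustrative assumption:

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

public class LeaderAwareCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            return; // delivered
        }
        if (exception instanceof RetriableException) {
            // NotLeaderForPartitionException extends RetriableException: the
            // producer retries it internally (given retries > 0), so reaching
            // this callback means retries ran out, which usually indicates a
            // long election or connectivity problem rather than a brief blip.
            System.err.println("transient error exhausted retries: " + exception);
        } else {
            System.err.println("fatal send error: " + exception);
        }
    }
}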
NotLeaderForPartitionException
Hello,

in our Kafka clusters we sometimes observe a specific ERROR log statement, and therefore we have doubts whether the cluster is running stable in our configuration. This occurs every now and then, like two or three times a day. It is actually the predominant ERROR log statement in our cluster. Example:

[2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for partition [,] to broker 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

We already asked Google, but we did not find sufficient answers to our questions, therefore I am asking on the mailing list:

1. What are the possible reasons for this particular error?

2. What are the implications of it?

3. What can be done to prevent it?

Best Regards,
Sven
I am getting NotLeaderForPartitionException on internal changelog table topics
Hi,

In my application I have replicated internal changelog topics. From time to time I get this exception, and I am not able to figure out why:

[2016-12-05 11:05:10,635] ERROR Error sending record to topic test-stream-key-table-changelog (org.apache.kafka.streams.processor.internals.RecordCollector)
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
[2016-12-05 11:05:10,755] ERROR Error sending record to topic test-stream-key-table-changelog (org.apache.kafka.streams.processor.internals.RecordCollector)

Please let me know what could be causing this.

Thanks,
Sachin
Re: NotLeaderForPartitionException causing event loss
Increasing the number of retries and/or retry.backoff.ms will help reduce the data loss. Figure out how long the NotLeaderForPartitionException occurs (it happens only as long as metadata is obsolete), and configure the props below accordingly:

message.send.max.retries=3 (default)
retry.backoff.ms=100 (default)

> On Aug 12, 2016, at 1:59 AM, sunil kalva wrote:
>
> We are seeing data loss whenever we see a "NotLeaderForPartitionException".
> We are using the 0.8.2 java client publisher API with a callback; when I get
> a callback with the error, I log it in a file and retry later.
> The number of errors and the number of logged events match, but overall a
> few events are missing.
>
> Observations:
> The timestamps of the missing events are just before the timestamp of the
> error occurrence, meaning every time we get this error we lose a couple of
> events that were sent just before the error time.
>
> Is there any way to handle this? Should it be handled on the client side, or
> is some tuning required on the kafka side?
>
> SunilKalva
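A sketch of those knobs as they map onto the 0.8.2 Java producer the original poster is using, where the equivalent settings are named retries and retry.backoff.ms. Adding acks=all is my suggestion rather than something from the thread, since with acks=1 a record acknowledged by a leader that dies before replicating it can be lost even with retries:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class LossReducingConfigSketch {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        // Cover the leader-election window: retries * retry.backoff.ms should
        // exceed the longest observed NotLeaderForPartitionException burst.
        props.put(ProducerConfig.RETRIES_CONFIG, 8);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 300);
        // acks=1 can silently drop records the failed leader acknowledged but
        // never replicated; acks=all trades latency for durability.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return props;
    }
}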
NotLeaderForPartitionException causing event loss
We are seeing data loss whenever we see a "NotLeaderForPartitionException". We are using the 0.8.2 java client publisher API with a callback; when I get a callback with the error, I log it in a file and retry later. The number of errors and the number of logged events match, but overall a few events are missing.

Observations:
The timestamps of the missing events are just before the timestamp of the error occurrence, meaning every time we get this error we lose a couple of events that were sent just before the error time.

Is there any way to handle this? Should it be handled on the client side, or is some tuning required on the kafka side?

SunilKalva
NotLeaderForPartitionException -- how to recover?
Hi everyone,

Has anyone seen this error, and/or is there a good way to recover from it? I have two brokers running, and after attempting to produce to a topic (using the library kafka-python), both are spitting out lots of messages that look like this:

ERROR [ReplicaFetcherThread-0-1], Error for partition [__consumer_offsets,2] to broker 1:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

(The partition number varies, and on broker 1 the message says "to broker 2" instead of "to broker 1".)

Listing the topics reveals that broker 1 is the leader for all 50 partitions of __consumer_offsets. Both brokers are complaining about the same partitions, which is confusing to me because it looks like broker 1 should be the leader.

I have tried:
- restarting Kafka on each broker
- deleting the kafka-logs directory on each broker (and restarting)
- deleting the zookeeper directory on each server (and restarting)

Very grateful for any help!

Thanks,
Samuel
Re: NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Sorry for reviving this old mailing list discussion. I'm facing a similar issue while running a load test with many small topics (100 topics) with 4 partitions each. There is also a Flink user who is facing the issue: https://issues.apache.org/jira/browse/FLINK-3066

Are you also writing into many topics at the same time?

On Sat, Nov 14, 2015 at 4:07 PM, Sean Morris (semorris) wrote:
> I have suddenly started having issues where, when I am producing data, I
> occasionally get "NotLeaderForPartitionException: This server is not the
> leader for that topic-partition". I am using Kafka 2.10-0.8.2.1 with the
> new Producer class and no retries. If I add retries to the Producer
> properties, I end up with duplicates on the consumer side.
>
> Any idea what can be the cause of the NotLeaderForPartitionException?
>
> Thanks.
NotLeaderForPartitionException: This server is not the leader for that topic-partition.
I have suddenly started having issues where, when I am producing data, I occasionally get "NotLeaderForPartitionException: This server is not the leader for that topic-partition". I am using Kafka 2.10-0.8.2.1 with the new Producer class and no retries. If I add retries to the Producer properties, I end up with duplicates on the consumer side.

Any idea what can be the cause of the NotLeaderForPartitionException?

Thanks.
Produce request failed due to NotLeaderForPartitionException (leader epoch is old)
Hi Ivan,

I am facing the same issue you had. Were you able to figure out why it's happening? If so, could you please share it?

Thanks,
Chaitanya GSK
Produce request failed due to NotLeaderForPartitionException (leader epoch is old)
Hi,

During a round of investigating a kafka data discrepancy, I came across a bunch of recurring errors, shown below.

producer.log:
> 2015-06-14 13:06:25,591 WARN [task-thread-9] (k.p.a.DefaultEventHandler:83) - Produce request with correlation id 624 failed due to [mytopic,21]: kafka.common.NotLeaderForPartitionException

kafka.log:
> [2015-06-14 13:05:13,025] 418953499 [request-expiration-task] WARN kafka.server.ReplicaManager - [Replica Manager on Broker 61]: Fetch request with correlation id 1 from client fetchReq on partition [mytopic,21] failed due to Leader not local for partition [mytopic,21] on broker 61

state-change.log:
> [2015-06-14 13:05:11,495] WARN Broker 29 ignoring LeaderAndIsr request from controller 45 with correlation id 41799 epoch 27 for partition [mytopic,21] since its associated leader epoch 191 is old. Current leader epoch is 191 (state.change.logger)

The warnings keep repeating several times a day, and sometimes they coincide with the timestamps of presumably missing records. As far as I understand, an occasional NotLeaderForPartitionException is fine, but does the same apply to the "old leader epoch" warning? Could it be caused by a zk issue? However, I don't see anything particularly interesting in the zk logs, except "likely client has closed socket" or "Unexpected Exception: java.nio.channels.CancelledKeyException". The former is all over the log (it must be a client issue) and the latter is rare and not correlated with the original warnings.

Here are some bits of configuration:
Kafka 0.8.1.2, 3 brokers + 3 zk, 2x replication
request.required.acks=1
retry.backoff.ms=1000

Thanks,
Re: Getting NotLeaderForPartitionException in kafka broker
I will try to reproduce this problem later this week. Bouncing the broker fixed the issue, but the issue surfaced again after a period of time.

A little more context: the cluster was deployed to VMs, and I discovered that the issue appeared whenever CPU wait time was extremely high, like 90% of CPU time spent on I/O wait. I am more interested in understanding under what circumstances this issue happens so that I can take appropriate action.

On Fri, May 15, 2015 at 8:04 AM, Jiangjie Qin wrote:
> If you can reproduce this problem steadily, once you see this issue, can
> you grep the controller log for the topic partition in question and see if
> there is anything interesting?
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
> On 5/14/15, 3:43 AM, "tao xiao" wrote:
>
>> Yes, it does exist in ZK and the node that had the
>> NotLeaderForPartitionException is the leader of the topic
>>
>> On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin wrote:
>>
>>> Does this topic exist in Zookeeper?
>>>
>>> On 5/12/15, 11:35 PM, "tao xiao" wrote:
>>>
>>>> Hi,
>>>>
>>>> Any updates on this issue? I keep seeing this issue happening over and
>>>> over again
>>>>
>>>> On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> I have a 12-node cluster that has 800 topics, each of which has only 1
>>>>> partition. I observed that one of the nodes keeps generating
>>>>> NotLeaderForPartitionException, which causes the node to be
>>>>> unresponsive to all requests. Below is the exception:
>>>>>
>>>>> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for partition [topic1,0] to broker 12:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>>>>
>>>>> All other nodes in the cluster generate lots of replication errors
>>>>> too, as shown below, due to the unresponsiveness of the above node:
>>>>>
>>>>> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch request with correlation id 3630911 from client ReplicaFetcherThread-0-1 on partition [topic1,0] failed due to Leader not local for partition [cg22_user.item_attr_info.lcr,0] on broker 1 (kafka.server.ReplicaManager)
>>>>>
>>>>> Any suggestion why the node runs into this unstable state, and is
>>>>> there any configuration I can set to prevent it?
>>>>>
>>>>> I use kafka 0.8.2.1
>>>>>
>>>>> And here is the server.properties:
>>>>>
>>>>> broker.id=5
>>>>> port=9092
>>>>> num.network.threads=3
>>>>> num.io.threads=8
>>>>> socket.send.buffer.bytes=1048576
>>>>> socket.receive.buffer.bytes=1048576
>>>>> socket.request.max.bytes=104857600
>>>>> log.dirs=/mnt/kafka
>>>>> num.partitions=1
>>>>> num.recovery.threads.per.data.dir=1
>>>>> log.retention.hours=1
>>>>> log.segment.bytes=1073741824
>>>>> log.retention.check.interval.ms=30
>>>>> log.cleaner.enable=false
>>>>> zookeeper.connect=ip:2181
>>>>> zookeeper.connection.timeout.ms=6000
>>>>> unclean.leader.election.enable=false
>>>>> delete.topic.enable=true
>>>>> default.replication.factor=3
>>>>> num.replica.fetchers=3
>>>>> delete.topic.enable=true
>>>>> kafka.metrics.reporters=report.KafkaMetricsCollector
>>>>> straas.hubble.conf.file=/etc/kafka/report.conf
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Tao

--
Regards,
Tao
Re: Getting NotLeaderForPartitionException in kafka broker
If you can reproduce this problem steadily, once you see this issue, can you grep the controller log for topic partition in question and see if there is anything interesting? Thanks. Jiangjie (Becket) Qin On 5/14/15, 3:43 AM, "tao xiao" wrote: >Yes, it does exist in ZK and the node that had the >NotLeaderForPartitionException >is the leader of the topic > >On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin >wrote: > >> Does this topic exist in Zookeeper? >> >> On 5/12/15, 11:35 PM, "tao xiao" wrote: >> >> >Hi, >> > >> >Any updates on this issue? I keep seeing this issue happening over and >> >over >> >again >> > >> >On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote: >> > >> >> Hi team, >> >> >> >> I have a 12 nodes cluster that has 800 topics and each of which has >> >>only 1 >> >> partition. I observed that one of the node keeps generating >> >> NotLeaderForPartitionException that causes the node to be >>unresponsive >> >>to >> >> all requests. Below is the exception >> >> >> >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error >>for >> >> partition [topic1,0] to broker 12:class >> >> kafka.common.NotLeaderForPartitionException >> >> (kafka.server.ReplicaFetcherThread) >> >> >> >> All other nodes in the cluster generate lots of replication error >>too as >> >> shown below due to unresponsiveness of above node. >> >> >> >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch >> >> request with correlation id 3630911 from client >> >>ReplicaFetcherThread-0-1 on >> >> partition [topic1,0] failed due to Leader not local for partition >> >> [cg22_user.item_attr_info.lcr,0] on broker 1 >> >>(kafka.server.ReplicaManager) >> >> >> >> Any suggestion why the node runs into the unstable stage and any >> >> configuration I can set to prevent this? >> >> >> >> I use kafka 0.8.2.1 >> >> >> >> And here is the server.properties >> >> >> >> >> >> broker.id=5 >> >> port=9092 >> >> num.network.threads=3 >> >> num.io.threads=8 >> >> socket.send.buffer.bytes=1048576 >> >> socket.receive.buffer.bytes=1048576 >> >> socket.request.max.bytes=104857600 >> >> log.dirs=/mnt/kafka >> >> num.partitions=1 >> >> num.recovery.threads.per.data.dir=1 >> >> log.retention.hours=1 >> >> log.segment.bytes=1073741824 >> >> log.retention.check.interval.ms=30 >> >> log.cleaner.enable=false >> >> zookeeper.connect=ip:2181 >> >> zookeeper.connection.timeout.ms=6000 >> >> unclean.leader.election.enable=false >> >> delete.topic.enable=true >> >> default.replication.factor=3 >> >> num.replica.fetchers=3 >> >> delete.topic.enable=true >> >> kafka.metrics.reporters=report.KafkaMetricsCollector >> >> straas.hubble.conf.file=/etc/kafka/report.conf >> >> >> >> >> >> >> >> >> >> -- >> >> Regards, >> >> Tao >> >> >> > >> > >> > >> >-- >> >Regards, >> >Tao >> >> > > >-- >Regards, >Tao
Re: Getting NotLeaderForPartitionException in kafka broker
Can you try bouncing that broker? Thanks, Mayuresh On Thu, May 14, 2015 at 3:43 AM, tao xiao wrote: > Yes, it does exist in ZK and the node that had the > NotLeaderForPartitionException > is the leader of the topic > > On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin > wrote: > > > Does this topic exist in Zookeeper? > > > > On 5/12/15, 11:35 PM, "tao xiao" wrote: > > > > >Hi, > > > > > >Any updates on this issue? I keep seeing this issue happening over and > > >over > > >again > > > > > >On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote: > > > > > >> Hi team, > > >> > > >> I have a 12 nodes cluster that has 800 topics and each of which has > > >>only 1 > > >> partition. I observed that one of the node keeps generating > > >> NotLeaderForPartitionException that causes the node to be unresponsive > > >>to > > >> all requests. Below is the exception > > >> > > >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for > > >> partition [topic1,0] to broker 12:class > > >> kafka.common.NotLeaderForPartitionException > > >> (kafka.server.ReplicaFetcherThread) > > >> > > >> All other nodes in the cluster generate lots of replication error too > as > > >> shown below due to unresponsiveness of above node. > > >> > > >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch > > >> request with correlation id 3630911 from client > > >>ReplicaFetcherThread-0-1 on > > >> partition [topic1,0] failed due to Leader not local for partition > > >> [cg22_user.item_attr_info.lcr,0] on broker 1 > > >>(kafka.server.ReplicaManager) > > >> > > >> Any suggestion why the node runs into the unstable stage and any > > >> configuration I can set to prevent this? > > >> > > >> I use kafka 0.8.2.1 > > >> > > >> And here is the server.properties > > >> > > >> > > >> broker.id=5 > > >> port=9092 > > >> num.network.threads=3 > > >> num.io.threads=8 > > >> socket.send.buffer.bytes=1048576 > > >> socket.receive.buffer.bytes=1048576 > > >> socket.request.max.bytes=104857600 > > >> log.dirs=/mnt/kafka > > >> num.partitions=1 > > >> num.recovery.threads.per.data.dir=1 > > >> log.retention.hours=1 > > >> log.segment.bytes=1073741824 > > >> log.retention.check.interval.ms=30 > > >> log.cleaner.enable=false > > >> zookeeper.connect=ip:2181 > > >> zookeeper.connection.timeout.ms=6000 > > >> unclean.leader.election.enable=false > > >> delete.topic.enable=true > > >> default.replication.factor=3 > > >> num.replica.fetchers=3 > > >> delete.topic.enable=true > > >> kafka.metrics.reporters=report.KafkaMetricsCollector > > >> straas.hubble.conf.file=/etc/kafka/report.conf > > >> > > >> > > >> > > >> > > >> -- > > >> Regards, > > >> Tao > > >> > > > > > > > > > > > >-- > > >Regards, > > >Tao > > > > > > > -- > Regards, > Tao > -- -Regards, Mayuresh R. Gharat (862) 250-7125
Re: Getting NotLeaderForPartitionException in kafka broker
Yes, it does exist in ZK and the node that had the NotLeaderForPartitionException is the leader of the topic On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin wrote: > Does this topic exist in Zookeeper? > > On 5/12/15, 11:35 PM, "tao xiao" wrote: > > >Hi, > > > >Any updates on this issue? I keep seeing this issue happening over and > >over > >again > > > >On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote: > > > >> Hi team, > >> > >> I have a 12 nodes cluster that has 800 topics and each of which has > >>only 1 > >> partition. I observed that one of the node keeps generating > >> NotLeaderForPartitionException that causes the node to be unresponsive > >>to > >> all requests. Below is the exception > >> > >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for > >> partition [topic1,0] to broker 12:class > >> kafka.common.NotLeaderForPartitionException > >> (kafka.server.ReplicaFetcherThread) > >> > >> All other nodes in the cluster generate lots of replication error too as > >> shown below due to unresponsiveness of above node. > >> > >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch > >> request with correlation id 3630911 from client > >>ReplicaFetcherThread-0-1 on > >> partition [topic1,0] failed due to Leader not local for partition > >> [cg22_user.item_attr_info.lcr,0] on broker 1 > >>(kafka.server.ReplicaManager) > >> > >> Any suggestion why the node runs into the unstable stage and any > >> configuration I can set to prevent this? > >> > >> I use kafka 0.8.2.1 > >> > >> And here is the server.properties > >> > >> > >> broker.id=5 > >> port=9092 > >> num.network.threads=3 > >> num.io.threads=8 > >> socket.send.buffer.bytes=1048576 > >> socket.receive.buffer.bytes=1048576 > >> socket.request.max.bytes=104857600 > >> log.dirs=/mnt/kafka > >> num.partitions=1 > >> num.recovery.threads.per.data.dir=1 > >> log.retention.hours=1 > >> log.segment.bytes=1073741824 > >> log.retention.check.interval.ms=30 > >> log.cleaner.enable=false > >> zookeeper.connect=ip:2181 > >> zookeeper.connection.timeout.ms=6000 > >> unclean.leader.election.enable=false > >> delete.topic.enable=true > >> default.replication.factor=3 > >> num.replica.fetchers=3 > >> delete.topic.enable=true > >> kafka.metrics.reporters=report.KafkaMetricsCollector > >> straas.hubble.conf.file=/etc/kafka/report.conf > >> > >> > >> > >> > >> -- > >> Regards, > >> Tao > >> > > > > > > > >-- > >Regards, > >Tao > > -- Regards, Tao
Re: Getting NotLeaderForPartitionException in kafka broker
Does this topic exist in Zookeeper? On 5/12/15, 11:35 PM, "tao xiao" wrote: >Hi, > >Any updates on this issue? I keep seeing this issue happening over and >over >again > >On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote: > >> Hi team, >> >> I have a 12 nodes cluster that has 800 topics and each of which has >>only 1 >> partition. I observed that one of the node keeps generating >> NotLeaderForPartitionException that causes the node to be unresponsive >>to >> all requests. Below is the exception >> >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for >> partition [topic1,0] to broker 12:class >> kafka.common.NotLeaderForPartitionException >> (kafka.server.ReplicaFetcherThread) >> >> All other nodes in the cluster generate lots of replication error too as >> shown below due to unresponsiveness of above node. >> >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch >> request with correlation id 3630911 from client >>ReplicaFetcherThread-0-1 on >> partition [topic1,0] failed due to Leader not local for partition >> [cg22_user.item_attr_info.lcr,0] on broker 1 >>(kafka.server.ReplicaManager) >> >> Any suggestion why the node runs into the unstable stage and any >> configuration I can set to prevent this? >> >> I use kafka 0.8.2.1 >> >> And here is the server.properties >> >> >> broker.id=5 >> port=9092 >> num.network.threads=3 >> num.io.threads=8 >> socket.send.buffer.bytes=1048576 >> socket.receive.buffer.bytes=1048576 >> socket.request.max.bytes=104857600 >> log.dirs=/mnt/kafka >> num.partitions=1 >> num.recovery.threads.per.data.dir=1 >> log.retention.hours=1 >> log.segment.bytes=1073741824 >> log.retention.check.interval.ms=30 >> log.cleaner.enable=false >> zookeeper.connect=ip:2181 >> zookeeper.connection.timeout.ms=6000 >> unclean.leader.election.enable=false >> delete.topic.enable=true >> default.replication.factor=3 >> num.replica.fetchers=3 >> delete.topic.enable=true >> kafka.metrics.reporters=report.KafkaMetricsCollector >> straas.hubble.conf.file=/etc/kafka/report.conf >> >> >> >> >> -- >> Regards, >> Tao >> > > > >-- >Regards, >Tao
Re: Getting NotLeaderForPartitionException in kafka broker
Hi, Any updates on this issue? I keep seeing this issue happening over and over again On Thu, May 7, 2015 at 7:28 PM, tao xiao wrote: > Hi team, > > I have a 12 nodes cluster that has 800 topics and each of which has only 1 > partition. I observed that one of the node keeps generating > NotLeaderForPartitionException that causes the node to be unresponsive to > all requests. Below is the exception > > [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for > partition [topic1,0] to broker 12:class > kafka.common.NotLeaderForPartitionException > (kafka.server.ReplicaFetcherThread) > > All other nodes in the cluster generate lots of replication error too as > shown below due to unresponsiveness of above node. > > [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch > request with correlation id 3630911 from client ReplicaFetcherThread-0-1 on > partition [topic1,0] failed due to Leader not local for partition > [cg22_user.item_attr_info.lcr,0] on broker 1 (kafka.server.ReplicaManager) > > Any suggestion why the node runs into the unstable stage and any > configuration I can set to prevent this? > > I use kafka 0.8.2.1 > > And here is the server.properties > > > broker.id=5 > port=9092 > num.network.threads=3 > num.io.threads=8 > socket.send.buffer.bytes=1048576 > socket.receive.buffer.bytes=1048576 > socket.request.max.bytes=104857600 > log.dirs=/mnt/kafka > num.partitions=1 > num.recovery.threads.per.data.dir=1 > log.retention.hours=1 > log.segment.bytes=1073741824 > log.retention.check.interval.ms=30 > log.cleaner.enable=false > zookeeper.connect=ip:2181 > zookeeper.connection.timeout.ms=6000 > unclean.leader.election.enable=false > delete.topic.enable=true > default.replication.factor=3 > num.replica.fetchers=3 > delete.topic.enable=true > kafka.metrics.reporters=report.KafkaMetricsCollector > straas.hubble.conf.file=/etc/kafka/report.conf > > > > > -- > Regards, > Tao > -- Regards, Tao
Getting NotLeaderForPartitionException in kafka broker
Hi team,

I have a 12-node cluster that has 800 topics, each of which has only 1 partition. I observed that one of the nodes keeps generating NotLeaderForPartitionException, which causes the node to become unresponsive to all requests. Below is the exception:

[2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for partition [topic1,0] to broker 12:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)

All other nodes in the cluster generate lots of replication errors too, as shown below, due to the unresponsiveness of the above node.

[2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch request with correlation id 3630911 from client ReplicaFetcherThread-0-1 on partition [topic1,0] failed due to Leader not local for partition [cg22_user.item_attr_info.lcr,0] on broker 1 (kafka.server.ReplicaManager)

Any suggestion why the node runs into this unstable state, and any configuration I can set to prevent it?

I use kafka 0.8.2.1. And here is the server.properties:

broker.id=5
port=9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/kafka
num.partitions=1
num.recovery.threads.per.data.dir=1
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=30
log.cleaner.enable=false
zookeeper.connect=ip:2181
zookeeper.connection.timeout.ms=6000
unclean.leader.election.enable=false
delete.topic.enable=true
default.replication.factor=3
num.replica.fetchers=3
delete.topic.enable=true
kafka.metrics.reporters=report.KafkaMetricsCollector
straas.hubble.conf.file=/etc/kafka/report.conf

--
Regards,
Tao
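While this is happening, a quick way to see how many of the 800 topics are affected is to ask the topic tool for partitions whose ISR has shrunk. A sketch using the stock tooling; the flag is part of the 0.8.x topic command, and the ZooKeeper address is the one from the config above:

    bin/kafka-topics.sh --describe --zookeeper ip:2181 --under-replicated-partitions

Any partition listed has followers that have fallen out of sync, which usually goes hand in hand with the fetcher errors shown above.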
Re: NotLeaderForPartitionException while doing performance test
Thanks, Jaikiran. I was trying to duplicate the same issue by running the same performance test on the master node of the cluster, say exemplary-birds.master, and I did see such errors again:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

At the same time, I ran "lsof -i"; here is the output:

java   6991  root  28u   IPv6  4334833  0t0  TCP *:50320 (LISTEN)
java   6991  root  29u   IPv6  4351835  0t0  TCP exemplary-birds.master:9092->exemplary-birds.master:50472 (ESTABLISHED)
java   6991  root  38u   IPv6  4366588  0t0  TCP exemplary-birds.master:59536->complicated-laugh.master:2181 (ESTABLISHED)
java   6991  root  150u  IPv6  4361502  0t0  TCP *:9092 (LISTEN)
java   6991  root  151u  IPv6  4368439  0t0  TCP exemplary-birds.master:9092->harmful-jar.master:51131 (ESTABLISHED)
java   6991  root  154u  IPv6  4365924  0t0  TCP exemplary-birds.master:55248->voluminous-mass.master:9092 (ESTABLISHED)
java   6991  root  155u  IPv6  4366591  0t0  TCP exemplary-birds.master:55245->voluminous-mass.master:9092 (ESTABLISHED)
java   6991  root  157u  IPv6  4365923  0t0  TCP exemplary-birds.master:41946->harmful-jar.master:9092 (ESTABLISHED)
java   6991  root  176u  IPv6  4358833  0t0  TCP exemplary-birds.master:55251->voluminous-mass.master:9092 (ESTABLISHED)
java   6991  root  179u  IPv6  4292501  0t0  TCP exemplary-birds.master:41951->harmful-jar.master:9092 (ESTABLISHED)
java   6991  root  180u  IPv6  4338331  0t0  TCP exemplary-birds.master:9092->harmful-jar.master:51133 (ESTABLISHED)
java   6991  root  181u  IPv6  4364530  0t0  TCP exemplary-birds.master:9092->voluminous-mass.master:42897 (ESTABLISHED)
java   6991  root  182u  IPv6  4358834  0t0  TCP exemplary-birds.master:9092->harmful-jar.master:51134 (ESTABLISHED)
java   6991  root  183u  IPv6  4354353  0t0  TCP exemplary-birds.master:9092->voluminous-mass.master:42898 (ESTABLISHED)
java   6991  root  190u  IPv6  4351836  0t0  TCP exemplary-birds.master:9092->localhost:40786 (ESTABLISHED)
java   6991  root  201u  IPv6  4364543  0t0  TCP exemplary-birds.master:9092->harmful-jar.master:51135 (ESTABLISHED)
java   6991  root  202u  IPv6  4364544  0t0  TCP exemplary-birds.master:9092->voluminous-mass.master:42899 (ESTABLISHED)
java   7218  root  44u   IPv6  4366240  0t0  TCP *:46256 (LISTEN)
java   7218  root  48u   IPv6  4366602  0t0  TCP exemplary-birds.master:50472->exemplary-birds.master:9092 (ESTABLISHED)
java   7218  root  50u   IPv6  4350446  0t0  TCP exemplary-birds.master:41960->harmful-jar.master:9092 (ESTABLISHED)
java   7218  root  51u   IPv6  4350447  0t0  TCP localhost:40786->exemplary-birds.master:9092 (ESTABLISHED)
java   7218  root  52u   IPv6  4350448  0t0  TCP exemplary-birds.master:55263->voluminous-mass.master:9092 (ESTABLISHED)
java   17582 root  44u   IPv6  4326187  0t0  TCP *:46316 (LISTEN)
ntpd   18649 ntp   16u   IPv4  656334   0t0  UDP *:ntp
ntpd   18649 ntp   17u   IPv6  656335   0t0  UDP *:ntp
ntpd   18649 ntp   18u   IPv4  656341   0t0  UDP localhost:ntp
ntpd   18649 ntp   19u   IPv4  656342   0t0  UDP exemplary-birds.master:ntp
ntpd   18649 ntp   20u   IPv6  656343   0t0  UDP localhost:ntp
ntpd   18649 ntp   21u   IPv6  656344   0t0  UDP [fe80::7a2b:cbff:fe1f:2e77]:ntp
sshd   21995 root  3u    IPv4  4277546  0t0  TCP exemplary-birds.master:ssh->10.100.68.15:60642 (ESTABLISHED)
sshd   22091 fitsum 3u   IPv4  4277546  0t0  TCP exemplary-birds.master:ssh->10.100.68.15:60642 (ESTABLISHED)
java   22152 root  21u   IPv6  213140   0t0  TCP *:52411 (LISTEN)
jav
Re: NotLeaderForPartitionException while doing performance test
There are different ways to find the connection count and each one depends on the operating system that's being used. "lsof -i" is one option, for example, on *nix systems. -Jaikiran On Thursday 08 January 2015 11:40 AM, Sa Li wrote: Yes, it is weird hostname, ;), that is what our system guys name it. How to take a note to measure the connections open to 10.100.98.102? Thanks AL On Jan 7, 2015 9:42 PM, "Jaikiran Pai" wrote: On Thursday 08 January 2015 01:51 AM, Sa Li wrote: see this type of error again, back to normal in few secs [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 That's a really weird hostname, the "harmful-jar.master". Is that really your hostname? You mention that this happens during performance testing. Have you taken a note of how many connection are open to that 10.100.98.102 IP when this "Connection refused" exception happens? -Jaikiran (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll( Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll( Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll( Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg latency, 2418.0 max latency. 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg latency, 3529.0 max latency. 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg latency, 3858.0 max latency. On Wed, Jan 7, 2015 at 12:07 PM, Sa Li wrote: Hi, All I am doing performance test by bin/kafka-run-class.sh org.apache.kafka.clients. 
tools.ProducerPerformance test-rep-three 5 100 -1 acks=1 bootstrap.servers= 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 buffer.memory=67108864 batch.size=8196 where the topic test-rep-three is described as follow: bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic test-rep-three Topic:test-rep-threePartitionCount:8ReplicationFactor:3 Configs: Topic: test-rep-three Partition: 0Leader: 100 Replicas: 100,102,101 Isr: 102,101,100 Topic: test-rep-three Partition: 1Leader: 101 Replicas: 101,100,102 Isr: 102,101,100 Topic: test-rep-three Partition: 2Leader: 102 Replicas: 102,101,100 Isr: 101,102,100 Topic: test-rep-three Partition: 3Leader: 100 Replicas: 100,101,102 Isr: 101,100,102 Topic: test-rep-three Partition: 4Leader: 101 Replicas: 101,102,100 Isr: 102,100,101 Topic: test-rep-three Partition: 5Leader: 102 Replicas: 102,100,101 Isr: 100,102,101 Topic: test-rep-three Partition: 6Leader: 102 Replicas: 100,102,101 Isr: 102,101,100 Topic: test-rep-three Partition: 7Leader: 101 Replicas: 101,100,102 Isr: 101,100,102 Apparently, it produces the messages and run for a while, but it periodically have such exceptions: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.
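As a concrete starting point for that suggestion, a sketch of counting established connections on the broker port with lsof; the port and IP are the ones from this thread, and the -n/-P flags just disable name resolution so the grep matches raw addresses:

    lsof -nP -i TCP:9092 | grep ESTABLISHED | wc -l
    # or only the connections to one specific broker:
    lsof -nP -i TCP | grep '10.100.98.102:9092' | grep ESTABLISHED | wc -l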
Re: NotLeaderForPartitionException while doing performance test
Yes, it is weird hostname, ;), that is what our system guys name it. How to take a note to measure the connections open to 10.100.98.102? Thanks AL On Jan 7, 2015 9:42 PM, "Jaikiran Pai" wrote: > On Thursday 08 January 2015 01:51 AM, Sa Li wrote: > >> see this type of error again, back to normal in few secs >> >> [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/ >> 10.100.98.102 >> > > That's a really weird hostname, the "harmful-jar.master". Is that really > your hostname? You mention that this happens during performance testing. > Have you taken a note of how many connection are open to that 10.100.98.102 > IP when this "Connection refused" exception happens? > > -Jaikiran > > >(org.apache.kafka.common.network.Selector) >> java.net.ConnectException: Connection refused >> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >> at >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) >> at org.apache.kafka.common.network.Selector.poll( >> Selector.java:232) >> at >> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) >> at java.lang.Thread.run(Thread.java:745) >> [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/ >> 10.100.98.102 (org.apache.kafka.common.network.Selector) >> java.net.ConnectException: Connection refused >> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >> at >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) >> at org.apache.kafka.common.network.Selector.poll( >> Selector.java:232) >> at >> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) >> at java.lang.Thread.run(Thread.java:745) >> [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/ >> 10.100.98.102 (org.apache.kafka.common.network.Selector) >> java.net.ConnectException: Connection refused >> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) >> at >> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) >> at org.apache.kafka.common.network.Selector.poll( >> Selector.java:232) >> at >> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) >> at >> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) >> at java.lang.Thread.run(Thread.java:745) >> 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg >> latency, 2418.0 max latency. >> 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg >> latency, 3529.0 max latency. >> 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg >> latency, 3858.0 max latency. >> >> On Wed, Jan 7, 2015 at 12:07 PM, Sa Li wrote: >> >> Hi, All >>> >>> I am doing performance test by >>> >>> bin/kafka-run-class.sh org.apache.kafka.clients. 
>>> tools.ProducerPerformance >>> test-rep-three 5 100 -1 acks=1 bootstrap.servers= >>> 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 >>> buffer.memory=67108864 batch.size=8196 >>> >>> where the topic test-rep-three is described as follow: >>> >>> bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic >>> test-rep-three >>> Topic:test-rep-threePartitionCount:8ReplicationFactor:3 >>> Configs: >>> Topic: test-rep-three Partition: 0Leader: 100 >>> Replicas: >>> 100,102,101 Isr: 102,101,100 >>> Topic: test-rep-three Partition: 1Leader: 101 >>> Replicas: >>> 101,100,102 Isr: 102,101,100 >>> Topic: test-rep-three Partition: 2Leader: 102 >>> Replicas: >>> 102,101,100 Isr: 101,102,100 >>> Topic: test-rep-three Partition: 3Leader: 100 >>> Replicas: >>> 100,101,102 Isr: 101,100,102 >>> Topic: test-rep-three Partition: 4Leader: 101 >>> Replicas: >>> 101,102,100 Isr: 102,100,101 >>> Topic: test-rep-three Partition: 5Leader: 102 >>> Replicas: >>> 102,100,101 Isr: 100,102,101 >>> Topic: test-rep-three Partition: 6Leader: 102 >>> Replicas: >>> 100,102,101 Isr: 102,101,100 >>> Topic: test-rep-three Partition: 7Leader: 101 >>> Replicas: >>> 101,100,102 Isr: 101,100,102 >>> >>> Apparently, it produces the messages and run for a while, but it >>> periodically have such exceptions: >>> >>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This >>> server >>> is not the leader for that topic-partition. >>> org.apache.kafka.common.errors.NotLeaderForPar
Re: NotLeaderForPartitionException while doing performance test
On Thursday 08 January 2015 01:51 AM, Sa Li wrote: see this type of error again, back to normal in few secs [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 That's a really weird hostname, the "harmful-jar.master". Is that really your hostname? You mention that this happens during performance testing. Have you taken a note of how many connection are open to that 10.100.98.102 IP when this "Connection refused" exception happens? -Jaikiran (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg latency, 2418.0 max latency. 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg latency, 3529.0 max latency. 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg latency, 3858.0 max latency. 
On Wed, Jan 7, 2015 at 12:07 PM, Sa Li wrote: Hi, All I am doing performance test by bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test-rep-three 5 100 -1 acks=1 bootstrap.servers= 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 buffer.memory=67108864 batch.size=8196 where the topic test-rep-three is described as follow: bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic test-rep-three Topic:test-rep-threePartitionCount:8ReplicationFactor:3 Configs: Topic: test-rep-three Partition: 0Leader: 100 Replicas: 100,102,101 Isr: 102,101,100 Topic: test-rep-three Partition: 1Leader: 101 Replicas: 101,100,102 Isr: 102,101,100 Topic: test-rep-three Partition: 2Leader: 102 Replicas: 102,101,100 Isr: 101,102,100 Topic: test-rep-three Partition: 3Leader: 100 Replicas: 100,101,102 Isr: 101,100,102 Topic: test-rep-three Partition: 4Leader: 101 Replicas: 101,102,100 Isr: 102,100,101 Topic: test-rep-three Partition: 5Leader: 102 Replicas: 102,100,101 Isr: 100,102,101 Topic: test-rep-three Partition: 6Leader: 102 Replicas: 100,102,101 Isr: 102,101,100 Topic: test-rep-three Partition: 7Leader: 101 Replicas: 101,100,102 Isr: 101,100,102 Apparently, it produces the messages and run for a while, but it periodically have such exceptions: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partit
Re: NotLeaderForPartitionException while doing performance test
I checked the topic config; the ISR changes dynamically.

root@voluminous-mass:/srv/kafka# bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic test-rep-three
Topic:test-rep-three    PartitionCount:8        ReplicationFactor:3     Configs:
        Topic: test-rep-three   Partition: 0    Leader: 100     Replicas: 100,102,101   Isr: 100
        Topic: test-rep-three   Partition: 1    Leader: 100     Replicas: 101,100,102   Isr: 100,101,102
        Topic: test-rep-three   Partition: 2    Leader: 102     Replicas: 102,101,100   Isr: 101,102
        Topic: test-rep-three   Partition: 3    Leader: 100     Replicas: 100,101,102   Isr: 100
        Topic: test-rep-three   Partition: 4    Leader: 100     Replicas: 101,102,100   Isr: 100
        Topic: test-rep-three   Partition: 5    Leader: 102     Replicas: 102,100,101   Isr: 100,102,101
        Topic: test-rep-three   Partition: 6    Leader: 100     Replicas: 100,102,101   Isr: 100,102,101
        Topic: test-rep-three   Partition: 7    Leader: 100     Replicas: 101,100,102   Isr: 100

root@voluminous-mass:/srv/kafka# bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic test-rep-three
Topic:test-rep-three    PartitionCount:8        ReplicationFactor:3     Configs:
        Topic: test-rep-three   Partition: 0    Leader: 100     Replicas: 100,102,101   Isr: 102,100,101
        Topic: test-rep-three   Partition: 1    Leader: 101     Replicas: 101,100,102   Isr: 101,102,100
        Topic: test-rep-three   Partition: 2    Leader: 102     Replicas: 102,101,100   Isr: 101,102
        Topic: test-rep-three   Partition: 3    Leader: 100     Replicas: 100,101,102   Isr: 101,100,102
        Topic: test-rep-three   Partition: 4    Leader: 101     Replicas: 101,102,100   Isr: 101,102,100
        Topic: test-rep-three   Partition: 5    Leader: 102     Replicas: 102,100,101   Isr: 102,101,100
        Topic: test-rep-three   Partition: 6    Leader: 102     Replicas: 100,102,101   Isr: 102,101
        Topic: test-rep-three   Partition: 7    Leader: 101     Replicas: 101,100,102   Isr: 101,100,102

Why does that happen? Thanks

On Wed, Jan 7, 2015 at 12:21 PM, Sa Li wrote:

> see this type of error again, back to normal in few secs
>
> [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
>         at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>         at java.lang.Thread.run(Thread.java:745)
> [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
>         at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>         at java.lang.Thread.run(Thread.java:745)
> [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
>         at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>         at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>         at java.lang.Thread.run(Thread.java:745)
> 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg latency, 2418.0 max latency.
> 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg latency, 3529.0 max latency.
> 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg latency, 3858.0 max latency.
>
> On Wed, Jan 7, 2015 at 12:07 PM, Sa Li wrote:
>
>> Hi, All
>>
>> I am doing performance test by
>>
>> bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
>> test-rep-three 5 100 -1 acks=1 bootstrap.servers=
>> 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
>> buffer.memor
Re: NotLeaderForPartitionException while doing performance test
see this type of error again, back to normal in few secs [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/ 10.100.98.102 (org.apache.kafka.common.network.Selector) java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.kafka.common.network.Selector.poll(Selector.java:232) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184) at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115) at java.lang.Thread.run(Thread.java:745) 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg latency, 2418.0 max latency. 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg latency, 3529.0 max latency. 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg latency, 3858.0 max latency. 
On Wed, Jan 7, 2015 at 12:07 PM, Sa Li wrote: > Hi, All > > I am doing performance test by > > bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance > test-rep-three 5 100 -1 acks=1 bootstrap.servers= > 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 > buffer.memory=67108864 batch.size=8196 > > where the topic test-rep-three is described as follow: > > bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic > test-rep-three > Topic:test-rep-threePartitionCount:8ReplicationFactor:3 > Configs: > Topic: test-rep-three Partition: 0Leader: 100 Replicas: > 100,102,101 Isr: 102,101,100 > Topic: test-rep-three Partition: 1Leader: 101 Replicas: > 101,100,102 Isr: 102,101,100 > Topic: test-rep-three Partition: 2Leader: 102 Replicas: > 102,101,100 Isr: 101,102,100 > Topic: test-rep-three Partition: 3Leader: 100 Replicas: > 100,101,102 Isr: 101,100,102 > Topic: test-rep-three Partition: 4Leader: 101 Replicas: > 101,102,100 Isr: 102,100,101 > Topic: test-rep-three Partition: 5Leader: 102 Replicas: > 102,100,101 Isr: 100,102,101 > Topic: test-rep-three Partition: 6Leader: 102 Replicas: > 100,102,101 Isr: 102,101,100 > Topic: test-rep-three Partition: 7Leader: 101 Replicas: > 101,100,102 Isr: 101,100,102 > > Apparently, it produces the messages and run for a while, but it > periodically have such exceptions: > > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeaderForPartitionException: This server > is not the leader for that topic-partition. > org.apache.kafka.common.errors.NotLeade
NotLeaderForPartitionException while doing performance test
Hi, All

I am doing a performance test with:

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test-rep-three 5 100 -1 acks=1 bootstrap.servers=10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 buffer.memory=67108864 batch.size=8196

where the topic test-rep-three is described as follows:

bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic test-rep-three
Topic:test-rep-three    PartitionCount:8        ReplicationFactor:3     Configs:
        Topic: test-rep-three   Partition: 0    Leader: 100     Replicas: 100,102,101   Isr: 102,101,100
        Topic: test-rep-three   Partition: 1    Leader: 101     Replicas: 101,100,102   Isr: 102,101,100
        Topic: test-rep-three   Partition: 2    Leader: 102     Replicas: 102,101,100   Isr: 101,102,100
        Topic: test-rep-three   Partition: 3    Leader: 100     Replicas: 100,101,102   Isr: 101,100,102
        Topic: test-rep-three   Partition: 4    Leader: 101     Replicas: 101,102,100   Isr: 102,100,101
        Topic: test-rep-three   Partition: 5    Leader: 102     Replicas: 102,100,101   Isr: 100,102,101
        Topic: test-rep-three   Partition: 6    Leader: 102     Replicas: 100,102,101   Isr: 102,101,100
        Topic: test-rep-three   Partition: 7    Leader: 101     Replicas: 101,100,102   Isr: 101,100,102

Apparently, it produces the messages and runs for a while, but it periodically hits such exceptions:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

141292 records sent, 28258.4 records/sec (80.85 MB/sec), 551.2 ms avg latency, 1494.0 max latency.
142526 records sent, 28505.2 records/sec (81.55 MB/sec), 580.8 ms avg latency, 1513.0 max latency.
146564 records sent, 29312.8 records/sec (83.86 MB/sec), 557.9 ms avg latency, 1431.0 max latency.
146755 records sent, 29351.0 records/sec (83.97 MB/sec), 556.7 ms avg latency, 1480.0 max latency.
147963 records sent, 29592.6 records/sec (84.67 MB/sec), 556.7 ms avg latency, 1546.0 max latency.
146931 records sent, 29386.2 records/sec (84.07 MB/sec), 550.9 ms avg latency, 1715.0 max latency.
146947 records sent, 29389.4 records/sec (84.08 MB/sec), 555.1 ms avg latency, 1750.0 max latency.
146422 records sent, 29284.4 records/sec (83.78 MB/sec), 557.9 ms avg latency, 1818.0 max latency.
147516 records sent, 29503.2 records/sec (84.41 MB/sec), 555.6 ms avg latency, 1806.0 max latency.
147877 records sent, 29575.4 records/sec (84.62 MB/sec), 552.1 ms avg latency, 1821.0 max latency.
147201 records sent, 29440.2 records/sec (84.23 MB/sec), 554.5 ms avg latency, 1826.0 max latency.
148317 records sent, 29663.4 records/sec (84.87 MB/sec), 558.1 ms avg latency, 1792.0 max latency.
147756 records sent, 29551.2 records/sec (84.55 MB/sec), 550.9 ms avg latency, 1806.0 max latency.

then it goes back into a correct processing state. Is that because of a rebalance?

Thanks

--
Alec Li
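Since the send errors in this test are transient and clear up on their own once leadership settles, one knob worth trying is the new producer's built-in retry. ProducerPerformance passes key=value pairs straight through to the producer config (as the acks=1 setting above already does), so the same command can be extended like this; a sketch, with illustrative values for the standard retries and retry.backoff.ms settings:

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test-rep-three 5 100 -1 acks=1 bootstrap.servers=10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 buffer.memory=67108864 batch.size=8196 retries=3 retry.backoff.ms=500

With retries enabled, the producer refreshes metadata and resends on NotLeaderForPartitionException instead of surfacing it to the caller.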
Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
Will definitely take a thread dump! So, far its been running fine. -Jacob On Wed, Oct 15, 2014 at 8:40 PM, Jun Rao wrote: > If you see the hanging again, it would be great if you can take a thread > dump so that we know where it is hanging. > > Thanks, > > Jun > > On Tue, Oct 14, 2014 at 10:35 PM, Abraham Jacob > wrote: > > > Hi Jun, > > > > Thanks for responding... > > > > I am using Kafka 2.9.2-0.8.1.1 > > > > I looked through the controller logs on a couple of nodes and did not > find > > any exceptions or error. > > > > However in the state change log I see a bunch of the following > exceptions - > > > > [2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader > > election for partition [wordcount,1] (state.change.logger) > > [2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state > > change for partition [wordcount,1] from OfflinePartition to > OnlinePartition > > failed (state.change.logger) > > kafka.common.NoReplicaOnlineException: No replica for partition > > [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are: > > [List(8, 7, 1)] > > at > > > > > kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61) > > at > > > > > kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336) > > at > > > > > kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185) > > at > > > > > kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99) > > at > > > > > kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96) > > at > > > > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743) > > at > > > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95) > > at > > > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95) > > at scala.collection.Iterator$class.foreach(Iterator.scala:772) > > at > > scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157) > > at > > > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190) > > at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45) > > at scala.collection.mutable.HashMap.foreach(HashMap.scala:95) > > at > > > > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742) > > at > > > > > kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96) > > at > > > > > kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68) > > at > > > > > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312) > > at > > > > > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162) > > at > > > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63) > > at > > > > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123) > > at > > > > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118) > > at > > > > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118) > > at kafka.utils.Utils$.inLock(Utils.scala:538) > > at > > > > > 
kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118) > > at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549) > > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > > > > > > Anyways, this morning after sending out the email, I set out to restart > all > > the brokers. I found that 3 brokers were in a hung state. I tried to use > > the bin/kafka-server-stop.sh script (which is nothing but sending a > SIGINT > > signal), the java process running kafka would not terminate, I then > issued > > a 'kill -SIGTERM x' for the java process running Kafka, yet the > process > > would not terminate. This happened only on 3 nodes (1 node is running > only > > 1 broker). For the other nodes kafka-server-stop.sh successfully bought > > down the java process running Kafka. > > > > For the three brokers that was not responding to either SIGINT and > SIGTERM > > signal I issued a SIGKILL instead and this, for sure brought down the > > process. > > > > I then restarted brokers on all nodes. After that I again ran the > describe > > topic script. > > > > bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b. > > ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount > >
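For anyone following along, a thread dump can be captured without restarting anything; either of these works against a running broker, assuming a standard JDK install (the pid placeholder is the Kafka java process found via ps):

    jstack <kafka-pid> > /tmp/kafka-threads.txt
    # or, if jstack is unavailable, SIGQUIT makes the JVM print a thread dump to its stdout log:
    kill -3 <kafka-pid>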
Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
If you see the hanging again, it would be great if you can take a thread dump so that we know where it is hanging. Thanks, Jun On Tue, Oct 14, 2014 at 10:35 PM, Abraham Jacob wrote: > Hi Jun, > > Thanks for responding... > > I am using Kafka 2.9.2-0.8.1.1 > > I looked through the controller logs on a couple of nodes and did not find > any exceptions or error. > > However in the state change log I see a bunch of the following exceptions - > > [2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader > election for partition [wordcount,1] (state.change.logger) > [2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state > change for partition [wordcount,1] from OfflinePartition to OnlinePartition > failed (state.change.logger) > kafka.common.NoReplicaOnlineException: No replica for partition > [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are: > [List(8, 7, 1)] > at > > kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61) > at > > kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336) > at > > kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185) > at > > kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99) > at > > kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96) > at > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95) > at scala.collection.Iterator$class.foreach(Iterator.scala:772) > at > scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:95) > at > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742) > at > > kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96) > at > > kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68) > at > > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312) > at > > kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63) > at > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123) > at > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118) > at > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118) > at kafka.utils.Utils$.inLock(Utils.scala:538) > at > > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118) > at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > > > Anyways, this morning after sending out the email, I set out to restart all > the brokers. 
I found that 3 brokers were in a hung state. I tried to use > the bin/kafka-server-stop.sh script (which is nothing but sending a SIGINT > signal), the java process running kafka would not terminate, I then issued > a 'kill -SIGTERM x' for the java process running Kafka, yet the process > would not terminate. This happened only on 3 nodes (1 node is running only > 1 broker). For the other nodes kafka-server-stop.sh successfully bought > down the java process running Kafka. > > For the three brokers that was not responding to either SIGINT and SIGTERM > signal I issued a SIGKILL instead and this, for sure brought down the > process. > > I then restarted brokers on all nodes. After that I again ran the describe > topic script. > > bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b. > ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount > > > Topic:wordcount PartitionCount:8ReplicationFactor:3 Configs: > Topic: wordcountPartition: 0Leader: 7 Replicas: > 7,6,8 Isr: 6,7,8 > Topic: wordcountPartition: 1Leader: 8 Replicas: > 8,7,1 Isr: 1,7,8 > Topic: wordcountPartition: 2Leader: 1 Replicas: > 1,8,2 Isr: 1,2,8 > Topic: wordcountPartiti
Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
Hi Jun,

Thanks for responding...

I am using Kafka 2.9.2-0.8.1.1.

I looked through the controller logs on a couple of nodes and did not find any exceptions or errors. However, in the state change log I see a bunch of the following exceptions:

[2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader election for partition [wordcount,1] (state.change.logger)
[2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state change for partition [wordcount,1] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(8, 7, 1)]
        at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
        at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
        at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
        at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
        at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
        at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
        at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
        at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123)
        at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
        at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
        at kafka.utils.Utils$.inLock(Utils.scala:538)
        at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118)
        at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)

Anyways, this morning after sending out the email, I set out to restart all the brokers. I found that 3 brokers were in a hung state. I tried to use the bin/kafka-server-stop.sh script (which does nothing but send a SIGINT signal), but the java process running Kafka would not terminate. I then issued 'kill -SIGTERM x' for the java process running Kafka, yet the process still would not terminate. This happened only on 3 nodes (1 node is running only 1 broker). For the other nodes, kafka-server-stop.sh successfully brought down the java process running Kafka.

For the three brokers that were not responding to either SIGINT or SIGTERM, I issued a SIGKILL instead, and this, for sure, brought down the process.

I then restarted the brokers on all nodes. After that I again ran the describe topic script:

bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount

Topic:wordcount PartitionCount:8        ReplicationFactor:3     Configs:
        Topic: wordcount        Partition: 0    Leader: 7       Replicas: 7,6,8 Isr: 6,7,8
        Topic: wordcount        Partition: 1    Leader: 8       Replicas: 8,7,1 Isr: 1,7,8
        Topic: wordcount        Partition: 2    Leader: 1       Replicas: 1,8,2 Isr: 1,2,8
        Topic: wordcount        Partiti
Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
Also, which version of Kafka are you using? Thanks, Jun On Tue, Oct 14, 2014 at 5:31 PM, Jun Rao wrote: > The following is a bit weird. It indicates no leader for partition 4, > which is inconsistent with what describe-topic shows. > > 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo: > Error while fetching metadata partition 4 leader: nonereplicas: 3 > (tr-pan-hclstr-13.amers1b.ciscloud:9092),2 > (tr-pan-hclstr-12.amers1b.ciscloud:9092),4 > (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated: > true for topic partition [wordcount,4]: [class > kafka.common.LeaderNotAvailableException] > > Any error in the controller and the state-change log? Do you see broker 3 > marked as dead in the controller log? Also, could you check if the broker > registration in ZK ( > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper) > has the correct host/port? > > Thanks, > > Jun > > On Mon, Oct 13, 2014 at 5:35 PM, Abraham Jacob > wrote: > >> Hi All, >> >> I have a 8 node Kafka cluster (broker.id - 1..8). On this cluster I have >> a >> topic "wordcount", which was 8 partitions with a replication factor of 3. >> >> So a describe of topic wordcount >> # bin/kafka-topics.sh --describe --zookeeper >> tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic >> wordcount >> >> >> Topic:wordcount PartitionCount:8ReplicationFactor:3 Configs: >> Topic: wordcountPartition: 0Leader: 6 Replicas: 7,6,8 >> Isr: 6,7,8 >> Topic: wordcountPartition: 1Leader: 7 Replicas: 8,7,1 >> Isr: 7 >> Topic: wordcountPartition: 2Leader: 8 Replicas: 1,8,2 >> Isr: 8 >> Topic: wordcountPartition: 3Leader: 3 Replicas: 2,1,3 >> Isr: 3 >> Topic: wordcountPartition: 4Leader: 3 Replicas: 3,2,4 >> Isr: 3,2,4 >> Topic: wordcountPartition: 5Leader: 3 Replicas: 4,3,5 >> Isr: 3,5 >> Topic: wordcountPartition: 6Leader: 6 Replicas: 5,4,6 >> Isr: 6,5 >> Topic: wordcountPartition: 7Leader: 6 Replicas: 6,5,7 >> Isr: 6,5,7 >> >> I wrote a simple producer to write to this topic. 
However when running I >> get these messages in the logs - >> >> 2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching >> metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092 >> with correlation id 0 for 1 topic(s) Set(wordcount) >> 2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected >> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing >> 2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer: >> Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092 >> 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo: >> Error while fetching metadata partition 4 leader: nonereplicas: >> 3 >> (tr-pan-hclstr-13.amers1b.ciscloud:9092),2 >> (tr-pan-hclstr-12.amers1b.ciscloud:9092),4 >> (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated: >> true for topic partition [wordcount,4]: [class >> kafka.common.LeaderNotAvailableException] >> 2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected >> to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing >> 2014-10-13 19:02:33,543 WARN [main] >> kafka.producer.async.DefaultEventHandler: Produce request with correlation >> id 20611 failed due to [wordcount,5]: >> kafka.common.NotLeaderForPartitionException,[wordcount,6]: >> kafka.common.NotLeaderForPartitionException,[wordcount,7]: >> kafka.common.NotLeaderForPartitionException >> 2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected >> to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing >> 2014-10-13 19:02:33,725 WARN [main] >> kafka.producer.async.DefaultEventHandler: Produce request with correlation >> id 20612 failed due to [wordcount,0]: >> kafka.common.NotLeaderForPartitionException >> 2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected >> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing >> 2014-10-13 19:02:33,983 WARN [main] >> kafka.producer.async.DefaultEventHandler: Failed to send data since >> partitions [wordcount,4] don't have a leader >> >> >> Obviously something is terribly wrong... I am quite new to Kafka, hence >> these messages don't make any sense to me, except for the fact that it is >> telling me that some of the partitions don't have any leader. >> >> Could somebody be kind enough to explain the above message? >> >> A few more questions - >> >> (1) How does one get into this state? >> (2) How can I get out of this state? >> (3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't >> the partitions be balanced across all the brokers? >> (4) I can see that the Kafka service are running on all 8 nodes. (I used >> ps ax -o "pid pgid args" and I can see under the kafka Java process). >> (5) Is there a way I can force a re-balance? >> >>
Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
The following is a bit weird. It indicates no leader for partition 4, which is inconsistent with what describe-topic shows. 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo: Error while fetching metadata partition 4 leader: nonereplicas: 3 (tr-pan-hclstr-13.amers1b.ciscloud:9092),2 (tr-pan-hclstr-12.amers1b.ciscloud:9092),4 (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated: true for topic partition [wordcount,4]: [class kafka.common.LeaderNotAvailableException] Any error in the controller and the state-change log? Do you see broker 3 marked as dead in the controller log? Also, could you check if the broker registration in ZK ( https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper) has the correct host/port? Thanks, Jun On Mon, Oct 13, 2014 at 5:35 PM, Abraham Jacob wrote: > Hi All, > > I have a 8 node Kafka cluster (broker.id - 1..8). On this cluster I have a > topic "wordcount", which was 8 partitions with a replication factor of 3. > > So a describe of topic wordcount > # bin/kafka-topics.sh --describe --zookeeper > tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic > wordcount > > > Topic:wordcount PartitionCount:8ReplicationFactor:3 Configs: > Topic: wordcountPartition: 0Leader: 6 Replicas: 7,6,8 > Isr: 6,7,8 > Topic: wordcountPartition: 1Leader: 7 Replicas: 8,7,1 > Isr: 7 > Topic: wordcountPartition: 2Leader: 8 Replicas: 1,8,2 > Isr: 8 > Topic: wordcountPartition: 3Leader: 3 Replicas: 2,1,3 > Isr: 3 > Topic: wordcountPartition: 4Leader: 3 Replicas: 3,2,4 > Isr: 3,2,4 > Topic: wordcountPartition: 5Leader: 3 Replicas: 4,3,5 > Isr: 3,5 > Topic: wordcountPartition: 6Leader: 6 Replicas: 5,4,6 > Isr: 6,5 > Topic: wordcountPartition: 7Leader: 6 Replicas: 6,5,7 > Isr: 6,5,7 > > I wrote a simple producer to write to this topic. 
However when running I > get these messages in the logs - > > 2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching > metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092 > with correlation id 0 for 1 topic(s) Set(wordcount) > 2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected > to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing > 2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer: > Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092 > 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo: > Error while fetching metadata partition 4 leader: nonereplicas: 3 > (tr-pan-hclstr-13.amers1b.ciscloud:9092),2 > (tr-pan-hclstr-12.amers1b.ciscloud:9092),4 > (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated: > true for topic partition [wordcount,4]: [class > kafka.common.LeaderNotAvailableException] > 2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected > to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing > 2014-10-13 19:02:33,543 WARN [main] > kafka.producer.async.DefaultEventHandler: Produce request with correlation > id 20611 failed due to [wordcount,5]: > kafka.common.NotLeaderForPartitionException,[wordcount,6]: > kafka.common.NotLeaderForPartitionException,[wordcount,7]: > kafka.common.NotLeaderForPartitionException > 2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected > to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing > 2014-10-13 19:02:33,725 WARN [main] > kafka.producer.async.DefaultEventHandler: Produce request with correlation > id 20612 failed due to [wordcount,0]: > kafka.common.NotLeaderForPartitionException > 2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected > to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing > 2014-10-13 19:02:33,983 WARN [main] > kafka.producer.async.DefaultEventHandler: Failed to send data since > partitions [wordcount,4] don't have a leader > > > Obviously something is terribly wrong... I am quite new to Kafka, hence > these messages don't make any sense to me, except for the fact that it is > telling me that some of the partitions don't have any leader. > > Could somebody be kind enough to explain the above message? > > A few more questions - > > (1) How does one get into this state? > (2) How can I get out of this state? > (3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't > the partitions be balanced across all the brokers? > (4) I can see that the Kafka service are running on all 8 nodes. (I used > ps ax -o "pid pgid args" and I can see under the kafka Java process). > (5) Is there a way I can force a re-balance? > > > > Regards, > Jacob > > > > -- > ~ >
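To check the broker registration Jun refers to, the registration znode can be read directly with the bundled ZooKeeper shell. A sketch using this cluster's chroot from the commands above, with broker id 3 as the example:

    bin/zookeeper-shell.sh tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 get /brokers/ids/3
    # a live broker shows ephemeral JSON along the lines of {"host":"...","port":9092,...};
    # a "no node" error means that broker is not currently registered

The wiki page linked above documents the exact layout of these znodes.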
Kafka - NotLeaderForPartitionException / LeaderNotAvailableException
Hi All,

I have an 8 node Kafka cluster (broker.id 1..8). On this cluster I have a topic "wordcount", which has 8 partitions with a replication factor of 3.

So a describe of topic wordcount:

# bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount

Topic: wordcount  PartitionCount: 8  ReplicationFactor: 3  Configs:
  Topic: wordcount  Partition: 0  Leader: 6  Replicas: 7,6,8  Isr: 6,7,8
  Topic: wordcount  Partition: 1  Leader: 7  Replicas: 8,7,1  Isr: 7
  Topic: wordcount  Partition: 2  Leader: 8  Replicas: 1,8,2  Isr: 8
  Topic: wordcount  Partition: 3  Leader: 3  Replicas: 2,1,3  Isr: 3
  Topic: wordcount  Partition: 4  Leader: 3  Replicas: 3,2,4  Isr: 3,2,4
  Topic: wordcount  Partition: 5  Leader: 3  Replicas: 4,3,5  Isr: 3,5
  Topic: wordcount  Partition: 6  Leader: 6  Replicas: 5,4,6  Isr: 6,5
  Topic: wordcount  Partition: 7  Leader: 6  Replicas: 6,5,7  Isr: 6,5,7

I wrote a simple producer to write to this topic. However, when running it I get these messages in the logs:

2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092 with correlation id 0 for 1 topic(s) Set(wordcount)
2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer: Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092
2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo: Error while fetching metadata partition 4 leader: none replicas: 3 (tr-pan-hclstr-13.amers1b.ciscloud:9092),2 (tr-pan-hclstr-12.amers1b.ciscloud:9092),4 (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr: isUnderReplicated: true for topic partition [wordcount,4]: [class kafka.common.LeaderNotAvailableException]
2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,543 WARN [main] kafka.producer.async.DefaultEventHandler: Produce request with correlation id 20611 failed due to [wordcount,5]: kafka.common.NotLeaderForPartitionException,[wordcount,6]: kafka.common.NotLeaderForPartitionException,[wordcount,7]: kafka.common.NotLeaderForPartitionException
2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,725 WARN [main] kafka.producer.async.DefaultEventHandler: Produce request with correlation id 20612 failed due to [wordcount,0]: kafka.common.NotLeaderForPartitionException
2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,983 WARN [main] kafka.producer.async.DefaultEventHandler: Failed to send data since partitions [wordcount,4] don't have a leader

Obviously something is terribly wrong... I am quite new to Kafka, so these messages don't make much sense to me, beyond the fact that some of the partitions don't have a leader.

Could somebody be kind enough to explain the above messages?

A few more questions:

(1) How does one get into this state?
(2) How can I get out of this state?
(3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't the partitions be balanced across all the brokers?
(4) I can see that the Kafka service is running on all 8 nodes (I used ps ax -o "pid pgid args" and can see the Kafka Java process on each).
(5) Is there a way I can force a re-balance?
Regards,
Jacob

--
~
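As a general note on the producer side (a sketch, not something from this thread): transient NotLeaderForPartitionException errors during a leader change are normally absorbed by the producer's retry settings, since each retry refetches metadata and picks up the new leader. A minimal sketch using the 0.8-era Java producer API; the broker hosts and topic are taken from the messages above, and the message payload is a placeholder:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class RetryingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list",
                    "tr-pan-hclstr-11.amers1b.ciscloud:9092,tr-pan-hclstr-15.amers1b.ciscloud:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Retry failed sends with a backoff; each attempt refreshes
            // metadata, so a leader that moved is picked up automatically.
            props.put("message.send.max.retries", "5");
            props.put("retry.backoff.ms", "500");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("wordcount", "hello"));
            producer.close();
        }
    }

Retries only cover short leader elections, though; they will not help while a partition genuinely has no leader, as with [wordcount,4] above. On question (5), the bin/kafka-preferred-replica-election.sh tool that ships with Kafka can be run once all brokers are healthy to move leadership back to the preferred (first-listed) replica of each partition.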
Re: NotLeaderForPartitionException after creating a topic
The following exception seems suspicious, and I think it is fixed in 0.8 HEAD. Do you mind trying 0.8 HEAD and seeing if this can still be reproduced?

java.util.NoSuchElementException: key not found: /disk1/test/kafka-data
        at scala.collection.MapLike$class.default(MapLike.scala:223)
        at scala.collection.immutable.Map$Map1.default(Map.scala:93)
        at scala.collection.MapLike$class.apply(MapLike.scala:134)
        at scala.collection.immutable.Map$Map1.apply(Map.scala:93)
        at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:83)
        at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
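One additional observation, offered as an assumption from the stack trace rather than anything confirmed in this thread: the missing key is a log directory path, and the lookup that fails in Partition.getOrCreateReplica is keyed by the broker's log directories. So it may also be worth checking that log.dirs in server.properties on every broker names the directory exactly as it exists on disk, e.g.

    log.dirs=/disk1/test/kafka-data

with no trailing slash and no symlinked prefix, since a mismatch in the path string would produce exactly this kind of key-not-found failure.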
NotLeaderForPartitionException after creating a topic
Hello,

I am using 0.8.0-beta1 with two brokers. After creating a topic named "ead_click" with bin/kafka-create-topic.sh, bin/kafka-list-topic.sh shows that the topic seems to have been created successfully:

[2013-11-01 15:49:35,388] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
topic: ead_click partition: 0 leader: 1 replicas: 1 isr: 1
[2013-11-01 15:49:35,720] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)

But the producer for the topic says it cannot find the leader. Then I started a consumer, and an exception was thrown:

[2013-11-01 15:50:32,807] INFO [ConsumerFetcherThread-console-consumer-50963_qt103.corp.xxx.com-1383292232223-b874ebd0-0-1], Starting (kafka.consumer.ConsumerFetcherThread)
[2013-11-01 15:50:32,833] ERROR [console-consumer-50963_qt103.corp.xxx.com-1383292232223-b874ebd0-leader-finder-thread], Error due to (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
kafka.common.NotLeaderForPartitionException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at java.lang.Class.newInstance0(Class.java:355)
        at java.lang.Class.newInstance(Class.java:308)
        at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:70)
        at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:147)
        at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
        at kafka.server.AbstractFetcherThread.addPartition(AbstractFetcherThread.scala:180)
        at kafka.server.AbstractFetcherManager.addFetcher(AbstractFetcherManager.scala:80)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread$$anonfun$doWork$7.apply(ConsumerFetcherManager.scala:95)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread$$anonfun$doWork$7.apply(ConsumerFetcherManager.scala:92)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
        at scala.collection.Iterator$class.foreach(Iterator.scala:631)
        at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:92)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)

In the state-change.log of broker 0, this exception can be found:

[2013-11-01 15:25:04,608] TRACE Broker 1 received LeaderAndIsr request correlationId 9 from controller 1 epoch 1 starting the become-leader transition for partition [ead_click,0] (state.change.logger)
[2013-11-01 15:25:04,635] ERROR Error on broker 1 while processing LeaderAndIsr request correlationId 9 received from controller 1 epoch 1 for partition (ead_click,0) (state.change.logger)
java.util.NoSuchElementException: key not found: /disk1/test/kafka-data
        at scala.collection.MapLike$class.default(MapLike.scala:223)
        at scala.collection.immutable.Map$Map1.default(Map.scala:93)
        at scala.collection.MapLike$class.apply(MapLike.scala:134)
        at scala.collection.immutable.Map$Map1.apply(Map.scala:93)
        at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:83)
        at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
        at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
        at scala.collection.immutable.List.foreach(List.scala:45)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
        at scala.collection.immutable.List.map(List.scala:45)
        at kafka.cluster.Partition.makeLeader(Partition.scala:149)
        at kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeLeader(ReplicaManager.scala:257)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$3.apply(ReplicaManager.scala:221)
        at kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$3.apply(ReplicaManager.scala:213)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
        at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:213)
        at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:87)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:70)
        at kafka.serv
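When debugging this kind of failure, the per-partition leader view that clients see can also be queried directly with a metadata request, rather than relying on the list-topic output. A minimal sketch against the 0.8-era SimpleConsumer API; "broker-host" and the clientId are placeholders, and the topic name is taken from this thread:

    import java.util.Arrays;
    import kafka.javaapi.PartitionMetadata;
    import kafka.javaapi.TopicMetadata;
    import kafka.javaapi.TopicMetadataRequest;
    import kafka.javaapi.TopicMetadataResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class LeaderCheck {
        public static void main(String[] args) {
            // Any reachable broker can answer a metadata request.
            SimpleConsumer consumer =
                    new SimpleConsumer("broker-host", 9092, 30000, 64 * 1024, "leader-check");
            TopicMetadataResponse resp =
                    consumer.send(new TopicMetadataRequest(Arrays.asList("ead_click")));
            for (TopicMetadata tm : resp.topicsMetadata()) {
                for (PartitionMetadata pm : tm.partitionsMetadata()) {
                    // leader() is null while no leader is elected, which is the
                    // state that surfaces as NotLeaderForPartitionException and
                    // LeaderNotAvailableException in the client logs above.
                    System.out.println("partition " + pm.partitionId() + " leader: "
                            + (pm.leader() == null ? "none" : pm.leader().host()));
                }
            }
            consumer.close();
        }
    }

If this keeps reporting "none" for a partition well after topic creation, the brokers never completed the become-leader transition, which matches the LeaderAndIsr error in the state-change log above.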