Re: Getting NotLeaderForPartitionException for a very long time

2020-04-28 Thread M.Gopala Krishnan
Thanks Manoj for the comment.
When I say it is taking a long time, it is really long, i.e. more than a
day! That's why I am not able to understand the behavior...

Regards,
Gopal


On Tue, Apr 28, 2020 at 2:22 PM  wrote:

> A follower takes some time to become the leader when the leader is down. You
> can build retry logic around this to handle the situation.
>
> On 4/28/20, 1:08 AM, "M.Gopala Krishnan"  wrote:
>
> [External]
>
>
> Hi,
>
> I have a 3-node kafka cluster (replication-factor: 3). Suddenly one of the
> nodes in the cluster went down, and I started seeing the
> NotLeaderForPartitionException exception in my application logs when
> sending messages to one of the topics; however, for some of the topics I
> am still able to post and consume messages.
>
> I could see this problem lasting until all the kafka servers were
> restarted; after the restart things were all ok.
>
> Now, my question is: why is a new leader not elected for those topics
> (instead of the same NotLeaderForPartitionException being thrown over and
> over), and how do I get a new leader election to happen for these topics?
>
> *Exception Trace:*
>
> 2020-04-11 22:05:21,747 ERROR [pool-15-thread-297]
> [KafkaMessageProducer:92] Message send failed:
> java.util.concurrent.ExecutionException:
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server
> is not the leader for that topic-partition. at
>
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
> at
>
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
> at
>
> org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
>
> Regards,
> Gopal
>
>


Re: Getting NotLeaderForPartitionException for a very long time

2020-04-28 Thread Manoj.Agrawal2
A follower takes some time to become the leader when the leader is down. You can
build retry logic around this to handle the situation.
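For what it is worth, a minimal sketch of that retry idea with the plain Java producer
(the broker address, topic name, and retry counts below are placeholders, not something
from this thread):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");           // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Let the client itself retry transient errors such as NOT_LEADER_FOR_PARTITION
        // while it refreshes metadata and discovers the new leader.
        props.put(ProducerConfig.RETRIES_CONFIG, 10);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "k", "v");
            // Application-level retry on top of the client retries, for the case where
            // leadership takes longer to settle than retries * retry.backoff.ms.
            for (int attempt = 1; ; attempt++) {
                try {
                    producer.send(record).get();       // block so the error surfaces here
                    break;                             // success
                } catch (Exception e) {
                    if (e.getCause() instanceof RetriableException && attempt < 5) {
                        Thread.sleep(1000L * attempt); // simple backoff before re-sending
                    } else {
                        throw e;                       // not retriable, or out of attempts
                    }
                }
            }
        }
    }
}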

On 4/28/20, 1:08 AM, "M.Gopala Krishnan"  wrote:

[External]


Hi,

I have a 3-node kafka cluster (replication-factor: 3). Suddenly one of the
nodes in the cluster went down, and I started seeing the
NotLeaderForPartitionException exception in my application logs when
sending messages to one of the topics; however, for some of the topics I
am still able to post and consume messages.

I could see this problem lasting until all the kafka servers were
restarted; after the restart things were all ok.

Now, my question is: why is a new leader not elected for those topics
(instead of the same NotLeaderForPartitionException being thrown over and
over), and how do I get a new leader election to happen for these topics?

*Exception Trace:*

2020-04-11 22:05:21,747 ERROR [pool-15-thread-297]
[KafkaMessageProducer:92] Message send failed:
java.util.concurrent.ExecutionException:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition. at

org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at

org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at

org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)

Regards,
Gopal




Getting NotLeaderForPartitionException for a very long time

2020-04-28 Thread M.Gopala Krishnan
Hi,

I have a 3-node kafka cluster (replication-factor: 3). Suddenly one of the
nodes in the cluster went down, and I started seeing the
NotLeaderForPartitionException exception in my application logs when
sending messages to one of the topics; however, for some of the topics I
am still able to post and consume messages.

I could see this problem lasting until all the kafka servers were
restarted; after the restart things were all ok.

Now, my question is: why is a new leader not elected for those topics
(instead of the same NotLeaderForPartitionException being thrown over and
over), and how do I get a new leader election to happen for these topics?

*Exception Trace:*

2020-04-11 22:05:21,747 ERROR [pool-15-thread-297]
[KafkaMessageProducer:92] Message send failed:
java.util.concurrent.ExecutionException:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition. at
org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at
org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at
org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)

Regards,
Gopal
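As an aside (not from the original mail): a quick way to see which partitions currently
have no leader or are under-replicated after a broker failure; the zookeeper address is
a placeholder.

bin/kafka-topics.sh --describe --zookeeper zk1:2181 --unavailable-partitions
bin/kafka-topics.sh --describe --zookeeper zk1:2181 --under-replicated-partitions

If the dead broker was the only in-sync replica for a partition, no new leader can be
elected until it comes back (or unclean leader election is allowed), which is one way
the situation described above can persist.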


UnknownTopicOrPartitionException & NotLeaderForPartitionException upon replication of topics.

2018-05-28 Thread Srinivas, Kaushik (Nokia - IN/Bangalore)
Hi Team,



Running kafka & zookeeper in a kubernetes cluster.

No. of brokers: 3

No. of partitions per topic: 3

Creating a topic with 3 partitions, and it looks like all the partitions are up.

Below is a snapshot to confirm the same:

Topic:applestore   PartitionCount:3   ReplicationFactor:3   Configs:
Topic: applestore  Partition: 0  Leader: 1001  Replicas: 1001,1003,1002  Isr: 1001,1003,1002
Topic: applestore  Partition: 1  Leader: 1002  Replicas: 1002,1001,1003  Isr: 1002,1001,1003
Topic: applestore  Partition: 2  Leader: 1003  Replicas: 1003,1002,1001  Isr: 1003,1002,1001
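(For reference, describe output like the above typically comes from a command along
these lines; the zookeeper address is a placeholder.)

bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic applestore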

But in the brokers, as soon as the topics are created, the stack traces below 
appear:

error 1:
[2018-05-28 08:00:31,875] ERROR [ReplicaFetcher replicaId=1001, leaderId=1003, 
fetcherId=7] Error for partition applestore-2 to broker 
1003:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This 
server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)

error 2 :
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, 
fetcherId=0] Error for partition apple-6 to broker 
1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

When we try producing records to each specific partition, it works fine, and the 
log size across the replicated brokers appears to be equal, which means 
replication is happening fine.
Attaching the two stack trace files.

Why are these stack traces appearing? Can we ignore them if they are just noise?

Best regards,
kaushik
[2018-05-28 00:39:46,195] TRACE [Broker id=1003] Cached leader info 
PartitionState(controllerEpoch=1, leader=1001, leaderEpoch=5, isr=[1001, 1002, 
1003], zkVersion=8, replicas=[1003, 1001, 1002], offlineReplicas=[]) for 
partition steve-0 in response to UpdateMetadata request sent by controller 1001 
epoch 1 with correlation id 6 (state.change.logger)
[2018-05-28 00:39:46,195] TRACE [Broker id=1003] Cached leader info 
PartitionState(controllerEpoch=1, leader=1001, leaderEpoch=3, isr=[1001, 1002, 
1003], zkVersion=6, replicas=[1001, 1002, 1003], offlineReplicas=[]) for 
partition apple-7 in response to UpdateMetadata request sent by controller 1001 
epoch 1 with correlation id 6 (state.change.logger)
[2018-05-28 00:43:20,093] TRACE [Broker id=1003] Received LeaderAndIsr request 
PartitionState(controllerEpoch=1, leader=1003, leaderEpoch=6, 
isr=1001,1002,1003, zkVersion=9, replicas=1003,1001,1002, isNew=false) 
correlation id 7 from controller 1001 epoch 1 for partition apple-0 
(state.change.logger)
[2018-05-28 00:43:20,093] TRACE [Broker id=1003] Handling LeaderAndIsr request 
correlationId 7 from controller 1001 epoch 1 starting the become-leader 
transition for partition apple-0 (state.change.logger)
[2018-05-28 00:43:20,193] INFO [ReplicaFetcherManager on broker 1003] Removed 
fetcher for partitions apple-0 (kafka.server.ReplicaFetcherManager)
[2018-05-28 00:43:20,193] INFO [Partition apple-0 broker=1003] apple-0 starts 
at Leader Epoch 6 from offset 0. Previous Leader Epoch was: 5 
(kafka.cluster.Partition)
[2018-05-28 00:43:20,193] TRACE [Broker id=1003] Stopped fetchers as part of 
become-leader request from controller 1001 epoch 1 with correlation id 7 for 
partition apple-0 (last update controller epoch 1) (state.change.logger)
[2018-05-28 00:43:20,194] TRACE [Broker id=1003] Completed LeaderAndIsr request 
correlationId 7 from controller 1001 epoch 1 for the become-leader transition 
for partition apple-0 (state.change.logger)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, 
fetcherId=0] Error for partition apple-6 to broker 
1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, 
fetcherId=0] Error for partition steve-0 to broker 
1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:20,993] TRACE [Broker id=1003] Cached leader info 
PartitionState(controllerEpoch=1, leader=1003, leaderEpoch=6, isr=[1001, 1002, 
1003], zkVersion=9, replicas=[1003, 1001, 1002], offlineReplicas=[]) for 
partition apple-0 in response to UpdateMetadata request sent by controller 1001 
epoch 1 with correlation id 8 (state.change.logger)
[2018-05-28 00:43:20,993] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, 
fetcherId=0] Error for partition mango2-0 to broker 
1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2018-05-28 00:43:21,268] ERROR [ReplicaFetcher replicaId=1003, leaderId=1001, 
fetcherId=0] Error for partition

NotLeaderForPartitionException while restarting kafka brokers

2018-04-12 Thread Mihaela Stoycheva
Hello,

I have a Kafka Streams application that is consuming from two topics and
internally aggregating, transforming and joining data. Recently I was doing
a rolling restart of the production Kafka brokers and the following errors
appeared in my Kafka Streams application:

2018-04-12 10:39:13,097 [WARN ]  (1-producer)
org.apache.kafka.clients.producer.internals.Sender- [Producer
clientId=-f063f0c2-e05f-443a-8ac2-e1298b1a274c-StreamThread-1-producer]
Got error produce response with correlation id 2892774 on
topic-partition
-KSTREAM-KEY-SELECT-14-repartition-2, retrying
(9 attempts left). Error: NOT_LEADER_FOR_PARTITION

After the 9 attempts the following error was logged:

2018-04-11 10:39:14,125 [ERROR]  (1-producer)
org.apache.kafka.streams.processor.internals.RecordCollectorImpl-
task [0_2] Error sending record (key  value  timestamp
1523522324694) to topic
-KSTREAM-KEY-SELECT-14-repartition due to {};
No more records will be sent and no more offsets will be recorded for
this task.org.apache.kafka.common.errors.NotLeaderForPartitionException:
This server is not the leader for that topic-partition.

Followed by:

2018-04-12 10:39:14,140 [ERROR]  (amThread-1)
org.apache.kafka.streams.processor.internals.ProcessorStateManager
- task [0_2] Failed to flush state store
KSTREAM-AGGREGATE-STATE-STORE-03:
org.apache.kafka.streams.errors.StreamsException: task [0_2] Abort
sending since an error caught with a previous record (key  value
 timestamp 1523522324694) to topic
-KSTREAM-KEY-SELECT-14-repartition due to
org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server is not the leader for that topic-partition..
at 
org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
~[arb-arpa.jar:1.4.0.0]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
~[arb-arpa.jar:1.4.0.0]
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
~[arb-arpa.jar:1.4.0.0]
at java.lang.Thread.run(Thread.java:748)
~[na:1.8.0_131]org.apache.kafka.common.errors.NotLeaderForPartitionException:
This server is not the leader for that topic-partition.

The application died, stopped processing any new messages, and had to be
restarted. In production there are 5 brokers and the replication factor is
3. The version of kafka streams that I use is 1.0.1 and the kafka version
of the broker is 1.0.1. My question is whether this is expected behavior.
Also, is there any way to deal with it?
Regards,
Mihaela Stoycheva
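(A sketch, not from the thread: in Streams 1.0.x the internal producer's retry settings
can be raised via producerPrefix so that a rolling broker restart is retried through
rather than failing the task; the application id, bootstrap servers, and values below
are placeholders.)

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsRetryConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        // Pass producer configs through to the Streams-internal producer so it keeps
        // retrying NOT_LEADER_FOR_PARTITION while leadership moves during the restart,
        // instead of giving up after a few attempts and aborting the task.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 20);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 500);
        return props;
    }
}

The repartition-topic sends in the stack trace above go through exactly this internal
producer, so these settings govern how long it keeps retrying before the task fails.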


NotLeaderForPartitionException in 0.8.x

2017-04-06 Thread Moiz Raja
Hi All,

I understand that we get a NotLeaderForPartitionException when the client tries 
to publish a message to a server which is not the leader for a given partition. 
I also understand that once leadership does change, the client will learn that 
and start publishing messages to the right server. I am, however, seeing that 
sometimes even after leadership has changed, the client continues to send 
messages to the wrong server, resulting in it continuously getting the 
NotLeaderForPartitionException.

I am not sure how to reproduce this problem because we have only seen this 
sporadically in certain production deployments. When we get into this situation 
one step which usually works is to restart zookeeper, kafka and the client. 
Obviously this is not acceptable. Is there a workaround that I could implement 
which would help remediate this situation? For example, would re-creating the 
KafkaProducer help in this scenario?

Thanks,
-Moiz

Re: Getting NotLeaderForPartitionException in kafka broker

2017-02-26 Thread 宋鑫/NemoVisky
Following this.


宋鑫/NemoVisky


Re: NotLeaderForPartitionException while doing repartitioning

2016-12-22 Thread Dvorkin, Eugene (CORP)
I had this issue recently, on broker 9. Check what errors you got from the 
state-change log:
kafka-run-class kafka.tools.StateChangeLogMerger --logs 
/var/log/kafka/state-change.log --topic __consumer_offsets

If it complains about the connection, it may be that this broker is not reading 
data from zookeeper.
Check zookeeper for this topic – the number of partitions.
If it is there correctly, restart the broker.
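(For the zookeeper check, one way to see what zookeeper currently records as the leader
and ISR for a partition; the host, topic, and partition number are placeholders.)

bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/__consumer_offsets/partitions/29/state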




On 12/21/16, 11:49 PM, "Stephane Maarek"  wrote:

Hi,

I’m doing a repartitioning from broker 4 5 6 to broker 7 8 9. I’m getting a
LOT of the following errors (for all topics):

[2016-12-22 04:47:21,957] ERROR [ReplicaFetcherThread-0-9], Error for
partition [__consumer_offsets,29] to broker
9:org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server is not the leader for that topic-partition.
(kafka.server.ReplicaFetcherThread)


1) Is that bad?
2) why is it happening? I’m just running the normal reassign partition tool…


I’m getting this with kafka 0.10.1.0

Regards,
Stephane





NotLeaderForPartitionException while doing repartitioning

2016-12-21 Thread Stephane Maarek
Hi,

I’m doing a repartitioning from broker 4 5 6 to broker 7 8 9. I’m getting a
LOT of the following errors (for all topics):

[2016-12-22 04:47:21,957] ERROR [ReplicaFetcherThread-0-9], Error for
partition [__consumer_offsets,29] to broker
9:org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server is not the leader for that topic-partition.
(kafka.server.ReplicaFetcherThread)


1) Is that bad?
2) Why is it happening? I’m just running the normal partition reassignment tool…


I’m getting this with kafka 0.10.1.0

Regards,
Stephane


Re: The connection between kafka and zookeeper is often closed by zookeeper, leading to NotLeaderForPartitionException: This server is not the leader for that topic-partition.

2016-12-19 Thread Guozhang Wang
Xiaoyuan,

I am not an expert in ZK so here is what I can tell:

1. "NotLeaderForPartitionException" is not usually thrown when ZK
connection timed out, it is thrown when the produce requests has arrived
the broker but the brokers think themselves as not the leader (any more)
for the requested partitions. So it usually indicates that the partition
leaders has migrated.

2. It is generally suggested to not co-located the Kafka brokers and
Zookeepers on the same machines, since both are memory and IO-consumption
heavy applications and hence could increase long GCs, which can also
causing ZK or Kafka brokers "stalls". See this for more details:

https://kafka.apache.org/documentation/#zk

2. Without the version information of the brokers and clients I cannot tell
further what could be the issue.


Guozhang


On Thu, Dec 15, 2016 at 5:17 PM, Xiaoyuan Chen <253441...@qq.com> wrote:

> Any solution?
>
>
>
>
> -- Original message --
> From: "Xiaoyuan Chen"<253441...@qq.com>;
> Sent: Friday, December 9, 2016, 10:15 AM
> To: "users";
> Subject: The connection between kafka and zookeeper is often closed
> by zookeeper, leading to NotLeaderForPartitionException: This server is not
> the leader for that topic-partition.
>
>
>
> Hi guys,
>
> Situation:
>   3 nodes, each 32G memory, CPU 24 cores, 1T hd.
>   3 brokers on 3 nodes, and 3 zookeeper on these 3 nodes too, all the
> properties are default, start the zookeeper cluster and kafka cluster.
> Create a topic (3 replicas, 6 partitions), like below:
> bin/kafka-topics.sh --create --zookeeper hw168:2181
> --replication-factor 3 --partitions 6 --topic test
>   And run the ProducerPerformance tool shipped with kafka on two of the nodes
> at the same time, which means we have two producers; command like below:
> bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance
> --topic test --num-records 1 --record-size 100 --throughput -1
> --producer-props bootstrap.servers=hw168:9092 buffer.memory=67108864
> batch.size=65536 acks=1
>
> Problem:
>   We can see from the producer a lot of NotLeaderForPartitionException:
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is not the leader for that topic-partition.
> …
>
> Track the process (by using DEBUG):
>   There is an INFO:
> INFO Client session timed out, have not heard from server in 11647ms
> for sessionid 0x258de4a26a4, closing socket connection and attempting
> reconnect (org.apache.zookeeper.ClientCnxn)
>
>   And we found that the connection between the zkClient (held by kafka) and
> the zookeeper server is closed by the zookeeper server; the reason is a
> session timeout. For details:
> [2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1:
> Skipping update high watermark since Old hw 15986847 [8012779 : 1068525112]
> is larger than new hw 15986847 [8012779 : 1068525112] for partition
> [test,5]. All leo's are 16566175 [16025299 : 72477384],15986847 [8012779 :
> 1068525112],16103549 [16025299 : 10485500] (kafka.cluster.Partition)
> [2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 =>
> 72475508 to 16025299.index. (kafka.log.OffsetIndex)
> [2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request
> key test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
> [2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1:
> Skipping update high watermark since Old hw 16064424 [16025299 : 5242750]
> is larger than new hw 16064424 [16025299 : 5242750] for partition [test,2].
> All leo's are 16566175 [16025299 : 72477384],16205274 [16025299 :
> 24116650],16064424 [16025299 : 5242750] (kafka.cluster.Partition)
> [2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce
> to local log in 10821 ms (kafka.server.ReplicaManager)
> [2016-12-08 20:24:11,369] INFO Client session timed out, have not
> heard from server in 11647ms for sessionid 0x258de4a26a4, closing
> socket connection and attempting reconnect (org.apache.zookeeper.
> ClientCnxn)
>
>   Please note the timestamps: there is no DEBUG between 20:24:00,547 and
> 20:24:11,368, which already exceeds the session timeout (6000ms), so
> it causes this disconnection. We kept digging:
>   We found that it got stuck in the function:
>   selector.select(waitTimeOut); — in the method doTransport(…) in
> class org.apache.zookeeper.ClientCnxnSocketNIO
>   and that is the time we got no

Aw: Re: NotLeaderForPartitionException

2016-12-16 Thread Sven Ludwig
Hi,

thank you for the nice clarification.

This is also described indirectly here, which I had not found before: 
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Partitioningandbootstrapping

Perhaps the log level of this message could be reduced to WARN, given that it 
is quite common and usually recovers on its own.

Kind Regards
Sven


Sent: Tuesday, 06 December 2016 at 21:47
From: "Apurva Mehta" 
To: users@kafka.apache.org
Subject: Re: NotLeaderForPartitionException
Hi Sven,

You will see this exception during leader election. When the leader for a
partition moves to another broker, there is a period during which the
replicas would still connect to the original leader, at which point they
will raise this exception. This should be a very short period, after which
they will connect to and replicate from the new leader correctly.

This is not a fatal error, and you will see it if you are bouncing brokers
(since all the leaders on that broker will have to move after the bounce).
You may also see it if some brokers have connectivity issues: they may be
considered dead, and their partitions would be moved elsewhere.

Hope this helps,
Apurva

On Tue, Dec 6, 2016 at 10:06 AM, Sven Ludwig  wrote:

> Hello,
>
> in our Kafka clusters we sometimes observe a specific ERROR log-statement,
> and therefore we have doubts whether it is already running stable in our
> configuration. This occurs every now and then, like two or three times in a
> day. It is actually the predominant ERROR log-statement in our cluster.
> Example:
>
> [2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for
> partition [,] to broker
> 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is
> not the leader for that topic-partition. (kafka.server.
> ReplicaFetcherThread)
>
> We already asked Google, but we did not find sufficient answers to our
> questions, therefore I am asking on the mailing list:
>
> 1. What are the possible reasons for this particular error?
>
> 2. What are the implications of it?
>
> 3. What can be done to prevent it?
>
> Best Regards,
> Sven
>


Re: The connection between kafka and zookeeper is often closed by zookeeper, leading to NotLeaderForPartitionException: This server is not the leader for that topic-partition.

2016-12-15 Thread Xiaoyuan Chen
Any solution?




-- Original message --
From: "Xiaoyuan Chen"<253441...@qq.com>;
Sent: Friday, December 9, 2016, 10:15 AM
To: "users";
Subject: The connection between kafka and zookeeper is often closed by zookeeper, 
leading to NotLeaderForPartitionException: This server is not the leader for that 
topic-partition.



Hi guys,

Situation:
  3 nodes, each 32G memory, CPU 24 cores, 1T hd.
  3 brokers on 3 nodes, and 3 zookeepers on these 3 nodes too; all the 
properties are default. Start the zookeeper cluster and the kafka cluster.
  Create a topic (3 replicas, 6 partitions), like below:
bin/kafka-topics.sh --create --zookeeper hw168:2181 --replication-factor 3 
--partitions 6 --topic test
  And run the ProducerPerformance tool shipped with kafka on two of the nodes at 
the same time, which means we have two producers; command like below:
bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic 
test --num-records 1 --record-size 100 --throughput -1 --producer-props 
bootstrap.servers=hw168:9092 buffer.memory=67108864 batch.size=65536 acks=1

Problem:
  We can see from the producer a lot of NotLeaderForPartitionException:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
…

Track the process (by using DEBUG):
  There is an INFO:
INFO Client session timed out, have not heard from server in 11647ms for 
sessionid 0x258de4a26a4, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn)

  And we found that the connection between the zkClient (held by kafka) and the 
zookeeper server is closed by the zookeeper server; the reason is a session 
timeout. For details:
[2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1: Skipping 
update high watermark since Old hw 15986847 [8012779 : 1068525112] is larger 
than new hw 15986847 [8012779 : 1068525112] for partition [test,5]. All leo's 
are 16566175 [16025299 : 72477384],15986847 [8012779 : 1068525112],16103549 
[16025299 : 10485500] (kafka.cluster.Partition)
[2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 => 72475508 to 
16025299.index. (kafka.log.OffsetIndex)
[2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request key 
test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1: Skipping 
update high watermark since Old hw 16064424 [16025299 : 5242750] is larger than 
new hw 16064424 [16025299 : 5242750] for partition [test,2]. All leo's are 
16566175 [16025299 : 72477384],16205274 [16025299 : 24116650],16064424 
[16025299 : 5242750] (kafka.cluster.Partition)
[2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce to 
local log in 10821 ms (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,369] INFO Client session timed out, have not heard 
from server in 11647ms for sessionid 0x258de4a26a4, closing socket 
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

  Please note the timestamps: there is no DEBUG between 20:24:00,547 and 
20:24:11,368, which already exceeds the session timeout (6000ms), so it 
causes this disconnection. We kept digging:
  We found that it got stuck in the function:
  selector.select(waitTimeOut); -- in the method doTransport(...) in 
class org.apache.zookeeper.ClientCnxnSocketNIO
  and that is the time we got no DEBUG.
  For more details, Call procedure (zookeeper client):
  org.apache.zookeeper.ClientCnxn -> run() -> doTransport(..)
  In the function run(), it checks every time whether there is a timeout; if 
not, it runs doTransport, but here doTransport costs about 10s, so on the next 
loop it finds the timeout.

  Going further, I thought there could be a deadlock at that time, so I kept 
printing the jstack of kafka and zookeeper, using a shell loop like below:
  while true; do echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12165` >> zkjstack; echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12425` >> kafkajstack; done
  And when I check the period where we got NO DEBUG in the stack logs, to my 
surprise there is also NO LOG at that time! Why?

So I'm confused: why did it get stuck in that function? Why is there no DEBUG 
or LOG in that weird period? Please help me. Thank you all.
  

Xiaoyuan

The connection between kafka and zookeeper is often closed by zookeeper, leading to NotLeaderForPartitionException: This server is not the leader for that topic-partition.

2016-12-08 Thread Jiecxy
Hi guys,

Situation:
  3 nodes, each 32G memory, CPU 24 cores, 1T hd.
  3 brokers on 3 nodes, and 3 zookeepers on these 3 nodes too; all the 
properties are default. Start the zookeeper cluster and the kafka cluster.
  Create a topic (3 replicas, 6 partitions), like below:
bin/kafka-topics.sh --create --zookeeper hw168:2181 --replication-factor 3 
--partitions 6 --topic test
  And run the ProducerPerformance tool shipped with kafka on two of the nodes at 
the same time, which means we have two producers; command like below:
bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic 
test --num-records 1 --record-size 100 --throughput -1 --producer-props 
bootstrap.servers=hw168:9092 buffer.memory=67108864 batch.size=65536 acks=1

Problem:
  We can see from the producer a lot of NotLeaderForPartitionException:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server 
is not the leader for that topic-partition.
…

Track the process (by using DEBUG):
  There is an INFO:
INFO Client session timed out, have not heard from server in 11647ms for 
sessionid 0x258de4a26a4, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn)

  And we found that the connection between the zkClient (held by kafka) and the 
zookeeper server is closed by the zookeeper server; the reason is a session 
timeout. For details:
[2016-12-08 20:24:00,547] DEBUG Partition [test,5] on broker 1: Skipping 
update high watermark since Old hw 15986847 [8012779 : 1068525112] is larger 
than new hw 15986847 [8012779 : 1068525112] for partition [test,5]. All leo's 
are 16566175 [16025299 : 72477384],15986847 [8012779 : 1068525112],16103549 
[16025299 : 10485500] (kafka.cluster.Partition)
[2016-12-08 20:24:00,547] DEBUG Adding index entry 16566161 => 72475508 to 
16025299.index. (kafka.log.OffsetIndex)
[2016-12-08 20:24:11,368] DEBUG [Replica Manager on Broker 1]: Request key 
test-2 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,368] DEBUG Partition [test,2] on broker 1: Skipping 
update high watermark since Old hw 16064424 [16025299 : 5242750] is larger than 
new hw 16064424 [16025299 : 5242750] for partition [test,2]. All leo's are 
16566175 [16025299 : 72477384],16205274 [16025299 : 24116650],16064424 
[16025299 : 5242750] (kafka.cluster.Partition)
[2016-12-08 20:24:11,369] DEBUG [Replica Manager on Broker 1]: Produce to 
local log in 10821 ms (kafka.server.ReplicaManager)
[2016-12-08 20:24:11,369] INFO Client session timed out, have not heard 
from server in 11647ms for sessionid 0x258de4a26a4, closing socket 
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

  Please note the timestamps: there is no DEBUG between 20:24:00,547 and 
20:24:11,368, which already exceeds the session timeout (6000ms), so it 
causes this disconnection. We kept digging:
  We found that it got stuck in the function:
  selector.select(waitTimeOut); — in the method doTransport(…) in class 
org.apache.zookeeper.ClientCnxnSocketNIO
  and that is the time we got no DEBUG.
  For more details, Call procedure (zookeeper client):
  org.apache.zookeeper.ClientCnxn -> run() -> doTransport(..)
  In the function run(), it checks every time whether there is a timeout; if 
not, it runs doTransport, but here doTransport costs about 10s, so on the next 
loop it finds the timeout.

  Going further, I thought there could be a deadlock at that time, so I kept 
printing the jstack of kafka and zookeeper, using a shell loop like below:
  while true; do echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12165` >> zkjstack; echo -e "\n\n"`date "+%Y-%m-%d %H:%M:%S,%N"`"\n"`jstack 
12425` >> kafkajstack; done
  And when I check the period where we got NO DEBUG in the stack logs, to my 
surprise there is also NO LOG at that time! Why?

So I'm confused: why did it get stuck in that function? Why is there no DEBUG 
or LOG in that weird period? Please help me. Thank you all.
  

Xiaoyuan
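(Not from the thread, but given the 11647 ms gap against the 6000 ms session timeout
described above, one common mitigation is to raise the broker's zookeeper session
timeout in server.properties so that short I/O or GC stalls do not drop the session;
the value below is only an example.)

# server.properties (each broker)
zookeeper.session.timeout.ms=30000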



Re: NotLeaderForPartitionException

2016-12-06 Thread Apurva Mehta
Hi Sven,

You will see this exception during leader election. When the leader for a
partition moves to another broker, there is a period during which the
replicas would still connect to the original leader, at which point they
will raise this exception. This should be a very short period, after which
they will connect to and replicate from the new leader correctly.

This is not a fatal error, and you will see it if you are bouncing brokers
(since all the leaders on that broker will have to move after the bounce).
You may also see it if some brokers have connectivity issues: they may be
considered dead, and their partitions would be moved elsewhere.

Hope this helps,
Apurva

On Tue, Dec 6, 2016 at 10:06 AM, Sven Ludwig  wrote:

> Hello,
>
> in our Kafka clusters we sometimes observe a specific ERROR log-statement,
> and therefore we have doubts whether it is already running stable in our
> configuration. This occurs every now and then, like two or three times in a
> day. It is actually the predominant ERROR log-statement in our cluster.
> Example:
>
> [2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for
> partition [,] to broker
> 3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> server is
> not the leader for that topic-partition. (kafka.server.
> ReplicaFetcherThread)
>
> We already asked Google, but we did not find sufficient answers to our
> questions, therefore I am asking on the mailing list:
>
> 1. What are the possible reasons for this particular error?
>
> 2. What are the implications of it?
>
> 3. What can be done to prevent it?
>
> Best Regards,
> Sven
>


NotLeaderForPartitionException

2016-12-06 Thread Sven Ludwig
Hello,
 
in our Kafka clusters we sometimes observe a specific ERROR log-statement, and 
therefore we have doubts whether it is already running stable in our 
configuration. This occurs every now and then, like two or three times in a 
day. It is actually the predominant ERROR log-statement in our cluster. Example:
 
[2016-12-06 17:14:50,909] ERROR [ReplicaFetcherThread-0-3], Error for partition 
[,] to broker 
3:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
 
We already asked Google, but we did not find sufficient answers to our 
questions, therefore I am asking on the mailing list:
 
1. What are the possible reasons for this particular error?
 
2. What are the implications of it?
 
3. What can be done to prevent it?
 
Best Regards,
Sven


I am getting NotLeaderForPartitionException exception on internal changelog table topics

2016-12-05 Thread Sachin Mittal
Hi,
In my application I have replicated internal changelog topics.

From time to time I get this exception and I am not able to figure out why.


[2016-12-05 11:05:10,635] ERROR Error sending record to topic
test-stream-key-table-changelog
(org.apache.kafka.streams.processor.internals.RecordCollector)
org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server is not the leader for that topic-partition.
[2016-12-05 11:05:10,755] ERROR Error sending record to topic
test-stream-key-table-changelog
(org.apache.kafka.streams.processor.internals.RecordCollector)


Please let me know what could be causing this.

Thanks

Sachin


Re: NotLeaderForPartitionException causing event loss

2016-08-12 Thread Zakee
Increasing the number of retries and/or retry.backoff.ms will help reduce the data 
loss. Figure out how long the NotLeaderForPartitionException (NLFPE) lasts (it 
happens only as long as the producer's metadata is stale), and configure the props 
below accordingly.

message.send.max.retries=3 (default)
retry.backoff.ms=100 (default)
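(Those names are for the old Scala producer; with the new Java producer mentioned in
the original mail, the rough equivalents are the settings below, values illustrative.)

# new-producer equivalent of message.send.max.retries (its default is 0 in 0.8.2)
retries=10
# same name in both producers (default 100)
retry.backoff.ms=500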


> On Aug 12, 2016, at 1:59 AM, sunil kalva  wrote:
> 
> We are seeing data loss whenever we see the "NotLeaderForPartitionException"
> exception.
> We are using the 0.8.2 java client publisher API with a callback; when I get
> the callback with the error I log the record in a file and retry it
> later.
> So the number of errors and the number of logged events match, but overall a
> few events are missing.
> Observations:
> The timestamps of the missing events fall just before the timestamp of the
> error occurrence, meaning every time we get this error we are losing a
> couple of events that were sent just before the error.
> 
> Is there any way to handle this? Should it be handled on the client side, or
> is some tuning required on the kafka side?
> 
> t
> SunilKalva




NotLeaderForPartitionException causing event loss

2016-08-12 Thread sunil kalva
We are seeing data loss whenever we see the "NotLeaderForPartitionException"
exception.
We are using the 0.8.2 java client publisher API with a callback; when I get
the callback with the error I log the record in a file and retry it
later.
So the number of errors and the number of logged events match, but overall a few
events are missing.
Observations:
The timestamps of the missing events fall just before the timestamp of the
error occurrence, meaning every time we get this error we are losing a
couple of events that were sent just before the error.

Is there any way to handle this? Should it be handled on the client side, or is
some tuning required on the kafka side?

t
SunilKalva


NotLeaderForPartitionException -- how to recover?

2016-07-25 Thread Samuel Taylor
Hi everyone,

Has anyone seen this error and/or is there a good way to recover from it?

I have two brokers running, and after attempting to produce to a topic
(using the kafka-python library), they are both spitting out lots of
messages that look like this:

ERROR [ReplicaFetcherThread-0-1], Error for partition
[__consumer_offsets,2] to broker 1:org.apache.kafka.common.errors.
NotLeaderForPartitionException: This server is not the leader for that
topic-partition. (kafka.server.ReplicaFetcherThread)
(The partition number varies. And on broker 1, the message says "to broker
2" instead of "to broker 1")

Listing the topics reveals that broker 1 is the leader for all 50
partitions of __consumer_offsets. Both brokers are complaining about the
same partitions, which is confusing to me because it looks like broker 1
should be the leader.

I have tried:
- restarting Kafka on each broker
- deleting the kafka-logs directory on each broker (and restarting)
- deleting the zookeeper directory on each server (and restarting)

Very grateful for any help!

Thanks,
Samuel
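(Not from the thread: if a later describe shows that broker 1 really is the leader but
the fetchers still complain, one thing sometimes worth trying is a preferred replica
election, which makes the controller re-elect the preferred leaders and re-send
leadership state to the brokers; the zookeeper address is a placeholder.)

bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181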


Re: NotLeaderForPartitionException: This server is not the leader for that topic-partition.

2016-02-08 Thread Robert Metzger
Sorry for reviving this old mailing list discussion.

I'm facing a similar issue while running a load test with many small topics
(100 topics) with 4 partitions each.
There is also a Flink user who's facing the issue:
https://issues.apache.org/jira/browse/FLINK-3066

Are you also writing into many topics at the same time?


On Sat, Nov 14, 2015 at 4:07 PM, Sean Morris (semorris) 
wrote:

> I have suddenly started having issues where, when I am producing data, I
> occasionally get "NotLeaderForPartitionException: This server is not the
> leader for that topic-partition". I am using Kafka 2.10-0.8.2.1 with the
> new Producer class and no retries. If I add retries to the Producer
> properties I end up with duplicates on the consumer side.
>
> Any idea what can be the cause of the NotLeaderForPartitionException?
>
> Thanks.
>
>


NotLeaderForPartitionException: This server is not the leader for that topic-partition.

2015-11-14 Thread Sean Morris (semorris)
I have suddenly started having issues where, when I am producing data, I 
occasionally get "NotLeaderForPartitionException: This server is not the leader 
for that topic-partition". I am using Kafka 2.10-0.8.2.1 with the new Producer 
class and no retries. If I add retries to the Producer properties I end up 
with duplicates on the consumer side.

Any idea what can be the cause of the NotLeaderForPartitionException?

Thanks.



Produce request failed due to NotLeaderForPartitionException (leader epoch is old)

2015-10-27 Thread Chaitanya GSK
Hi Ivan,

I am facing the same issue you had. Were you able to figure out why it’s happening? 
If so, could you please share it?

Thanks,
Chaitanya GSK

Produce request failed due to NotLeaderForPartitionException (leader epoch is old)

2015-06-16 Thread Ivan Balashov
Hi,

During a round of kafka data discrepancy investigation I came across a
bunch of recurring errors below:

producer.log
>

2015-06-14 13:06:25,591 WARN  [task-thread-9]
> (k.p.a.DefaultEventHandler:83) - Produce request with correlation id 624
> failed due to [mytopic,21]: kafka.common.NotLeaderForPartitionException


kafka.log

[2015-06-14 13:05:13,025] 418953499 [request-expiration-task] WARN
> kafka.server.ReplicaManager
> - [Replica Manager on Broker 61]: Fetch request with correlation id 1 from 
> client fetchReq on partition [mytopic,21] failed due to Leader not local for 
> partition [mytopic,21] on broker 61
>



> state-change.log
> [2015-06-14 13:05:11,495] WARN Broker 29 ignoring LeaderAndIsr request
> from controller 45 with correlation id 41799 epoch 27 for partition
> [mytopic,21] since its associated leader epoch 191 is old. Current leader
> epoch is 191 (state.change.logger)


The warnings keep repeating several times during a day, and sometimes they
coincide with timestamps of presumably missing records.

As far as I understand, occasional NotLeaderForPartitionExceptions are fine,
but does the same apply to the "old leader epoch" warning?

Could it be caused by any zk issue? However, I don't seem to find anything
particularly interesting in zk logs, except "likely client has closed
socket" or "Unexpected Exception:
java.nio.channels.CancelledKeyException". The former is all over the log
(must be the client issue) and the latter are rare and not correlated with
the original warnings.

Thanks,

Here are some bits of configuration:
Kafka 0.8.1.2, 3 brokers + 3 zk, 2x replication
request.required.acks=1
retry.backoff.ms=1000


Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-14 Thread tao xiao
I will try to reproduce this problem later this week.

Bouncing the broker fixed the issue, but the issue surfaced again after a
period of time. A little more context: the cluster was
deployed on VMs, and I discovered that the issue appeared whenever CPU wait
time was extremely high, like 90% of CPU time spent on I/O wait. I am more
interested in understanding under what circumstances this issue would
happen, so that I can take appropriate actions.

On Fri, May 15, 2015 at 8:04 AM, Jiangjie Qin 
wrote:

>
> If you can reproduce this problem steadily, once you see this issue, can
> you grep the controller log for topic partition in question and see if
> there is anything interesting?
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
> On 5/14/15, 3:43 AM, "tao xiao"  wrote:
>
> >Yes, it does exist in ZK and the node that had the
> >NotLeaderForPartitionException
> >is the leader of the topic
> >
> >On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin 
> >wrote:
> >
> >> Does this topic exist in Zookeeper?
> >>
> >> On 5/12/15, 11:35 PM, "tao xiao"  wrote:
> >>
> >> >Hi,
> >> >
> >> >Any updates on this issue? I keep seeing this issue happening over and
> >> >over
> >> >again
> >> >
> >> >On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:
> >> >
> >> >> Hi team,
> >> >>
> >> >> I have a 12 nodes cluster that has 800 topics and each of which has
> >> >>only 1
> >> >> partition. I observed that one of the node keeps generating
> >> >> NotLeaderForPartitionException that causes the node to be
> >>unresponsive
> >> >>to
> >> >> all requests. Below is the exception
> >> >>
> >> >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error
> >>for
> >> >> partition [topic1,0] to broker 12:class
> >> >> kafka.common.NotLeaderForPartitionException
> >> >> (kafka.server.ReplicaFetcherThread)
> >> >>
> >> >> All other nodes in the cluster generate lots of replication error
> >>too as
> >> >> shown below due to unresponsiveness of above node.
> >> >>
> >> >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
> >> >> request with correlation id 3630911 from client
> >> >>ReplicaFetcherThread-0-1 on
> >> >> partition [topic1,0] failed due to Leader not local for partition
> >> >> [cg22_user.item_attr_info.lcr,0] on broker 1
> >> >>(kafka.server.ReplicaManager)
> >> >>
> >> >> Any suggestion why the node runs into the unstable stage and any
> >> >> configuration I can set to prevent this?
> >> >>
> >> >> I use kafka 0.8.2.1
> >> >>
> >> >> And here is the server.properties
> >> >>
> >> >>
> >> >> broker.id=5
> >> >> port=9092
> >> >> num.network.threads=3
> >> >> num.io.threads=8
> >> >> socket.send.buffer.bytes=1048576
> >> >> socket.receive.buffer.bytes=1048576
> >> >> socket.request.max.bytes=104857600
> >> >> log.dirs=/mnt/kafka
> >> >> num.partitions=1
> >> >> num.recovery.threads.per.data.dir=1
> >> >> log.retention.hours=1
> >> >> log.segment.bytes=1073741824
> >> >> log.retention.check.interval.ms=30
> >> >> log.cleaner.enable=false
> >> >> zookeeper.connect=ip:2181
> >> >> zookeeper.connection.timeout.ms=6000
> >> >> unclean.leader.election.enable=false
> >> >> delete.topic.enable=true
> >> >> default.replication.factor=3
> >> >> num.replica.fetchers=3
> >> >> delete.topic.enable=true
> >> >> kafka.metrics.reporters=report.KafkaMetricsCollector
> >> >> straas.hubble.conf.file=/etc/kafka/report.conf
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Regards,
> >> >> Tao
> >> >>
> >> >
> >> >
> >> >
> >> >--
> >> >Regards,
> >> >Tao
> >>
> >>
> >
> >
> >--
> >Regards,
> >Tao
>
>


-- 
Regards,
Tao


Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-14 Thread Jiangjie Qin

If you can reproduce this problem steadily, once you see this issue, can
you grep the controller log for topic partition in question and see if
there is anything interesting?

Thanks.

Jiangjie (Becket) Qin

On 5/14/15, 3:43 AM, "tao xiao"  wrote:

>Yes, it does exist in ZK and the node that had the
>NotLeaderForPartitionException
>is the leader of the topic
>
>On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin 
>wrote:
>
>> Does this topic exist in Zookeeper?
>>
>> On 5/12/15, 11:35 PM, "tao xiao"  wrote:
>>
>> >Hi,
>> >
>> >Any updates on this issue? I keep seeing this issue happening over and
>> >over
>> >again
>> >
>> >On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:
>> >
>> >> Hi team,
>> >>
>> >> I have a 12 nodes cluster that has 800 topics and each of which has
>> >>only 1
>> >> partition. I observed that one of the node keeps generating
>> >> NotLeaderForPartitionException that causes the node to be
>>unresponsive
>> >>to
>> >> all requests. Below is the exception
>> >>
>> >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error
>>for
>> >> partition [topic1,0] to broker 12:class
>> >> kafka.common.NotLeaderForPartitionException
>> >> (kafka.server.ReplicaFetcherThread)
>> >>
>> >> All other nodes in the cluster generate lots of replication error
>>too as
>> >> shown below due to unresponsiveness of above node.
>> >>
>> >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
>> >> request with correlation id 3630911 from client
>> >>ReplicaFetcherThread-0-1 on
>> >> partition [topic1,0] failed due to Leader not local for partition
>> >> [cg22_user.item_attr_info.lcr,0] on broker 1
>> >>(kafka.server.ReplicaManager)
>> >>
>> >> Any suggestion why the node runs into the unstable stage and any
>> >> configuration I can set to prevent this?
>> >>
>> >> I use kafka 0.8.2.1
>> >>
>> >> And here is the server.properties
>> >>
>> >>
>> >> broker.id=5
>> >> port=9092
>> >> num.network.threads=3
>> >> num.io.threads=8
>> >> socket.send.buffer.bytes=1048576
>> >> socket.receive.buffer.bytes=1048576
>> >> socket.request.max.bytes=104857600
>> >> log.dirs=/mnt/kafka
>> >> num.partitions=1
>> >> num.recovery.threads.per.data.dir=1
>> >> log.retention.hours=1
>> >> log.segment.bytes=1073741824
>> >> log.retention.check.interval.ms=30
>> >> log.cleaner.enable=false
>> >> zookeeper.connect=ip:2181
>> >> zookeeper.connection.timeout.ms=6000
>> >> unclean.leader.election.enable=false
>> >> delete.topic.enable=true
>> >> default.replication.factor=3
>> >> num.replica.fetchers=3
>> >> delete.topic.enable=true
>> >> kafka.metrics.reporters=report.KafkaMetricsCollector
>> >> straas.hubble.conf.file=/etc/kafka/report.conf
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Tao
>> >>
>> >
>> >
>> >
>> >--
>> >Regards,
>> >Tao
>>
>>
>
>
>-- 
>Regards,
>Tao



Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-14 Thread Mayuresh Gharat
Can you try bouncing that broker?

Thanks,

Mayuresh

On Thu, May 14, 2015 at 3:43 AM, tao xiao  wrote:

> Yes, it does exist in ZK and the node that had the
> NotLeaderForPartitionException
> is the leader of the topic
>
> On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin 
> wrote:
>
> > Does this topic exist in Zookeeper?
> >
> > On 5/12/15, 11:35 PM, "tao xiao"  wrote:
> >
> > >Hi,
> > >
> > >Any updates on this issue? I keep seeing this issue happening over and
> > >over
> > >again
> > >
> > >On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:
> > >
> > >> Hi team,
> > >>
> > >> I have a 12 nodes cluster that has 800 topics and each of which has
> > >>only 1
> > >> partition. I observed that one of the node keeps generating
> > >> NotLeaderForPartitionException that causes the node to be unresponsive
> > >>to
> > >> all requests. Below is the exception
> > >>
> > >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for
> > >> partition [topic1,0] to broker 12:class
> > >> kafka.common.NotLeaderForPartitionException
> > >> (kafka.server.ReplicaFetcherThread)
> > >>
> > >> All other nodes in the cluster generate lots of replication error too
> as
> > >> shown below due to unresponsiveness of above node.
> > >>
> > >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
> > >> request with correlation id 3630911 from client
> > >>ReplicaFetcherThread-0-1 on
> > >> partition [topic1,0] failed due to Leader not local for partition
> > >> [cg22_user.item_attr_info.lcr,0] on broker 1
> > >>(kafka.server.ReplicaManager)
> > >>
> > >> Any suggestion why the node runs into the unstable stage and any
> > >> configuration I can set to prevent this?
> > >>
> > >> I use kafka 0.8.2.1
> > >>
> > >> And here is the server.properties
> > >>
> > >>
> > >> broker.id=5
> > >> port=9092
> > >> num.network.threads=3
> > >> num.io.threads=8
> > >> socket.send.buffer.bytes=1048576
> > >> socket.receive.buffer.bytes=1048576
> > >> socket.request.max.bytes=104857600
> > >> log.dirs=/mnt/kafka
> > >> num.partitions=1
> > >> num.recovery.threads.per.data.dir=1
> > >> log.retention.hours=1
> > >> log.segment.bytes=1073741824
> > >> log.retention.check.interval.ms=30
> > >> log.cleaner.enable=false
> > >> zookeeper.connect=ip:2181
> > >> zookeeper.connection.timeout.ms=6000
> > >> unclean.leader.election.enable=false
> > >> delete.topic.enable=true
> > >> default.replication.factor=3
> > >> num.replica.fetchers=3
> > >> delete.topic.enable=true
> > >> kafka.metrics.reporters=report.KafkaMetricsCollector
> > >> straas.hubble.conf.file=/etc/kafka/report.conf
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Regards,
> > >> Tao
> > >>
> > >
> > >
> > >
> > >--
> > >Regards,
> > >Tao
> >
> >
>
>
> --
> Regards,
> Tao
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-14 Thread tao xiao
Yes, it does exist in ZK and the node that had the
NotLeaderForPartitionException
is the leader of the topic

On Thu, May 14, 2015 at 6:12 AM, Jiangjie Qin 
wrote:

> Does this topic exist in Zookeeper?
>
> On 5/12/15, 11:35 PM, "tao xiao"  wrote:
>
> >Hi,
> >
> >Any updates on this issue? I keep seeing this issue happening over and
> >over
> >again
> >
> >On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:
> >
> >> Hi team,
> >>
> >> I have a 12 nodes cluster that has 800 topics and each of which has
> >>only 1
> >> partition. I observed that one of the node keeps generating
> >> NotLeaderForPartitionException that causes the node to be unresponsive
> >>to
> >> all requests. Below is the exception
> >>
> >> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for
> >> partition [topic1,0] to broker 12:class
> >> kafka.common.NotLeaderForPartitionException
> >> (kafka.server.ReplicaFetcherThread)
> >>
> >> All other nodes in the cluster generate lots of replication error too as
> >> shown below due to unresponsiveness of above node.
> >>
> >> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
> >> request with correlation id 3630911 from client
> >>ReplicaFetcherThread-0-1 on
> >> partition [topic1,0] failed due to Leader not local for partition
> >> [cg22_user.item_attr_info.lcr,0] on broker 1
> >>(kafka.server.ReplicaManager)
> >>
> >> Any suggestion why the node runs into the unstable stage and any
> >> configuration I can set to prevent this?
> >>
> >> I use kafka 0.8.2.1
> >>
> >> And here is the server.properties
> >>
> >>
> >> broker.id=5
> >> port=9092
> >> num.network.threads=3
> >> num.io.threads=8
> >> socket.send.buffer.bytes=1048576
> >> socket.receive.buffer.bytes=1048576
> >> socket.request.max.bytes=104857600
> >> log.dirs=/mnt/kafka
> >> num.partitions=1
> >> num.recovery.threads.per.data.dir=1
> >> log.retention.hours=1
> >> log.segment.bytes=1073741824
> >> log.retention.check.interval.ms=30
> >> log.cleaner.enable=false
> >> zookeeper.connect=ip:2181
> >> zookeeper.connection.timeout.ms=6000
> >> unclean.leader.election.enable=false
> >> delete.topic.enable=true
> >> default.replication.factor=3
> >> num.replica.fetchers=3
> >> delete.topic.enable=true
> >> kafka.metrics.reporters=report.KafkaMetricsCollector
> >> straas.hubble.conf.file=/etc/kafka/report.conf
> >>
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Tao
> >>
> >
> >
> >
> >--
> >Regards,
> >Tao
>
>


-- 
Regards,
Tao


Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-13 Thread Jiangjie Qin
Does this topic exist in Zookeeper?

On 5/12/15, 11:35 PM, "tao xiao"  wrote:

>Hi,
>
>Any updates on this issue? I keep seeing this issue happening over and
>over
>again
>
>On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:
>
>> Hi team,
>>
>> I have a 12 nodes cluster that has 800 topics and each of which has
>>only 1
>> partition. I observed that one of the node keeps generating
>> NotLeaderForPartitionException that causes the node to be unresponsive
>>to
>> all requests. Below is the exception
>>
>> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for
>> partition [topic1,0] to broker 12:class
>> kafka.common.NotLeaderForPartitionException
>> (kafka.server.ReplicaFetcherThread)
>>
>> All other nodes in the cluster generate lots of replication error too as
>> shown below due to unresponsiveness of above node.
>>
>> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
>> request with correlation id 3630911 from client
>>ReplicaFetcherThread-0-1 on
>> partition [topic1,0] failed due to Leader not local for partition
>> [cg22_user.item_attr_info.lcr,0] on broker 1
>>(kafka.server.ReplicaManager)
>>
>> Any suggestion why the node runs into the unstable stage and any
>> configuration I can set to prevent this?
>>
>> I use kafka 0.8.2.1
>>
>> And here is the server.properties
>>
>>
>> broker.id=5
>> port=9092
>> num.network.threads=3
>> num.io.threads=8
>> socket.send.buffer.bytes=1048576
>> socket.receive.buffer.bytes=1048576
>> socket.request.max.bytes=104857600
>> log.dirs=/mnt/kafka
>> num.partitions=1
>> num.recovery.threads.per.data.dir=1
>> log.retention.hours=1
>> log.segment.bytes=1073741824
>> log.retention.check.interval.ms=30
>> log.cleaner.enable=false
>> zookeeper.connect=ip:2181
>> zookeeper.connection.timeout.ms=6000
>> unclean.leader.election.enable=false
>> delete.topic.enable=true
>> default.replication.factor=3
>> num.replica.fetchers=3
>> delete.topic.enable=true
>> kafka.metrics.reporters=report.KafkaMetricsCollector
>> straas.hubble.conf.file=/etc/kafka/report.conf
>>
>>
>>
>>
>> --
>> Regards,
>> Tao
>>
>
>
>
>-- 
>Regards,
>Tao



Re: Getting NotLeaderForPartitionException in kafka broker

2015-05-12 Thread tao xiao
Hi,

Any updates on this issue? I keep seeing this issue happening over and over
again

On Thu, May 7, 2015 at 7:28 PM, tao xiao  wrote:

> Hi team,
>
> I have a 12 nodes cluster that has 800 topics and each of which has only 1
> partition. I observed that one of the node keeps generating
> NotLeaderForPartitionException that causes the node to be unresponsive to
> all requests. Below is the exception
>
> [2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for
> partition [topic1,0] to broker 12:class
> kafka.common.NotLeaderForPartitionException
> (kafka.server.ReplicaFetcherThread)
>
> All other nodes in the cluster generate lots of replication error too as
> shown below due to unresponsiveness of above node.
>
> [2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch
> request with correlation id 3630911 from client ReplicaFetcherThread-0-1 on
> partition [topic1,0] failed due to Leader not local for partition
> [cg22_user.item_attr_info.lcr,0] on broker 1 (kafka.server.ReplicaManager)
>
> Any suggestion why the node runs into the unstable stage and any
> configuration I can set to prevent this?
>
> I use kafka 0.8.2.1
>
> And here is the server.properties
>
>
> broker.id=5
> port=9092
> num.network.threads=3
> num.io.threads=8
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> log.dirs=/mnt/kafka
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> log.retention.hours=1
> log.segment.bytes=1073741824
> log.retention.check.interval.ms=30
> log.cleaner.enable=false
> zookeeper.connect=ip:2181
> zookeeper.connection.timeout.ms=6000
> unclean.leader.election.enable=false
> delete.topic.enable=true
> default.replication.factor=3
> num.replica.fetchers=3
> delete.topic.enable=true
> kafka.metrics.reporters=report.KafkaMetricsCollector
> straas.hubble.conf.file=/etc/kafka/report.conf
>
>
>
>
> --
> Regards,
> Tao
>



-- 
Regards,
Tao


Getting NotLeaderForPartitionException in kafka broker

2015-05-07 Thread tao xiao
Hi team,

I have a 12-node cluster with 800 topics, each of which has only 1
partition. I observed that one of the nodes keeps generating
NotLeaderForPartitionException, which causes the node to become
unresponsive to all requests. Below is the exception:

[2015-05-07 04:16:01,014] ERROR [ReplicaFetcherThread-1-12], Error for
partition [topic1,0] to broker 12:class
kafka.common.NotLeaderForPartitionException
(kafka.server.ReplicaFetcherThread)

All other nodes in the cluster also generate lots of replication errors, as
shown below, due to the unresponsiveness of the node above.

[2015-05-07 04:17:34,917] WARN [Replica Manager on Broker 1]: Fetch request
with correlation id 3630911 from client ReplicaFetcherThread-0-1 on
partition [topic1,0] failed due to Leader not local for partition
[cg22_user.item_attr_info.lcr,0] on broker 1 (kafka.server.ReplicaManager)

Any suggestions as to why the node runs into this unstable state, and any
configuration I can set to prevent it?

I use kafka 0.8.2.1

And here is the server.properties


broker.id=5
port=9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/kafka
num.partitions=1
num.recovery.threads.per.data.dir=1
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=30
log.cleaner.enable=false
zookeeper.connect=ip:2181
zookeeper.connection.timeout.ms=6000
unclean.leader.election.enable=false
delete.topic.enable=true
default.replication.factor=3
num.replica.fetchers=3
delete.topic.enable=true
kafka.metrics.reporters=report.KafkaMetricsCollector
straas.hubble.conf.file=/etc/kafka/report.conf
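
For what it's worth, this is how I check which partitions have fallen out of
sync while the errors are happening (a rough sketch; these kafka-topics.sh
flags may not exist on every release, and the ZooKeeper address is the same
placeholder used in the config above):

  # partitions whose ISR is smaller than the assigned replica list
  bin/kafka-topics.sh --describe --zookeeper ip:2181 --under-replicated-partitions

  # partitions that currently have no leader at all
  bin/kafka-topics.sh --describe --zookeeper ip:2181 --unavailable-partitions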




-- 
Regards,
Tao


Re: NotLeaderForPartitionException while doing performance test

2015-01-08 Thread Sa Li
Thanks, Jaikiran

I was trying to reproduce the same issue by running the same performance
test on the master node of the cluster, say exemplary-birds.master, and I
did see such errors again:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.

At the same time, I ran "lsof -i"; here is the output:

java   6991   root   28u  IPv6 4334833  0t0  TCP *:50320 (LISTEN)
java   6991   root   29u  IPv6 4351835  0t0  TCP
exemplary-birds.master:9092->exemplary-birds.master:50472 (ESTABLISHED)
java   6991   root   38u  IPv6 4366588  0t0  TCP
exemplary-birds.master:59536->complicated-laugh.master:2181 (ESTABLISHED)
java   6991   root  150u  IPv6 4361502  0t0  TCP *:9092 (LISTEN)
java   6991   root  151u  IPv6 4368439  0t0  TCP
exemplary-birds.master:9092->harmful-jar.master:51131 (ESTABLISHED)
java   6991   root  154u  IPv6 4365924  0t0  TCP
exemplary-birds.master:55248->voluminous-mass.master:9092 (ESTABLISHED)
java   6991   root  155u  IPv6 4366591  0t0  TCP
exemplary-birds.master:55245->voluminous-mass.master:9092 (ESTABLISHED)
java   6991   root  157u  IPv6 4365923  0t0  TCP
exemplary-birds.master:41946->harmful-jar.master:9092 (ESTABLISHED)
java   6991   root  176u  IPv6 4358833  0t0  TCP
exemplary-birds.master:55251->voluminous-mass.master:9092 (ESTABLISHED)
java   6991   root  179u  IPv6 4292501  0t0  TCP
exemplary-birds.master:41951->harmful-jar.master:9092 (ESTABLISHED)
java   6991   root  180u  IPv6 4338331  0t0  TCP
exemplary-birds.master:9092->harmful-jar.master:51133 (ESTABLISHED)
java   6991   root  181u  IPv6 4364530  0t0  TCP
exemplary-birds.master:9092->voluminous-mass.master:42897 (ESTABLISHED)
java   6991   root  182u  IPv6 4358834  0t0  TCP
exemplary-birds.master:9092->harmful-jar.master:51134 (ESTABLISHED)
java   6991   root  183u  IPv6 4354353  0t0  TCP
exemplary-birds.master:9092->voluminous-mass.master:42898 (ESTABLISHED)
java   6991   root  190u  IPv6 4351836  0t0  TCP
exemplary-birds.master:9092->localhost:40786 (ESTABLISHED)
java   6991   root  201u  IPv6 4364543  0t0  TCP
exemplary-birds.master:9092->harmful-jar.master:51135 (ESTABLISHED)
java   6991   root  202u  IPv6 4364544  0t0  TCP
exemplary-birds.master:9092->voluminous-mass.master:42899 (ESTABLISHED)
java   7218   root   44u  IPv6 4366240  0t0  TCP *:46256 (LISTEN)
java   7218   root   48u  IPv6 4366602  0t0  TCP
exemplary-birds.master:50472->exemplary-birds.master:9092 (ESTABLISHED)
java   7218   root   50u  IPv6 4350446  0t0  TCP
exemplary-birds.master:41960->harmful-jar.master:9092 (ESTABLISHED)
java   7218   root   51u  IPv6 4350447  0t0  TCP
localhost:40786->exemplary-birds.master:9092 (ESTABLISHED)
java   7218   root   52u  IPv6 4350448  0t0  TCP
exemplary-birds.master:55263->voluminous-mass.master:9092 (ESTABLISHED)
java  17582   root   44u  IPv6 4326187  0t0  TCP *:46316 (LISTEN)
ntpd  18649ntp   16u  IPv4  656334  0t0  UDP *:ntp
ntpd  18649ntp   17u  IPv6  656335  0t0  UDP *:ntp
ntpd  18649ntp   18u  IPv4  656341  0t0  UDP localhost:ntp
ntpd  18649ntp   19u  IPv4  656342  0t0  UDP
exemplary-birds.master:ntp
ntpd  18649ntp   20u  IPv6  656343  0t0  UDP localhost:ntp
ntpd  18649ntp   21u  IPv6  656344  0t0  UDP
[fe80::7a2b:cbff:fe1f:2e77]:ntp
sshd  21995   root3u  IPv4 4277546  0t0  TCP
exemplary-birds.master:ssh->10.100.68.15:60642 (ESTABLISHED)
sshd  22091 fitsum3u  IPv4 4277546  0t0  TCP
exemplary-birds.master:ssh->10.100.68.15:60642 (ESTABLISHED)
java  22152   root   21u  IPv6  213140  0t0  TCP *:52411 (LISTEN)
jav

Re: NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Jaikiran Pai
There are different ways to find the connection count and each one 
depends on the operating system that's being used. "lsof -i" is one 
option, for example, on *nix systems.
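
For example, a rough sketch on Linux (the address and port are placeholders;
point them at the broker you are interested in):

  # count established connections to the broker at 10.100.98.102:9092
  lsof -i TCP@10.100.98.102:9092 | grep ESTABLISHED | wc -l

  # netstat works as well if lsof is not available
  netstat -an | grep '10.100.98.102:9092' | grep ESTABLISHED | wc -l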


-Jaikiran
On Thursday 08 January 2015 11:40 AM, Sa Li wrote:

Yes, it is weird hostname, ;), that is what our system guys name it. How to
take a note to measure the connections open to 10.100.98.102?

Thanks

AL
On Jan 7, 2015 9:42 PM, "Jaikiran Pai"  wrote:


On Thursday 08 January 2015 01:51 AM, Sa Li wrote:


see this type of error again, back to normal in few secs

[2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/
10.100.98.102


That's a really weird hostname, the "harmful-jar.master". Is that really
your hostname? You mention that this happens during performance testing.
Have you taken a note of how many connection are open to that 10.100.98.102
IP when this "Connection refused" exception happens?

-Jaikiran


(org.apache.kafka.common.network.Selector)

java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at org.apache.kafka.common.network.Selector.poll(
Selector.java:232)
  at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
  at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at org.apache.kafka.common.network.Selector.poll(
Selector.java:232)
  at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
  at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
  at org.apache.kafka.common.network.Selector.poll(
Selector.java:232)
  at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
  at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
  at java.lang.Thread.run(Thread.java:745)
160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg
latency, 2418.0 max latency.
109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg
latency, 3529.0 max latency.
100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg
latency, 3858.0 max latency.

On Wed, Jan 7, 2015 at 12:07 PM, Sa Li  wrote:

  Hi, All

I am doing performance test by

bin/kafka-run-class.sh org.apache.kafka.clients.
tools.ProducerPerformance
test-rep-three 5 100 -1 acks=1 bootstrap.servers=
10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
buffer.memory=67108864 batch.size=8196

where the topic test-rep-three is described as follow:

bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic
test-rep-three
Topic:test-rep-threePartitionCount:8ReplicationFactor:3
Configs:
  Topic: test-rep-three   Partition: 0Leader: 100
  Replicas:
100,102,101   Isr: 102,101,100
  Topic: test-rep-three   Partition: 1Leader: 101
  Replicas:
101,100,102   Isr: 102,101,100
  Topic: test-rep-three   Partition: 2Leader: 102
  Replicas:
102,101,100   Isr: 101,102,100
  Topic: test-rep-three   Partition: 3Leader: 100
  Replicas:
100,101,102   Isr: 101,100,102
  Topic: test-rep-three   Partition: 4Leader: 101
  Replicas:
101,102,100   Isr: 102,100,101
  Topic: test-rep-three   Partition: 5Leader: 102
  Replicas:
102,100,101   Isr: 100,102,101
  Topic: test-rep-three   Partition: 6Leader: 102
  Replicas:
100,102,101   Isr: 102,101,100
  Topic: test-rep-three   Partition: 7Leader: 101
  Replicas:
101,100,102   Isr: 101,100,102

Apparently, it produces the messages and run for a while, but it
periodically have such exceptions:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server
is not the leader for that topic-partition.
org.apache.kafka.

Re: NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Sa Li
Yes, it is a weird hostname ;), that is what our system guys named it. How do
I measure the connections open to 10.100.98.102?

Thanks

AL
On Jan 7, 2015 9:42 PM, "Jaikiran Pai"  wrote:

> On Thursday 08 January 2015 01:51 AM, Sa Li wrote:
>
>> see this type of error again, back to normal in few secs
>>
>> [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/
>> 10.100.98.102
>>
>
> That's a really weird hostname, the "harmful-jar.master". Is that really
> your hostname? You mention that this happens during performance testing.
> Have you taken a note of how many connection are open to that 10.100.98.102
> IP when this "Connection refused" exception happens?
>
> -Jaikiran
>
>
>(org.apache.kafka.common.network.Selector)
>> java.net.ConnectException: Connection refused
>>  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>  at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>  at org.apache.kafka.common.network.Selector.poll(
>> Selector.java:232)
>>  at
>> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>>  at java.lang.Thread.run(Thread.java:745)
>> [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/
>> 10.100.98.102 (org.apache.kafka.common.network.Selector)
>> java.net.ConnectException: Connection refused
>>  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>  at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>  at org.apache.kafka.common.network.Selector.poll(
>> Selector.java:232)
>>  at
>> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>>  at java.lang.Thread.run(Thread.java:745)
>> [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/
>> 10.100.98.102 (org.apache.kafka.common.network.Selector)
>> java.net.ConnectException: Connection refused
>>  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>  at
>> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>  at org.apache.kafka.common.network.Selector.poll(
>> Selector.java:232)
>>  at
>> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
>>  at
>> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
>>  at java.lang.Thread.run(Thread.java:745)
>> 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg
>> latency, 2418.0 max latency.
>> 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg
>> latency, 3529.0 max latency.
>> 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg
>> latency, 3858.0 max latency.
>>
>> On Wed, Jan 7, 2015 at 12:07 PM, Sa Li  wrote:
>>
>>  Hi, All
>>>
>>> I am doing performance test by
>>>
>>> bin/kafka-run-class.sh org.apache.kafka.clients.
>>> tools.ProducerPerformance
>>> test-rep-three 5 100 -1 acks=1 bootstrap.servers=
>>> 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
>>> buffer.memory=67108864 batch.size=8196
>>>
>>> where the topic test-rep-three is described as follow:
>>>
>>> bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic
>>> test-rep-three
>>> Topic:test-rep-threePartitionCount:8ReplicationFactor:3
>>> Configs:
>>>  Topic: test-rep-three   Partition: 0Leader: 100
>>>  Replicas:
>>> 100,102,101   Isr: 102,101,100
>>>  Topic: test-rep-three   Partition: 1Leader: 101
>>>  Replicas:
>>> 101,100,102   Isr: 102,101,100
>>>  Topic: test-rep-three   Partition: 2Leader: 102
>>>  Replicas:
>>> 102,101,100   Isr: 101,102,100
>>>  Topic: test-rep-three   Partition: 3Leader: 100
>>>  Replicas:
>>> 100,101,102   Isr: 101,100,102
>>>  Topic: test-rep-three   Partition: 4Leader: 101
>>>  Replicas:
>>> 101,102,100   Isr: 102,100,101
>>>  Topic: test-rep-three   Partition: 5Leader: 102
>>>  Replicas:
>>> 102,100,101   Isr: 100,102,101
>>>  Topic: test-rep-three   Partition: 6Leader: 102
>>>  Replicas:
>>> 100,102,101   Isr: 102,101,100
>>>  Topic: test-rep-three   Partition: 7Leader: 101
>>>  Replicas:
>>> 101,100,102   Isr: 101,100,102
>>>
>>> Apparently, it produces the messages and run for a while, but it
>>> periodically have such exceptions:
>>>
>>> org.apache.kafka.common.errors.NotLeaderForPartitionException: This
>>> server
>>> is not the leader for that topic-partition.
>>> org.apache.kafka.common.errors.NotLeaderForPar

Re: NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Jaikiran Pai

On Thursday 08 January 2015 01:51 AM, Sa Li wrote:

see this type of error again, back to normal in few secs

[2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/
10.100.98.102


That's a really weird hostname, "harmful-jar.master". Is that really
your hostname? You mention that this happens during performance testing.
Have you taken note of how many connections are open to that
10.100.98.102 IP when this "Connection refused" exception happens?


-Jaikiran



  (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
 at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
 at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
 at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
 at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:745)
160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg
latency, 2418.0 max latency.
109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg
latency, 3529.0 max latency.
100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg
latency, 3858.0 max latency.

On Wed, Jan 7, 2015 at 12:07 PM, Sa Li  wrote:


Hi, All

I am doing performance test by

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
test-rep-three 5 100 -1 acks=1 bootstrap.servers=
10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
buffer.memory=67108864 batch.size=8196

where the topic test-rep-three is described as follow:

bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic
test-rep-three
Topic:test-rep-threePartitionCount:8ReplicationFactor:3
Configs:
 Topic: test-rep-three   Partition: 0Leader: 100 Replicas:
100,102,101   Isr: 102,101,100
 Topic: test-rep-three   Partition: 1Leader: 101 Replicas:
101,100,102   Isr: 102,101,100
 Topic: test-rep-three   Partition: 2Leader: 102 Replicas:
102,101,100   Isr: 101,102,100
 Topic: test-rep-three   Partition: 3Leader: 100 Replicas:
100,101,102   Isr: 101,100,102
 Topic: test-rep-three   Partition: 4Leader: 101 Replicas:
101,102,100   Isr: 102,100,101
 Topic: test-rep-three   Partition: 5Leader: 102 Replicas:
102,100,101   Isr: 100,102,101
 Topic: test-rep-three   Partition: 6Leader: 102 Replicas:
100,102,101   Isr: 102,101,100
 Topic: test-rep-three   Partition: 7Leader: 101 Replicas:
101,100,102   Isr: 101,100,102

Apparently, it produces the messages and run for a while, but it
periodically have such exceptions:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partit

Re: NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Sa Li
I checked the topic config; the ISR changes dynamically.

root@voluminous-mass:/srv/kafka# bin/kafka-topics.sh --describe --zookeeper
10.100.98.101:2181 --topic test-rep-three
Topic:test-rep-three    PartitionCount:8    ReplicationFactor:3    Configs:
    Topic: test-rep-three   Partition: 0    Leader: 100    Replicas: 100,102,101    Isr: 100
    Topic: test-rep-three   Partition: 1    Leader: 100    Replicas: 101,100,102    Isr: 100,101,102
    Topic: test-rep-three   Partition: 2    Leader: 102    Replicas: 102,101,100    Isr: 101,102
    Topic: test-rep-three   Partition: 3    Leader: 100    Replicas: 100,101,102    Isr: 100
    Topic: test-rep-three   Partition: 4    Leader: 100    Replicas: 101,102,100    Isr: 100
    Topic: test-rep-three   Partition: 5    Leader: 102    Replicas: 102,100,101    Isr: 100,102,101
    Topic: test-rep-three   Partition: 6    Leader: 100    Replicas: 100,102,101    Isr: 100,102,101
    Topic: test-rep-three   Partition: 7    Leader: 100    Replicas: 101,100,102    Isr: 100
root@voluminous-mass:/srv/kafka# bin/kafka-topics.sh --describe --zookeeper
10.100.98.101:2181 --topic test-rep-three
Topic:test-rep-three    PartitionCount:8    ReplicationFactor:3    Configs:
    Topic: test-rep-three   Partition: 0    Leader: 100    Replicas: 100,102,101    Isr: 102,100,101
    Topic: test-rep-three   Partition: 1    Leader: 101    Replicas: 101,100,102    Isr: 101,102,100
    Topic: test-rep-three   Partition: 2    Leader: 102    Replicas: 102,101,100    Isr: 101,102
    Topic: test-rep-three   Partition: 3    Leader: 100    Replicas: 100,101,102    Isr: 101,100,102
    Topic: test-rep-three   Partition: 4    Leader: 101    Replicas: 101,102,100    Isr: 101,102,100
    Topic: test-rep-three   Partition: 5    Leader: 102    Replicas: 102,100,101    Isr: 102,101,100
    Topic: test-rep-three   Partition: 6    Leader: 102    Replicas: 100,102,101    Isr: 102,101
    Topic: test-rep-three   Partition: 7    Leader: 101    Replicas: 101,100,102    Isr: 101,100,102

Why does that happen?
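
As a side note, when leadership piles up on one broker like this, leadership
can usually be moved back to the preferred replica (the first broker in each
Replicas list) once the ISRs have caught up. A sketch, reusing the ZooKeeper
address from the commands above:

  # ask the controller to move every partition back to its preferred replica
  bin/kafka-preferred-replica-election.sh --zookeeper 10.100.98.101:2181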

thanks


On Wed, Jan 7, 2015 at 12:21 PM, Sa Li  wrote:

> see this type of error again, back to normal in few secs
>
> [2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/
> 10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
> at
> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:745)
> [2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/
> 10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
> at
> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:745)
> [2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/
> 10.100.98.102 (org.apache.kafka.common.network.Selector)
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
> at
> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
> at
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:745)
> 160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg
> latency, 2418.0 max latency.
> 109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg
> latency, 3529.0 max latency.
> 100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg
> latency, 3858.0 max latency.
>
> On Wed, Jan 7, 2015 at 12:07 PM, Sa Li  wrote:
>
>> Hi, All
>>
>> I am doing performance test by
>>
>> bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
>> test-rep-three 5 100 -1 acks=1 bootstrap.servers=
>> 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
>> buffer.memor

Re: NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Sa Li
I see this type of error again; it goes back to normal in a few seconds:

[2015-01-07 20:19:49,744] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,754] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:745)
[2015-01-07 20:19:49,764] WARN Error in I/O with harmful-jar.master/
10.100.98.102 (org.apache.kafka.common.network.Selector)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.kafka.common.network.Selector.poll(Selector.java:232)
at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:745)
160403 records sent, 32080.6 records/sec (91.78 MB/sec), 507.0 ms avg
latency, 2418.0 max latency.
109882 records sent, 21976.4 records/sec (62.87 MB/sec), 672.7 ms avg
latency, 3529.0 max latency.
100315 records sent, 19995.0 records/sec (57.21 MB/sec), 774.8 ms avg
latency, 3858.0 max latency.

On Wed, Jan 7, 2015 at 12:07 PM, Sa Li  wrote:

> Hi, All
>
> I am doing performance test by
>
> bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
> test-rep-three 5 100 -1 acks=1 bootstrap.servers=
> 10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092
> buffer.memory=67108864 batch.size=8196
>
> where the topic test-rep-three is described as follow:
>
> bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic
> test-rep-three
> Topic:test-rep-threePartitionCount:8ReplicationFactor:3
> Configs:
> Topic: test-rep-three   Partition: 0Leader: 100 Replicas:
> 100,102,101   Isr: 102,101,100
> Topic: test-rep-three   Partition: 1Leader: 101 Replicas:
> 101,100,102   Isr: 102,101,100
> Topic: test-rep-three   Partition: 2Leader: 102 Replicas:
> 102,101,100   Isr: 101,102,100
> Topic: test-rep-three   Partition: 3Leader: 100 Replicas:
> 100,101,102   Isr: 101,100,102
> Topic: test-rep-three   Partition: 4Leader: 101 Replicas:
> 101,102,100   Isr: 102,100,101
> Topic: test-rep-three   Partition: 5Leader: 102 Replicas:
> 102,100,101   Isr: 100,102,101
> Topic: test-rep-three   Partition: 6Leader: 102 Replicas:
> 100,102,101   Isr: 102,101,100
> Topic: test-rep-three   Partition: 7Leader: 101 Replicas:
> 101,100,102   Isr: 101,100,102
>
> Apparently, it produces the messages and run for a while, but it
> periodically have such exceptions:
>
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
> is not the leader for that topic-partition.
> org.apache.kafka.common.errors.NotLeade

NotLeaderForPartitionException while doing performance test

2015-01-07 Thread Sa Li
Hi, All

I am doing a performance test by running:

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance
test-rep-three 5 100 -1 acks=1 bootstrap.servers=10.100.98.100:9092,
10.100.98.101:9092,10.100.98.102:9092 buffer.memory=67108864 batch.size=8196

where the topic test-rep-three is described as follow:

bin/kafka-topics.sh --describe --zookeeper 10.100.98.101:2181 --topic
test-rep-three
Topic:test-rep-three    PartitionCount:8    ReplicationFactor:3    Configs:
    Topic: test-rep-three   Partition: 0    Leader: 100    Replicas: 100,102,101    Isr: 102,101,100
    Topic: test-rep-three   Partition: 1    Leader: 101    Replicas: 101,100,102    Isr: 102,101,100
    Topic: test-rep-three   Partition: 2    Leader: 102    Replicas: 102,101,100    Isr: 101,102,100
    Topic: test-rep-three   Partition: 3    Leader: 100    Replicas: 100,101,102    Isr: 101,100,102
    Topic: test-rep-three   Partition: 4    Leader: 101    Replicas: 101,102,100    Isr: 102,100,101
    Topic: test-rep-three   Partition: 5    Leader: 102    Replicas: 102,100,101    Isr: 100,102,101
    Topic: test-rep-three   Partition: 6    Leader: 102    Replicas: 100,102,101    Isr: 102,101,100
    Topic: test-rep-three   Partition: 7    Leader: 101    Replicas: 101,100,102    Isr: 101,100,102

Apparently, it produces the messages and runs for a while, but it
periodically hits such exceptions:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
141292 records sent, 28258.4 records/sec (80.85 MB/sec), 551.2 ms avg
latency, 1494.0 max latency.
142526 records sent, 28505.2 records/sec (81.55 MB/sec), 580.8 ms avg
latency, 1513.0 max latency.
146564 records sent, 29312.8 records/sec (83.86 MB/sec), 557.9 ms avg
latency, 1431.0 max latency.
146755 records sent, 29351.0 records/sec (83.97 MB/sec), 556.7 ms avg
latency, 1480.0 max latency.
147963 records sent, 29592.6 records/sec (84.67 MB/sec), 556.7 ms avg
latency, 1546.0 max latency.
146931 records sent, 29386.2 records/sec (84.07 MB/sec), 550.9 ms avg
latency, 1715.0 max latency.
146947 records sent, 29389.4 records/sec (84.08 MB/sec), 555.1 ms avg
latency, 1750.0 max latency.
146422 records sent, 29284.4 records/sec (83.78 MB/sec), 557.9 ms avg
latency, 1818.0 max latency.
147516 records sent, 29503.2 records/sec (84.41 MB/sec), 555.6 ms avg
latency, 1806.0 max latency.
147877 records sent, 29575.4 records/sec (84.62 MB/sec), 552.1 ms avg
latency, 1821.0 max latency.
147201 records sent, 29440.2 records/sec (84.23 MB/sec), 554.5 ms avg
latency, 1826.0 max latency.
148317 records sent, 29663.4 records/sec (84.87 MB/sec), 558.1 ms avg
latency, 1792.0 max latency.
147756 records sent, 29551.2 records/sec (84.55 MB/sec), 550.9 ms avg
latency, 1806.0 max latency

then it goes back into the correct processing state. Is that because of a rebalance?
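
One detail that may matter here: the command above sets acks=1 but no
retries, so every transient leader move is surfaced to the caller as
NotLeaderForPartitionException instead of being retried. A sketch of the
same run with client-side retries, assuming the retries and retry.backoff.ms
producer settings are supported by this client version:

  bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance \
    test-rep-three 5 100 -1 acks=1 \
    bootstrap.servers=10.100.98.100:9092,10.100.98.101:9092,10.100.98.102:9092 \
    buffer.memory=67108864 batch.size=8196 retries=3 retry.backoff.ms=500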

thanks



-- 

Alec Li


Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-15 Thread Abraham Jacob
Will definitely take a thread dump! So far, it's been running fine.
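
In case it is useful to others, this is roughly how I plan to capture it (a
sketch; it assumes a single Kafka broker process on the box):

  # find the broker's JVM pid and dump all thread stacks to a file
  KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
  jstack $KAFKA_PID > /tmp/kafka-threaddump-$(date +%s).txt

  # kill -3 (SIGQUIT) also works; the dump goes to the broker's stdout log
  kill -3 $KAFKA_PID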

-Jacob

On Wed, Oct 15, 2014 at 8:40 PM, Jun Rao  wrote:

> If you see the hanging again, it would be great if you can take a thread
> dump so that we know where it is hanging.
>
> Thanks,
>
> Jun
>
> On Tue, Oct 14, 2014 at 10:35 PM, Abraham Jacob 
> wrote:
>
> > Hi Jun,
> >
> > Thanks for responding...
> >
> > I am using Kafka 2.9.2-0.8.1.1
> >
> > I looked through the controller logs on a couple of nodes and did not
> find
> > any exceptions or error.
> >
> > However in the state change log I see a bunch of the following
> exceptions -
> >
> > [2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader
> > election for partition [wordcount,1] (state.change.logger)
> > [2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state
> > change for partition [wordcount,1] from OfflinePartition to
> OnlinePartition
> > failed (state.change.logger)
> > kafka.common.NoReplicaOnlineException: No replica for partition
> > [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are:
> > [List(8, 7, 1)]
> > at
> >
> >
> kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
> > at
> >
> >
> kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
> > at
> >
> >
> kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
> > at
> >
> >
> kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
> > at
> >
> >
> kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
> > at
> >
> >
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
> > at
> >
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> > at
> >
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> > at
> > scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> > at
> >
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> > at
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> > at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> > at
> >
> >
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
> > at
> >
> >
> kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
> > at
> >
> >
> kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
> > at
> >
> >
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
> > at
> >
> >
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
> > at
> >
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
> > at
> >
> >
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123)
> > at
> >
> >
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> > at
> >
> >
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> > at kafka.utils.Utils$.inLock(Utils.scala:538)
> > at
> >
> >
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118)
> > at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
> > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> >
> >
> > Anyways, this morning after sending out the email, I set out to restart
> all
> > the brokers. I found that 3 brokers were in a hung state. I tried to use
> > the bin/kafka-server-stop.sh script (which is nothing but sending a
> SIGINT
> > signal), the java process running kafka would not terminate, I then
> issued
> > a 'kill -SIGTERM x' for the java process running Kafka, yet the
> process
> > would not terminate. This happened only on 3 nodes (1 node is running
> only
> > 1 broker). For the other nodes kafka-server-stop.sh successfully bought
> > down the java process running Kafka.
> >
> > For the three brokers that was not responding to either SIGINT and
> SIGTERM
> > signal I issued a SIGKILL instead and this, for sure brought down the
> > process.
> >
> > I then restarted brokers on all nodes. After that I again ran the
> describe
> > topic script.
> >
> > bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.
> > ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount
> >

Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-15 Thread Jun Rao
If you see the hanging again, it would be great if you can take a thread
dump so that we know where it is hanging.

Thanks,

Jun

On Tue, Oct 14, 2014 at 10:35 PM, Abraham Jacob 
wrote:

> Hi Jun,
>
> Thanks for responding...
>
> I am using Kafka 2.9.2-0.8.1.1
>
> I looked through the controller logs on a couple of nodes and did not find
> any exceptions or error.
>
> However in the state change log I see a bunch of the following exceptions -
>
> [2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader
> election for partition [wordcount,1] (state.change.logger)
> [2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state
> change for partition [wordcount,1] from OfflinePartition to OnlinePartition
> failed (state.change.logger)
> kafka.common.NoReplicaOnlineException: No replica for partition
> [wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are:
> [List(8, 7, 1)]
> at
>
> kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
> at
>
> kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
> at
>
> kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
> at
>
> kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
> at
>
> kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
> at
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at
> scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> at
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
> at
>
> kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
> at
>
> kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
> at
>
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
> at
>
> kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
> at
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
> at
>
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123)
> at
>
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> at
>
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
> at kafka.utils.Utils$.inLock(Utils.scala:538)
> at
>
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118)
> at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>
>
> Anyways, this morning after sending out the email, I set out to restart all
> the brokers. I found that 3 brokers were in a hung state. I tried to use
> the bin/kafka-server-stop.sh script (which is nothing but sending a SIGINT
> signal), the java process running kafka would not terminate, I then issued
> a 'kill -SIGTERM x' for the java process running Kafka, yet the process
> would not terminate. This happened only on 3 nodes (1 node is running only
> 1 broker). For the other nodes kafka-server-stop.sh successfully bought
> down the java process running Kafka.
>
> For the three brokers that was not responding to either SIGINT and SIGTERM
> signal I issued a SIGKILL instead and this, for sure brought down the
> process.
>
> I then restarted brokers on all nodes. After that I again ran the describe
> topic script.
>
> bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.
> ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount
>
>
> Topic:wordcount PartitionCount:8ReplicationFactor:3 Configs:
> Topic: wordcountPartition: 0Leader: 7   Replicas:
> 7,6,8 Isr: 6,7,8
> Topic: wordcountPartition: 1Leader: 8   Replicas:
> 8,7,1 Isr: 1,7,8
> Topic: wordcountPartition: 2Leader: 1   Replicas:
> 1,8,2 Isr: 1,2,8
> Topic: wordcountPartiti

Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-14 Thread Abraham Jacob
Hi Jun,

Thanks for responding...

I am using Kafka 2.9.2-0.8.1.1

I looked through the controller logs on a couple of nodes and did not find
any exceptions or errors.

However, in the state change log I see a bunch of the following exceptions:

[2014-10-13 14:39:12,475] TRACE Controller 3 epoch 116 started leader
election for partition [wordcount,1] (state.change.logger)
[2014-10-13 14:39:12,479] ERROR Controller 3 epoch 116 initiated state
change for partition [wordcount,1] from OfflinePartition to OnlinePartition
failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition
[wordcount,1] is alive. Live brokers are: [Set()], Assigned replicas are:
[List(8, 7, 1)]
at
kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
at
kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
at
kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
at
kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
at
kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at
scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:742)
at
kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:96)
at
kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:68)
at
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:312)
at
kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:162)
at
kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:63)
at
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:123)
at
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
at
kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:118)
at kafka.utils.Utils$.inLock(Utils.scala:538)
at
kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:118)
at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)


Anyway, this morning after sending out the email, I set out to restart all
the brokers. I found that 3 brokers were in a hung state. I tried to use
the bin/kafka-server-stop.sh script (which does nothing but send a SIGINT
signal), but the Java process running Kafka would not terminate. I then
issued a 'kill -SIGTERM x' for the Java process running Kafka, yet the
process still would not terminate. This happened only on 3 nodes (1 node is
running only 1 broker). For the other nodes, kafka-server-stop.sh
successfully brought down the Java process running Kafka.

For the three brokers that were not responding to either the SIGINT or
SIGTERM signal, I issued a SIGKILL instead, and this, for sure, brought down
the process.

I then restarted the brokers on all nodes. After that, I again ran the
describe-topic script.

bin/kafka-topics.sh --describe --zookeeper tr-pan-hclstr-08.amers1b.
ciscloud:2181/kafka/kafka-clstr-01 --topic wordcount


Topic:wordcount    PartitionCount:8    ReplicationFactor:3    Configs:
    Topic: wordcount    Partition: 0    Leader: 7    Replicas: 7,6,8    Isr: 6,7,8
    Topic: wordcount    Partition: 1    Leader: 8    Replicas: 8,7,1    Isr: 1,7,8
    Topic: wordcount    Partition: 2    Leader: 1    Replicas: 1,8,2    Isr: 1,2,8
    Topic: wordcount    Partition: 3    Leader: 2    Replicas: 2,1,3    Isr: 1,2,3
    Topic: wordcount    Partition: 4    Leader: 3    Replicas: 3,2,4    Isr: 2,3,4
    Topic: wordcount    Partition: 5    Leader: 4    Replicas: 4,3,5    Isr: 3,4,5
    Topic: wordcount    Partition: 6    Leader: 5    Replicas: 5,4,6    Isr: 4,5,6
    Topic: wordcount    Partition: 7    Leader: 6    Replic

Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-14 Thread Jun Rao
Also, which version of Kafka are you using?

Thanks,

Jun

On Tue, Oct 14, 2014 at 5:31 PM, Jun Rao  wrote:

> The following is a bit weird. It indicates no leader for partition 4,
> which is inconsistent with what describe-topic shows.
>
> 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
> Error while fetching metadata   partition 4 leader: nonereplicas: 3
> (tr-pan-hclstr-13.amers1b.ciscloud:9092),2
> (tr-pan-hclstr-12.amers1b.ciscloud:9092),4
> (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated:
> true for topic partition [wordcount,4]: [class
> kafka.common.LeaderNotAvailableException]
>
> Any error in the controller and the state-change log? Do you see broker 3
> marked as dead in the controller log? Also, could you check if the broker
> registration in ZK (
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper)
> has the correct host/port?
>
> Thanks,
>
> Jun
>
> On Mon, Oct 13, 2014 at 5:35 PM, Abraham Jacob 
> wrote:
>
>> Hi All,
>>
>> I have a 8 node Kafka cluster (broker.id - 1..8). On this cluster I have
>> a
>> topic "wordcount", which was 8 partitions with a replication factor of 3.
>>
>> So a describe of topic wordcount
>> # bin/kafka-topics.sh --describe --zookeeper
>> tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic
>> wordcount
>>
>>
>> Topic:wordcount PartitionCount:8ReplicationFactor:3 Configs:
>> Topic: wordcountPartition: 0Leader: 6   Replicas: 7,6,8
>> Isr: 6,7,8
>> Topic: wordcountPartition: 1Leader: 7   Replicas: 8,7,1
>> Isr: 7
>> Topic: wordcountPartition: 2Leader: 8   Replicas: 1,8,2
>> Isr: 8
>> Topic: wordcountPartition: 3Leader: 3   Replicas: 2,1,3
>> Isr: 3
>> Topic: wordcountPartition: 4Leader: 3   Replicas: 3,2,4
>> Isr: 3,2,4
>> Topic: wordcountPartition: 5Leader: 3   Replicas: 4,3,5
>> Isr: 3,5
>> Topic: wordcountPartition: 6Leader: 6   Replicas: 5,4,6
>> Isr: 6,5
>> Topic: wordcountPartition: 7Leader: 6   Replicas: 6,5,7
>> Isr: 6,5,7
>>
>> I wrote a simple producer to write to this topic. However when running I
>> get these messages in the logs -
>>
>> 2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching
>> metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092
>> with correlation id 0 for 1 topic(s) Set(wordcount)
>> 2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected
>> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
>> 2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer:
>> Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092
>> 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
>> Error while fetching metadata   partition 4 leader: nonereplicas:
>> 3
>> (tr-pan-hclstr-13.amers1b.ciscloud:9092),2
>> (tr-pan-hclstr-12.amers1b.ciscloud:9092),4
>> (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated:
>> true for topic partition [wordcount,4]: [class
>> kafka.common.LeaderNotAvailableException]
>> 2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected
>> to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing
>> 2014-10-13 19:02:33,543 WARN [main]
>> kafka.producer.async.DefaultEventHandler: Produce request with correlation
>> id 20611 failed due to [wordcount,5]:
>> kafka.common.NotLeaderForPartitionException,[wordcount,6]:
>> kafka.common.NotLeaderForPartitionException,[wordcount,7]:
>> kafka.common.NotLeaderForPartitionException
>> 2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected
>> to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing
>> 2014-10-13 19:02:33,725 WARN [main]
>> kafka.producer.async.DefaultEventHandler: Produce request with correlation
>> id 20612 failed due to [wordcount,0]:
>> kafka.common.NotLeaderForPartitionException
>> 2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected
>> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
>> 2014-10-13 19:02:33,983 WARN [main]
>> kafka.producer.async.DefaultEventHandler: Failed to send data since
>> partitions [wordcount,4] don't have a leader
>>
>>
>> Obviously something is terribly wrong... I am quite new to Kafka, hence
>> these messages don't make any sense to me, except for the fact that it is
>> telling me that some of the partitions don't have any leader.
>>
>> Could somebody be kind enough to explain the above message?
>>
>> A few more questions -
>>
>> (1) How does one get into this state?
>> (2) How can I get out of this state?
>> (3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't
>> the partitions be balanced across all the brokers?
>> (4) I can see that the Kafka service is running on all 8 nodes. (I used
>> ps ax -o "pid pgid args" and I can see the Kafka Java process.)
>> (5) Is there a way I can force a re-balance?
>>
>>

Re: Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-14 Thread Jun Rao
The following is a bit weird. It indicates no leader for partition 4, which
is inconsistent with what describe-topic shows.

2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
Error while fetching metadata   partition 4 leader: nonereplicas: 3
(tr-pan-hclstr-13.amers1b.ciscloud:9092),2
(tr-pan-hclstr-12.amers1b.ciscloud:9092),4
(tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated:
true for topic partition [wordcount,4]: [class
kafka.common.LeaderNotAvailableException]

Any error in the controller and the state-change log? Do you see broker 3
marked as dead in the controller log? Also, could you check if the broker
registration in ZK (
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper)
has the correct host/port?
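
For example, something along these lines should show the registered host/port
for broker 3 (this assumes the stock zkCli.sh that ships with ZooKeeper and the
/kafka/kafka-clstr-01 chroot from your describe command, so adjust the paths to
your setup):

# bin/zkCli.sh -server tr-pan-hclstr-08.amers1b.ciscloud:2181
get /kafka/kafka-clstr-01/brokers/ids/3

And for the logs, assuming the default log4j file names, a quick scan like this
on the controller broker should show whether broker 3 was marked dead:

# grep -i "broker 3" logs/controller.log logs/state-change.log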

Thanks,

Jun

On Mon, Oct 13, 2014 at 5:35 PM, Abraham Jacob  wrote:

> Hi All,
>
> I have an 8 node Kafka cluster (broker.id - 1..8). On this cluster I have a
> topic "wordcount", which has 8 partitions with a replication factor of 3.
>
> So a describe of topic wordcount
> # bin/kafka-topics.sh --describe --zookeeper
> tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic
> wordcount
>
>
> Topic:wordcount  PartitionCount:8  ReplicationFactor:3  Configs:
>     Topic: wordcount  Partition: 0  Leader: 6  Replicas: 7,6,8  Isr: 6,7,8
>     Topic: wordcount  Partition: 1  Leader: 7  Replicas: 8,7,1  Isr: 7
>     Topic: wordcount  Partition: 2  Leader: 8  Replicas: 1,8,2  Isr: 8
>     Topic: wordcount  Partition: 3  Leader: 3  Replicas: 2,1,3  Isr: 3
>     Topic: wordcount  Partition: 4  Leader: 3  Replicas: 3,2,4  Isr: 3,2,4
>     Topic: wordcount  Partition: 5  Leader: 3  Replicas: 4,3,5  Isr: 3,5
>     Topic: wordcount  Partition: 6  Leader: 6  Replicas: 5,4,6  Isr: 6,5
>     Topic: wordcount  Partition: 7  Leader: 6  Replicas: 6,5,7  Isr: 6,5,7
>
> I wrote a simple producer to write to this topic. However when running I
> get these messages in the logs -
>
> 2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching
> metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092
> with correlation id 0 for 1 topic(s) Set(wordcount)
> 2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected
> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
> 2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer:
> Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092
> 2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
> Error while fetching metadata   partition 4 leader: nonereplicas: 3
> (tr-pan-hclstr-13.amers1b.ciscloud:9092),2
> (tr-pan-hclstr-12.amers1b.ciscloud:9092),4
> (tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated:
> true for topic partition [wordcount,4]: [class
> kafka.common.LeaderNotAvailableException]
> 2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected
> to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing
> 2014-10-13 19:02:33,543 WARN [main]
> kafka.producer.async.DefaultEventHandler: Produce request with correlation
> id 20611 failed due to [wordcount,5]:
> kafka.common.NotLeaderForPartitionException,[wordcount,6]:
> kafka.common.NotLeaderForPartitionException,[wordcount,7]:
> kafka.common.NotLeaderForPartitionException
> 2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected
> to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing
> 2014-10-13 19:02:33,725 WARN [main]
> kafka.producer.async.DefaultEventHandler: Produce request with correlation
> id 20612 failed due to [wordcount,0]:
> kafka.common.NotLeaderForPartitionException
> 2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected
> to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
> 2014-10-13 19:02:33,983 WARN [main]
> kafka.producer.async.DefaultEventHandler: Failed to send data since
> partitions [wordcount,4] don't have a leader
>
>
> Obviously something is terribly wrong... I am quite new to Kafka, hence
> these messages don't make any sense to me, except for the fact that it is
> telling me that some of the partitions don't have any leader.
>
> Could somebody be kind enough to explain the above message?
>
> A few more questions -
>
> (1) How does one get into this state?
> (2) How can I get out of this state?
> (3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't
> the partitions be balanced across all the brokers?
> (4) I can see that the Kafka service is running on all 8 nodes. (I used
> ps ax -o "pid pgid args" and I can see the Kafka Java process.)
> (5) Is there a way I can force a re-balance?
>
>
>
> Regards,
> Jacob
>
>
>
> --
> ~
>


Kafka - NotLeaderForPartitionException / LeaderNotAvailableException

2014-10-13 Thread Abraham Jacob
Hi All,

I have an 8 node Kafka cluster (broker.id - 1..8). On this cluster I have a
topic "wordcount", which has 8 partitions with a replication factor of 3.

So a describe of topic wordcount
# bin/kafka-topics.sh --describe --zookeeper
tr-pan-hclstr-08.amers1b.ciscloud:2181/kafka/kafka-clstr-01 --topic
wordcount


Topic:wordcount  PartitionCount:8  ReplicationFactor:3  Configs:
    Topic: wordcount  Partition: 0  Leader: 6  Replicas: 7,6,8  Isr: 6,7,8
    Topic: wordcount  Partition: 1  Leader: 7  Replicas: 8,7,1  Isr: 7
    Topic: wordcount  Partition: 2  Leader: 8  Replicas: 1,8,2  Isr: 8
    Topic: wordcount  Partition: 3  Leader: 3  Replicas: 2,1,3  Isr: 3
    Topic: wordcount  Partition: 4  Leader: 3  Replicas: 3,2,4  Isr: 3,2,4
    Topic: wordcount  Partition: 5  Leader: 3  Replicas: 4,3,5  Isr: 3,5
    Topic: wordcount  Partition: 6  Leader: 6  Replicas: 5,4,6  Isr: 6,5
    Topic: wordcount  Partition: 7  Leader: 6  Replicas: 6,5,7  Isr: 6,5,7

I wrote a simple producer to write to this topic. However when running I
get these messages in the logs -

2014-10-13 19:02:32,459 INFO [main] kafka.client.ClientUtils$: Fetching
metadata from broker id:0,host:tr-pan-hclstr-11.amers1b.ciscloud,port:9092
with correlation id 0 for 1 topic(s) Set(wordcount)
2014-10-13 19:02:32,464 INFO [main] kafka.producer.SyncProducer: Connected
to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:32,551 INFO [main] kafka.producer.SyncProducer:
Disconnecting from tr-pan-hclstr-11.amers1b.ciscloud:9092
2014-10-13 19:02:32,611 WARN [main] kafka.producer.BrokerPartitionInfo:
Error while fetching metadata   partition 4 leader: nonereplicas: 3
(tr-pan-hclstr-13.amers1b.ciscloud:9092),2
(tr-pan-hclstr-12.amers1b.ciscloud:9092),4
(tr-pan-hclstr-14.amers1b.ciscloud:9092) isr:isUnderReplicated:
true for topic partition [wordcount,4]: [class
kafka.common.LeaderNotAvailableException]
2014-10-13 19:02:33,505 INFO [main] kafka.producer.SyncProducer: Connected
to tr-pan-hclstr-15.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,543 WARN [main]
kafka.producer.async.DefaultEventHandler: Produce request with correlation
id 20611 failed due to [wordcount,5]:
kafka.common.NotLeaderForPartitionException,[wordcount,6]:
kafka.common.NotLeaderForPartitionException,[wordcount,7]:
kafka.common.NotLeaderForPartitionException
2014-10-13 19:02:33,694 INFO [main] kafka.producer.SyncProducer: Connected
to tr-pan-hclstr-18.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,725 WARN [main]
kafka.producer.async.DefaultEventHandler: Produce request with correlation
id 20612 failed due to [wordcount,0]:
kafka.common.NotLeaderForPartitionException
2014-10-13 19:02:33,861 INFO [main] kafka.producer.SyncProducer: Connected
to tr-pan-hclstr-11.amers1b.ciscloud:9092 for producing
2014-10-13 19:02:33,983 WARN [main]
kafka.producer.async.DefaultEventHandler: Failed to send data since
partitions [wordcount,4] don't have a leader


Obviously something is terribly wrong... I am quite new to Kafka, so
these messages don't make much sense to me, except for the fact that they are
telling me that some of the partitions don't have any leader.

Could somebody be kind enough to explain the above message?

A few more questions -

(1) How does one get into this state?
(2) How can I get out of this state?
(3) I have set auto.leader.rebalance.enable=true on all brokers. Shouldn't
the partitions be balanced across all the brokers?
(4) I can see that the Kafka service is running on all 8 nodes. (I used
ps ax -o "pid pgid args" and I can see the Kafka Java process.)
(5) Is there a way I can force a re-balance?



Regards,
Jacob



-- 
~


Re: NotLeaderForPartitionException after creating a topic

2013-11-01 Thread Neha Narkhede
The following exception seems suspicious and I think it is fixed in 0.8
HEAD. Do you mind trying 0.8 HEAD and seeing if this can still be reproduced?

java.util.NoSuchElementException: key not found: /disk1/test/kafka-data
 at scala.collection.MapLike$class.default(MapLike.scala:223)
 at scala.collection.immutable.Map$Map1.default(Map.scala:93)
 at scala.collection.MapLike$class.apply(MapLike.scala:134)
 at scala.collection.immutable.Map$Map1.apply(Map.scala:93)
 at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:83)
 at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
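
If it helps, building 0.8 HEAD from source is roughly this (I am assuming the
Apache git mirror and the sbt build that the 0.8 branch uses, so adjust the URL
and steps to however you normally build):

git clone https://git-wip-us.apache.org/repos/asf/kafka.git
cd kafka
git checkout 0.8
./sbt update
./sbt package

and then run the brokers from that build against your existing
server.properties.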


NotLeaderForPartitionException after creating a topic

2013-11-01 Thread Henry Ma
Hello,

I am using 0.8.0-beta1, with two brokers. After
creating a topic named "ead_click" with bin/kafka-create-topic.sh,
I can see from bin/kafka-list-topic.sh that the topic seems to have been
created successfully:

[2013-11-01 15:49:35,388] INFO zookeeper state changed (SyncConnected)
(org.I0Itec.zkclient.ZkClient)
topic: ead_click partition: 0 leader: 1 replicas: 1 isr: 1
[2013-11-01 15:49:35,720] INFO Terminate ZkClient event thread.
(org.I0Itec.zkclient.ZkEventThread)
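
For completeness, the create command was of this form (I am reproducing it from
the 0.8.0-beta1 quickstart rather than my shell history, so the exact flags may
differ slightly):

bin/kafka-create-topic.sh --zookeeper <zk-host>:2181 --replica 1 --partition 1 --topic ead_click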

But the producer for the topic says it cannot find a leader. Then I started a
consumer, and an exception was thrown:

[2013-11-01 15:50:32,807] INFO
[ConsumerFetcherThread-console-consumer-50963_qt103.corp.xxx.com-1383292232223-b874ebd0-0-1],
Starting  (kafka.consumer.ConsumerFetcherThread)
[2013-11-01 15:50:32,833] ERROR
[console-consumer-50963_qt103.corp.xxx.com-1383292232223-b874ebd0-leader-finder-thread],
Error due to  (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
kafka.common.NotLeaderForPartitionException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
 at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at java.lang.Class.newInstance0(Class.java:355)
 at java.lang.Class.newInstance(Class.java:308)
 at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:70)
 at
kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:147)
 at
kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
 at
kafka.server.AbstractFetcherThread.addPartition(AbstractFetcherThread.scala:180)
 at
kafka.server.AbstractFetcherManager.addFetcher(AbstractFetcherManager.scala:80)
 at
kafka.consumer.ConsumerFetcherManager$LeaderFinderThread$$anonfun$doWork$7.apply(ConsumerFetcherManager.scala:95)
 at
kafka.consumer.ConsumerFetcherManager$LeaderFinderThread$$anonfun$doWork$7.apply(ConsumerFetcherManager.scala:92)
 at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
 at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:80)
 at scala.collection.Iterator$class.foreach(Iterator.scala:631)
 at
scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:161)
 at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:194)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:80)
 at
kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:92)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)


In state-change.log of broker 0, this exception can be found:
[2013-11-01 15:25:04,608] TRACE Broker 1 received LeaderAndIsr request
correlationId 9 from controller 1 epoch 1 starting the become-leader
transition for partition [ead_click,0] (state.change.logger)
[2013-11-01 15:25:04,635] ERROR Error on broker 1 while processing
LeaderAndIsr request correlationId 9 received from controller 1 epoch 1 for
partition (ead_click,0) (state.change.logger)
java.util.NoSuchElementException: key not found: /disk1/test/kafka-data
 at scala.collection.MapLike$class.default(MapLike.scala:223)
 at scala.collection.immutable.Map$Map1.default(Map.scala:93)
 at scala.collection.MapLike$class.apply(MapLike.scala:134)
 at scala.collection.immutable.Map$Map1.apply(Map.scala:93)
 at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:83)
 at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
 at kafka.cluster.Partition$$anonfun$1.apply(Partition.scala:149)
 at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
 at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
 at
scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
 at scala.collection.immutable.List.foreach(List.scala:45)
 at
scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
 at scala.collection.immutable.List.map(List.scala:45)
 at kafka.cluster.Partition.makeLeader(Partition.scala:149)
 at
kafka.server.ReplicaManager.kafka$server$ReplicaManager$$makeLeader(ReplicaManager.scala:257)
 at
kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$3.apply(ReplicaManager.scala:221)
 at
kafka.server.ReplicaManager$$anonfun$becomeLeaderOrFollower$3.apply(ReplicaManager.scala:213)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
 at
kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:213)
 at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:87)
 at kafka.server.KafkaApis.handle(KafkaApis.scala:70)
 at kafka.serv