[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-03-21 Thread Jason Rosenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943920#comment-13943920
 ] 

Jason Rosenberg edited comment on KAFKA-1029 at 3/22/14 5:29 AM:
-

I'm seeing this happening now (we are running 0.8).  Should this have been 
fixed?

It appears to only happen every once in a while, but when it does, it seems to 
persist and repeat indefinitely.  Sometimes restarting the consumer app can fix 
the problem, but not always.


was (Author: jbrosenb...@gmail.com):
I'm seeing this happening now (we are running 0.8).  Should this have been 
fixed?

> Zookeeper leader election stuck in ephemeral node retry loop
> 
>
> Key: KAFKA-1029
> URL: https://issues.apache.org/jira/browse/KAFKA-1029
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.0
>Reporter: Sam Meder
>Assignee: Sam Meder
>Priority: Blocker
> Fix For: 0.8.0
>
> Attachments: 
> 0002-KAFKA-1029-Use-brokerId-instead-of-leaderId-when-tri.patch
>
>
> We're seeing the following log statements (over and over):
> [2013-08-27 07:21:49,538] INFO conflict in /controller data: { "brokerid":3, 
> "timestamp":"1377587945206", "version":1 } stored data: { "brokerid":2, 
> "timestamp":"1377587460904", "version":1 } (kafka.utils.ZkUtils$)
> [2013-08-27 07:21:49,559] INFO I wrote this conflicted ephemeral node [{ 
> "brokerid":3, "timestamp":"1377587945206", "version":1 }] at /controller a 
> while back in a different session, hence I will backoff for this node to be 
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> where the broker is essentially stuck in the loop that is trying to deal with 
> left-over ephemeral nodes. The code looks a bit racy to me. In particular:
> ZookeeperLeaderElector:
>   def elect: Boolean = {
>     controllerContext.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
>     val timestamp = SystemTime.milliseconds.toString
>     val electString = ...
>     try {
>       createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, leaderId,
>         (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>         controllerContext.zkSessionTimeout)
> leaderChangeListener is registered before the create call (by the way, it 
> looks like a new registration will be added on every elect call - shouldn't it 
> register in startup()?), so it can update leaderId to the current leader before 
> the call to create. If that happens then we will continuously get node-exists 
> exceptions and the checker function will always return true, i.e. we will 
> never get out of the while(true) loop.
> I think the right fix here is to pass brokerId instead of leaderId when 
> calling create, i.e.
> createEphemeralPathExpectConflictHandleZKBug(controllerContext.zkClient, electionPath, electString, brokerId,
>   (controllerString : String, leaderId : Any) => KafkaController.parseControllerId(controllerString) == leaderId.asInstanceOf[Int],
>   controllerContext.zkSessionTimeout)
> The loop dealing with the ephemeral node bug is now only triggered for the 
> broker that owned the node previously, although I am still not 100% sure if 
> that is sufficient.
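
For illustration, a rough sketch of the retry helper discussed above (simplified and hypothetical, not the actual Kafka 0.8 source) showing why a checker that keeps returning true traps the caller in the backoff-and-retry loop:

{code}
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.exception.ZkNodeExistsException

// Illustrative sketch only: the shape of the helper implied by the quoted
// description and the "I will backoff ... and retry" log message.
object ZkRetryLoopSketch {
  def createEphemeralPathExpectConflictHandleZKBug(zkClient: ZkClient,
                                                   path: String,
                                                   data: String,
                                                   expectedCallerData: Any,
                                                   checker: (String, Any) => Boolean,
                                                   backoffTime: Int): Unit = {
    while (true) {
      try {
        zkClient.createEphemeral(path, data)
        return // node created, election (or registration) succeeded
      } catch {
        case e: ZkNodeExistsException =>
          val storedData: String = zkClient.readData[String](path)
          if (checker(storedData, expectedCallerData)) {
            // The caller believes the left-over node was written by itself in a
            // now-dead session, so it backs off until ZooKeeper expires the node
            // and then retries.  If expectedCallerData is wrong (leaderId instead
            // of brokerId), this branch is taken forever and the loop never exits.
            Thread.sleep(backoffTime)
          } else {
            throw e // the node is legitimately owned by someone else
          }
      }
    }
  }
}
{code}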



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-03-23 Thread Jason Rosenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944775#comment-13944775
 ] 

Jason Rosenberg edited comment on KAFKA-1029 at 3/24/14 6:25 AM:
-

Perhaps this should be re-opened as a separate ticket?

The issue seems to have started when we had a network outage.  Several 
high-level consumers could not communicate at all with zookeeper (or kafka) for 
several minutes.  Since the network was restored, these continual "I wrote this 
conflicted ephemeral node" log messages have been running steadily, e.g.:

2014-03-19 00:13:14,165  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - conflict in 
/consumers/myapp/ids/myapp_myhost-1394905418548-e159fc25 data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 } stored data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187967170", "version":1 }
2014-03-19 00:13:14,166  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - I wrote this conflicted ephemeral node [{ 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 }] at 
/consumers/myapp/ids/myapp_awa60.sjc1b.square-1394905418548-e159fc25 a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry

These are happening continuously.  We have tried doing a rolling restart of our 
consumer apps, and even a rolling restart of zk.  We are using zk 3.3.6, in 
this case, but will try upgrading to 3.4.5 shortly.

It seems that these ephemeral node errors always happen right after a consumer 
rebalance occurs.
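
For what it's worth, a hypothetical diagnostic sketch (not something from this ticket; the connect string and znode path are placeholders lifted from the log lines above): reading the node's ephemeralOwner with the plain ZooKeeper client shows which session still owns the left-over registration node, and whether it matches any live consumer.

{code}
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import org.apache.zookeeper.data.Stat

object CheckEphemeralOwner {
  def main(args: Array[String]): Unit = {
    // Placeholder connect string and path; substitute the real values.
    val zk = new ZooKeeper("myzkserver:2181", 30000, new Watcher {
      override def process(event: WatchedEvent): Unit = ()
    })
    val path = "/consumers/myapp/ids/myapp_myhost-1394905418548-e159fc25"
    val stat = new Stat()
    val data = new String(zk.getData(path, false, stat), "UTF-8")
    // ephemeralOwner is the id of the ZooKeeper session that created the node;
    // once that session is dead and its timeout expires, the node should vanish.
    println("data           = " + data)
    println("ephemeralOwner = 0x" + stat.getEphemeralOwner.toHexString)
    println("my session id  = 0x" + zk.getSessionId.toHexString)
    zk.close()
  }
}
{code}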


was (Author: jbrosenb...@gmail.com):
Perhaps this should be re-opened as a separate ticket?

The issue seems to have started when we had a network outage.  Several 
high-level consumers could not communicate at all with zookeeper (or kafka) for 
several minutes.  Since the network was restored, these continual "I wrote this 
conflicted ephemeral node" log messages have been running steadily, e.g.:

2014-03-19 00:13:14,165  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - conflict in 
/consumers/myapp/ids/myapp_myhost-1394905418548-e159fc25 data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 } stored data: { 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187967170", "version":1 }
2014-03-19 00:13:14,166  INFO [ZkClient-EventThread-51-myzkserver] 
utils.ZkUtils$ - I wrote this conflicted ephemeral node [{ 
"pattern":"white_list", "subscription":{ "(^\\Qmy.event\\E(\\.[\\w-]+)*$)" : 1 
}, "timestamp":"1395187970147", "version":1 }] at 
/consumers/myapp/ids/myapp_awa60.sjc1b.square-1394905418548-e159fc25 a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry

These are happening continuously.  We have tried doing a rolling restart of our 
consumer apps, and even a rolling restart of zk.  We are using zk 3.3.6, in 
this case, but will try upgrading to 3.4.5 shortly.

It seems that there are repeated consumer rebalances, as well.

> Zookeeper leader election stuck in ephemeral node retry loop
> 
>
> Key: KAFKA-1029
> URL: https://issues.apache.org/jira/browse/KAFKA-1029
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.0
>Reporter: Sam Meder
>Assignee: Sam Meder
>Priority: Blocker
> Fix For: 0.8.0
>
> Attachments: 
> 0002-KAFKA-1029-Use-brokerId-instead-of-leaderId-when-tri.patch
>
>
> We're seeing the following log statements (over and over):
> [2013-08-27 07:21:49,538] INFO conflict in /controller data: { "brokerid":3, 
> "timestamp":"1377587945206", "version":1 } stored data: { "brokerid":2, 
> "timestamp":"1377587460904", "version":1 } (kafka.utils.ZkUtils$)
> [2013-08-27 07:21:49,559] INFO I wrote this conflicted ephemeral node [{ 
> "brokerid":3, "timestamp":"1377587945206", "version":1 }] at /controller a 
> while back in a different session, hence I will backoff for this node to be 
> deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
> where the broker is essentially stuck in the loop that is trying to deal with 
> left-over ephemeral nodes. The code looks a bit racy to me. In particular:
> ZookeeperLeaderElector:
>   def elect: Boolean = {
>     controllerContext.zkClient.subscribeDataChanges(electionPath, leaderChangeListener)
>     val timestamp = SystemTime.milliseconds.toString
>     val electString = ...
>     try {
>       createEphemeralPathExpectConflictHand

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 12:59 PM:
---

Interestingly I ran into the very same issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflicted ephemeral node".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to UpdateM

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:02 PM:
--

Interestingly I ran into the very same issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflicted ephemeral node".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to UpdateMet

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:12 PM:
--

Interestingly I ran into the very same issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflicted ephemeral node".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to UpdateMet

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:22 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflicted ephemeral node".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to UpdateMe

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:27 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflicted ephemeral node".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to UpdateMe

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:30 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflict in {{/controller}} data".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leader info 
(LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEp
och:1),ReplicationFactor:1),AllReplicas:0) for partition [test,0] in response 
to U

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:35 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflict in {{/controller}} data".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: 0   Replicas: 0 Isr: 0

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached leade

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:35 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflict in {{/controller}} data".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: 0   Replicas: 0 Isr: 0

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE Broker 0 cached l

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:36 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time when this bug was triggered, which I 
could observe through errors in log files and through errors being displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflict in {{/controller}} data".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: 0   Replicas: 0 Isr: 0

# I started a console producer and manually sent a handful of test messages, 
worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages 
sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue like I hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console producer (see below), now configured to use the 
group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition, no ISR available):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test  PartitionCount:1ReplicationFactor:1 Configs:
Topic: test Partition: 0Leader: -1  Replicas: 0 Isr:
{code}

Log files such as {{state-change.log}} showed these error messages:

{code}
[2014-04-29 07:44:47,816] TRACE

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-29 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984265#comment-13984265
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/29/14 1:45 PM:
--

Interestingly I ran into a very similar issue while doing basic validation of 
Kafka 0.8.1.1.  Note that unlike Jason Rosenberg's case I was only using 
code/tools/scripts that are shipped with Kafka (e.g. console producer and 
console consumer).

Here is an example log message, which was repeated indefinitely:

{code}
[2014-04-29 09:48:27,207] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398764901156"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398764894941"} (kafka.utils.ZkUtils$)
[2014-04-29 09:48:27,218] INFO I wrote this conflicted ephemeral node 
[{"version":1,"brokerid":0,"timestamp":"1398764901156"}] at /controller a while 
back in a different session, hence I will backoff for this node to be deleted 
by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}
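
As a side note for anyone trying to diagnose this: one way to double-check what 
is actually stored in the {{/controller}} znode is the ZooKeeper shell that 
ships with Kafka.  This is only a sketch -- it assumes ZooKeeper is reachable at 
{{zookeeper1:2181}} and that it is run from the Kafka installation directory:

{code}
# Print the current payload of the /controller ephemeral node; the JSON should
# show which broker id currently claims to be the controller (plus znode stats).
$ bin/zookeeper-shell.sh zookeeper1:2181 get /controller
{code}

In my case the stored payload and the payload the broker was trying to write 
differed only in the timestamp, i.e. both were written by broker 0.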

*How to reproduce (unsolved)*

Unfortunately I cannot consistently reproduce the issue and, to be honest, I am 
still not sure what actually triggers the bug.  As such I can only summarize 
what I did before and around the time the bug was triggered; the bug itself I 
could only observe through errors in the log files and through errors displayed 
after running certain commands.  So yes, it's a bit like shooting in the dark.

Here's an overview of my test setup:

- I deployed Kafka 0.8.1.1 to one machine {{kafka1}} and ZooKeeper 3.4.5 to a 
second machine {{zookeeper1}}.
- I followed the Kafka 0.8.1 quick start guide to create a topic "test" with 1 
partition and a replication factor of 1.
- I sent test messages to the topic "test" via the console producer.
- I read test messages from the topic "test" via the console consumer.
- Apart from producing and consuming a handful of test messages I also ran some 
supposedly read-only admin commands such as "describing" the topic, and running 
the consumer offset checker tool.
- At "some" point, Kafka was caught in an indefinite loop complaining about 
"conflict in {{/controller}} data".


The following paragraphs list in more detail what I did before the error popped 
up.

Producer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# This command returned no topics at this point = worked as expected
$ bin/kafka-topics.sh --list --zookeeper zookeeper1

# I created a topic, worked as expected
$ bin/kafka-topics.sh --create --zookeeper zookeeper1 --replication-factor 1 
--partitions 1 --topic test

# I requested details of topic "test", worked as expected
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test   PartitionCount:1   ReplicationFactor:1   Configs:
   Topic: test   Partition: 0   Leader: 0   Replicas: 0   Isr: 0

# I started a console producer and manually sent a handful of test messages;
# worked as expected (see consumer below)
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
{code}

Consumer:

{code}
$ sudo su - kafka
$ cd /opt/kafka

# I started a console consumer, worked as expected (i.e. could read messages
# sent by producer, see above)
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning
{code}

Up to that point, everything worked.  But then the Kafka broker went the way of 
the dodo.  As I said, I can't pinpoint the cause, and re-running the same 
commands on a fresh Kafka/ZooKeeper deployment (fresh VMs etc.) didn't 
consistently trigger the issue as I had hoped.

Here's what I did after the commands above, and at some point I eventually did 
observe the original error described in this JIRA ticket.  Again, at the moment 
I cannot tell what actually triggered the bug.

Producer:

{code}
# Test-driving the consumer offset checker
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1
# At this point consumer "foo" was not expected to exist.
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect 
zookeeper1 --broker-info --group foo

#
# I then re-started the console consumer (see below), now configured to use
# the group id "foo".
#
{code}

Consumer:

{code}
# I re-started the console consumer, now configured to use the group id "foo".
$ bin/kafka-console-consumer.sh --zookeeper zookeeper1 --topic test 
--from-beginning --group foo
{code}

At this point "describing" the topic gave the following info, indicating that 
there was a problem (e.g. no leader for partition, no ISR available):

{code}
$ bin/kafka-topics.sh --zookeeper zookeeper1 --describe --topic test
Topic:test   PartitionCount:1   ReplicationFactor:1   Configs:
   Topic: test   Partition: 0   Leader: -1   Replicas: 0   Isr:
{code}
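
The same information can also be read directly out of ZooKeeper, which can be 
handy when the admin tools themselves start misbehaving.  Again just a sketch, 
assuming the standard 0.8.x znode layout and ZooKeeper at {{zookeeper1:2181}}:

{code}
# Partition state as Kafka 0.8.x stores it in ZooKeeper; a "leader":-1 and an
# empty "isr" list here would match the describe output above.
$ bin/zookeeper-shell.sh zookeeper1:2181 get /brokers/topics/test/partitions/0/state

# Which brokers are currently registered (should list broker id 0):
$ bin/zookeeper-shell.sh zookeeper1:2181 ls /brokers/ids
{code}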

Log files such as {{state-change.log}} showed these error messages:

{code}


[2014-04-29 07:44:47,673] TRA
{code}

[jira] [Comment Edited] (KAFKA-1029) Zookeeper leader election stuck in ephemeral node retry loop

2014-04-30 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985617#comment-13985617
 ] 

Michael Noll edited comment on KAFKA-1029 at 4/30/14 3:33 PM:
--

I could finally reproduce the bug again, though still not consistently.

And I think I have an idea what the root cause might be:  I was test-driving 
Kafka in several VMs locally on a MacBook Pro, and the energy saver settings 
(or something similar) might be responsible for triggering the problem when the 
laptop was idle.  (In the test run below I intentionally let the laptop idle 
for 2-3 hours after creating a "test" topic and sending/consuming a few test 
messages.)  The timestamp of when the problem was triggered (11:01) correlates 
with me unlocking the laptop (user screen).  What's strange is that the laptop 
is configured not to go to sleep when idle, and the hard disk is not asked to 
sleep either (even if you enable that setting in Mac OS X it has no effect on 
SSDs, which is the only kind of disk in the laptop I was using) -- only the 
display is allowed to turn off after 10 minutes.  So even though the laptop is 
configured to be "always on", there is apparently something that throws off 
Kafka/ZooKeeper.  Also, as I said in my earlier comment, I still cannot 
reproduce the issue consistently, i.e. sometimes Kafka/ZK work correctly after 
idling/unlocking; still, I think the root cause ultimately has to do with 
idling on Mac OS X.

Lastly, I don't know what the expected failure handling of Kafka/ZK is in such 
a scenario.  From what I can read in the logs below, my setup seems to have 
simulated what could happen during a network partition (the Kafka broker could 
not talk to ZooKeeper for a long time, hence the ZK session expired; afterwards 
Kafka could talk to ZK again, but could not fully recover).

FWIW, I list further log messages and details below, in case that information 
turns out to be useful in the future.


@Jun Rao:  Yes, I did see ZK expiration in {{controller.log}}:

{code}
INFO [SessionExpirationListener on 0], ZK expired; shut down all controller 
components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
{code}
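
If the underlying trigger really is the VM host pausing the guests (and with 
them the ZK heartbeats) for long stretches, then one knob that should make such 
a test setup less sensitive is the broker's ZooKeeper session timeout in 
{{config/server.properties}}.  This is just a sketch of the idea, not a 
recommendation -- 6000 ms is the 0.8.x default and 30000 ms is an arbitrary 
illustrative value:

{code}
# config/server.properties on kafka1
zookeeper.connect=zookeeper1:2181
# Default is 6000 ms; a longer timeout tolerates longer pauses before ZooKeeper
# expires the session and forces the controller to re-elect.
zookeeper.session.timeout.ms=30000
{code}

Of course this only papers over the pauses; it does not explain why the broker 
cannot recover once the session has expired, which is the actual bug discussed 
in this ticket.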

Below are some further details.  Things turn bad at 11:01.

*kafka1 (Kafka broker)*

{{server.log}}

{code}
[2014-04-30 07:18:56,481] INFO Completed load of log test-0 with log end offset 
0 (kafka.log.Log)
[2014-04-30 07:18:56,485] INFO Created log for partition [test,0] in 
/app/kafka/log with properties {segment.index.bytes -> 10485760, 
file.delete.delay.ms -> 6, segment.bytes -> 1073741824, flush.ms -> 
9223372036854775807, delete.retention.ms -> 8640, index.interval.bytes -> 
4096, retention.bytes -> -1, cleanup.policy -> delete, segment.ms -> 17280, 
max.message.bytes -> 112, flush.messages -> 9223372036854775807, 
min.cleanable.dirty.ratio -> 0.5, retention.ms -> 17280}. 
(kafka.log.LogManager)
[2014-04-30 07:18:56,486] WARN Partition [test,0] on broker 0: No checkpointed 
highwatermark is found for partition [test,0] (kafka.cluster.Partition)
[2014-04-30 07:19:32,637] INFO Closing socket connection to /127.0.1.1. 
(kafka.network.Processor)
[2014-04-30 07:19:37,328] INFO Closing socket connection to /127.0.0.1. 
(kafka.network.Processor)
[2014-04-30 07:19:56,356] INFO Closing socket connection to /127.0.1.1. 
(kafka.network.Processor)
[2014-04-30 07:20:52,090] ERROR Closing socket for /127.0.1.1 because of error 
(kafka.network.Processor)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
at sun.nio.ch.IOUtil.read(IOUtil.java:224)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
at kafka.utils.Utils$.read(Utils.scala:375)
at 
kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Processor.read(SocketServer.scala:347)
at kafka.network.Processor.run(SocketServer.scala:245)
at java.lang.Thread.run(Thread.java:701)
[2014-04-30 07:21:02,805] INFO Closing socket connection to /127.0.1.1. 
(kafka.network.Processor)
[2014-04-30 07:21:05,803] INFO Closing socket connection to /127.0.0.1. 
(kafka.network.Processor)
[2014-04-30 11:01:57,138] INFO Closing socket connection to /127.0.1.1. 
(kafka.network.Processor)
[2014-04-30 11:01:59,561] INFO Closing socket connection to /127.0.1.1. 
(kafka.network.Processor)
[2014-04-30 11:01:59,648] INFO 0 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2014-04-30 11:01:59,692] INFO conflict in /controller data: 
{"version":1,"brokerid":0,"timestamp":"1398855719690"} stored data: 
{"version":1,"brokerid":0,"timestamp":"1398855719646"} (kafka.utils.ZkUti