[jira] [Commented] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730439#comment-13730439 ]

Swapnil Ghike commented on KAFKA-999:

Since we need the leader broker's host:port to create a ReplicaFetcherThread, the easiest fix for this ticket's purpose seems to be to pass all leaders through LeaderAndIsrRequest:

val leaders = liveOrShuttingDownBrokers.filter(b => leaderIds.contains(b.id))
val leaderAndIsrRequest = new LeaderAndIsrRequest(partitionStateInfos, leaders, controllerId, controllerEpoch, correlationId, clientId)

Any suggestions on avoiding a wire protocol change?

Controlled shutdown never succeeds until the broker is killed
Key: KAFKA-999
URL: https://issues.apache.org/jira/browse/KAFKA-999
Project: Kafka
Issue Type: Bug
Components: controller
Affects Versions: 0.8
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical

A race condition between the broker's handling of the leaderAndIsr request and controlled shutdown can lead to a situation where controlled shutdown never succeeds and the only way to bounce the broker is to kill it. The root cause is that the broker uses a smart check to avoid fetching from a leader that is not alive according to the controller. This leads to the broker aborting a become-follower request. In cases where the replication factor is 2, leadership can never be transferred to the follower, since the follower keeps rejecting the become-follower request and stays out of the ISR. This causes controlled shutdown to fail forever.

One sequence of events that led to this bug is as follows:
- Broker 2 is leader and controller
- Broker 2 is bounced (uncontrolled shutdown)
- Controller fails over
- Controlled shutdown is invoked on broker 1
- Controller starts leader election for partitions that broker 2 led
- Controller sends a become-follower request with broker 1 as the leader to broker 2. At the same time, it does not include broker 1 in the alive broker list sent as part of the leader and ISR request
- Broker 2 rejects the leaderAndIsr request since the leader is not in the list of alive brokers
- Broker 1 fails to transfer leadership to broker 2 since broker 2 is not in the ISR
- Controlled shutdown can never succeed on broker 1

Since controlled shutdown is a config option, if there are bugs in controlled shutdown, there is no option but to kill the broker.
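A minimal sketch of the idea, using simplified stand-in classes (these are not the actual Kafka 0.8 request or cluster classes, only an illustration of the shape of the change): the controller filters its live-or-shutting-down broker set down to the brokers that lead at least one partition in the request, and the receiving broker resolves the leader's host:port from that list instead of rejecting the request because the leader is missing from the alive-broker list.
---
// Illustration only: simplified stand-ins, not Kafka 0.8's real classes.
case class Broker(id: Int, host: String, port: Int)
case class PartitionStateInfo(leaderId: Int, isr: Set[Int])
case class LeaderAndIsrRequest(partitionStateInfos: Map[(String, Int), PartitionStateInfo],
                               leaders: Set[Broker],
                               controllerId: Int,
                               controllerEpoch: Int,
                               correlationId: Int,
                               clientId: String)

object LeaderAndIsrSketch {
  // Controller side: include every broker that leads some partition in this request,
  // even one that is shutting down, so followers can still locate it.
  def buildRequest(partitionStateInfos: Map[(String, Int), PartitionStateInfo],
                   liveOrShuttingDownBrokers: Set[Broker],
                   controllerId: Int, controllerEpoch: Int,
                   correlationId: Int, clientId: String): LeaderAndIsrRequest = {
    val leaderIds = partitionStateInfos.values.map(_.leaderId).toSet
    val leaders = liveOrShuttingDownBrokers.filter(b => leaderIds.contains(b.id))
    LeaderAndIsrRequest(partitionStateInfos, leaders, controllerId, controllerEpoch,
                        correlationId, clientId)
  }

  // Broker side: a follower resolves the leader's endpoint from the request itself,
  // so it can start a fetcher even if the leader is not in the alive-broker list.
  def leaderEndpoint(request: LeaderAndIsrRequest, leaderId: Int): Option[(String, Int)] =
    request.leaders.find(_.id == leaderId).map(b => (b.host, b.port))
}
---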
Fixes for 0.8 in trunk, absent from 0.8 branch
Hey guys,

There are a few fixes which are marked for 0.8 and fixed in JIRA, but have only been pushed to trunk and are missing from 0.8:

https://issues.apache.org/jira/browse/KAFKA-925
https://issues.apache.org/jira/browse/KAFKA-852
https://issues.apache.org/jira/browse/KAFKA-995
https://issues.apache.org/jira/browse/KAFKA-985 (also affectsVersion is 0.9 and fixVersion is 0.8.1)
https://issues.apache.org/jira/browse/KAFKA-615

Were these mislabeled in JIRA, or should they be present on the 0.8 branch? At the same time there seem to be issues on the 0.8 branch which are absent from trunk (e.g. KAFKA-989).

Cosmin
[jira] [Commented] (KAFKA-347) change number of partitions of a topic online
[ https://issues.apache.org/jira/browse/KAFKA-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730616#comment-13730616 ]

Cosmin Lehene commented on KAFKA-347:

Is there some operational documentation on how to use this?

change number of partitions of a topic online
Key: KAFKA-347
URL: https://issues.apache.org/jira/browse/KAFKA-347
Project: Kafka
Issue Type: Improvement
Components: core
Affects Versions: 0.8
Reporter: Jun Rao
Assignee: Sriram Subramanian
Labels: features
Fix For: 0.8.1
Attachments: kafka-347.patch, kafka-347-v2.patch, KAFKA-347-v2-rebased.patch, KAFKA-347-v3.patch, KAFKA-347-v4.patch, KAFKA-347-v5.patch

We will need an admin tool to change the number of partitions of a topic online.
[jira] [Commented] (KAFKA-989) Race condition shutting down high-level consumer results in spinning background thread
[ https://issues.apache.org/jira/browse/KAFKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730787#comment-13730787 ]

Phil Hargett commented on KAFKA-989:

Thank you!

Race condition shutting down high-level consumer results in spinning background thread
Key: KAFKA-989
URL: https://issues.apache.org/jira/browse/KAFKA-989
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8
Environment: Ubuntu Linux x64
Reporter: Phil Hargett
Assignee: Phil Hargett
Fix For: 0.8
Attachments: KAFKA-989-failed-to-find-leader.patch, KAFKA-989-failed-to-find-leader-patch2.patch, KAFKA-989-failed-to-find-leader-patch3.patch

An application that uses the Kafka client under load can often hit this issue within a few hours. High-level consumers come and go over this application's lifecycle, but there are a variety of defenses that ensure each high-level consumer lasts several seconds before being shut down. Nevertheless, some race is causing this background thread to continue long after the ZKClient it is using has been disconnected. Since the thread was spawned by a consumer that has already been shut down, the application has no way to find this thread and stop it.

Reported on the users-kafka mailing list 6/25 as "0.8 throwing exception 'Failed to find leader' and high-level consumer fails to make progress". The only remedy is to shut down the application and restart it. Externally detecting that this state has occurred is not pleasant: one needs to grep the log for repeated occurrences of the same exception.

Stack trace:

Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
        at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
        at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
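For context on where a fix for this kind of race heads, a sketch (illustrative names, not the actual kafka.utils.ShutdownableThread code) of the pattern a background leader-finder loop needs so it cannot keep spinning after its consumer is shut down: re-check a shutdown flag on every iteration, and have shutdown wake the thread and wait until the loop has actually exited before the ZK client is closed.
---
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of a shutdown-aware background loop (illustrative names, not Kafka's).
abstract class StoppableWorker(name: String) extends Thread(name) {
  private val running = new AtomicBoolean(true)
  private val done = new CountDownLatch(1)

  def doWork(): Unit   // one iteration, e.g. find leaders and add fetchers

  override def run(): Unit = {
    try {
      while (running.get) doWork()   // the flag is re-checked before every iteration
    } finally {
      done.countDown()
    }
  }

  def shutdown(): Unit = {
    running.set(false)
    interrupt()   // wake the thread if it is blocked inside doWork()
    done.await()  // return only after the loop has exited, so the ZK client can be
                  // closed afterwards without leaving a spinning thread behind
  }
}
---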
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730916#comment-13730916 ]

Jun Rao commented on KAFKA-992:

Thinking about this more. The same ZK issue can affect the controller and the consumer re-registration too. Should those be handled too?

Double Check on Broker Registration to Avoid False NodeExist Exception
Key: KAFKA-992
URL: https://issues.apache.org/jira/browse/KAFKA-992
Project: Kafka
Issue Type: Bug
Reporter: Neha Narkhede
Assignee: Guozhang Wang
Attachments: KAFKA-992.v1.patch, KAFKA-992.v2.patch, KAFKA-992.v3.patch, KAFKA-992.v4.patch

The current behavior of zookeeper for ephemeral nodes is that session expiration and ephemeral node deletion are not an atomic operation. The side effect of this zookeeper behavior in Kafka, for certain corner cases, is that ephemeral nodes can be lost even if the session is not expired. The sequence of events that can lead to lossy ephemeral nodes is as follows:

1. The session expires on the client; it assumes the ephemeral nodes are deleted, so it establishes a new session with zookeeper and tries to re-create the ephemeral nodes.
2. However, when it tries to re-create the ephemeral node, zookeeper throws back a NodeExists error code. This is legitimate during a session disconnect event (since zkclient automatically retries the operation and raises a NodeExists error). Also, by design, Kafka doesn't have multiple zookeeper clients create the same ephemeral node, so the Kafka server assumes the NodeExists is normal.
3. However, after a few seconds zookeeper deletes that ephemeral node. So from the client's perspective, even though the client has a new valid session, its ephemeral node is gone.

This behavior is triggered by very long fsync operations on the zookeeper leader. When the leader wakes up from such a long fsync operation, it has several sessions to expire, and the time between the session expiration and the ephemeral node deletion is magnified. Between these two operations, a zookeeper client can issue an ephemeral node creation operation that appears to have succeeded, but the leader later deletes the ephemeral node, leading to permanent ephemeral node loss from the client's perspective.

Thread from zookeeper mailing list: http://zookeeper.markmail.org/search/?q=Zookeeper+3.3.4#query:Zookeeper%203.3.4%20date%3A201307%20+page:1+mid:zma242a2qgp6gxvx+state:results
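A rough sketch of the double check the attached patches aim at (simplified; the real KAFKA-992 change lives in the broker's ZK utilities and handles more cases): on a NodeExists error, read the existing znode, and if it carries this broker's own host:port, treat it as the stale node left over from the expired session, wait for ZooKeeper to delete it, and retry instead of assuming the registration succeeded.
---
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.exception.ZkNodeExistsException

// Sketch only: the registration data, retry count and sleep are illustrative.
object BrokerRegistrationSketch {
  def registerBroker(zk: ZkClient, path: String, hostPort: String, maxRetries: Int = 10): Unit = {
    var attempts = 0
    var registered = false
    while (!registered && attempts < maxRetries) {
      try {
        zk.createEphemeral(path, hostPort)
        registered = true
      } catch {
        case _: ZkNodeExistsException =>
          val existing: String = zk.readData(path, true)  // null if the node is already gone
          if (existing == null || existing == hostPort) {
            // Either the stale node from our own expired session, or it was just deleted:
            // wait for ZooKeeper to finish the deletion and retry the creation.
            Thread.sleep(1000)
            attempts += 1
          } else {
            throw new RuntimeException("Registration node " + path + " is owned by another broker: " + existing)
          }
      }
    }
    if (!registered)
      throw new RuntimeException("Could not register broker at " + path + " after " + maxRetries + " retries")
  }
}
---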
Re: Fixes for 0.8 in trunk, absent from 0.8 branch
Yes. What is perhaps confusing is that 0.8 is essentially the release branch; only fixes required for 0.8 final are going there. All other development is on trunk, which will be kafka.next, which we are calling 0.8.1. Hope that helps.

-Jay

On Tue, Aug 6, 2013 at 9:25 AM, Jun Rao jun...@gmail.com wrote:
Actually, all of them are marked as fixed in 0.8.1.
Thanks,
Jun

On Tue, Aug 6, 2013 at 3:43 AM, Cosmin Lehene cleh...@adobe.com wrote:
Hey guys, There are a few fixes which are marked for 0.8 and fixed in JIRA but have only been pushed to trunk and missing from 0.8:
https://issues.apache.org/jira/browse/KAFKA-925
https://issues.apache.org/jira/browse/KAFKA-852
https://issues.apache.org/jira/browse/KAFKA-995
https://issues.apache.org/jira/browse/KAFKA-985 (also affectsVersion is 0.9 and fixVersion is 0.8.1)
https://issues.apache.org/jira/browse/KAFKA-615
Were this mislabeled in JIRA or should they be present on the 0.8 branch. At the same time there seem to be issues on the 0.8 branch which are absent from trunk (e.g. KAFKA-989).
Cosmin
[jira] [Commented] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731174#comment-13731174 ]

Swapnil Ghike commented on KAFKA-999:

Actually that's not needed, will get a patch out in a couple hours.
[jira] [Issue Comment Deleted] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-999:
Comment: was deleted (was: Actually that's not needed, will get a patch out in a couple hours.)
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731187#comment-13731187 ]

Joel Koshy commented on KAFKA-992:

Delayed review - looks good to me, although I still don't see a benefit in storing the timestamp; the approach of retrying on NodeExists if the host and port are the same would remain the same, so it seems to be more for informative purposes. Let me know if I'm missing something.

@Jun, you have a point about the controller. It seems it may not be a problem there, since controller re-election will happen only after the data is actually deleted. For consumers it may not be an issue either, given that the consumer id string includes a random UUID.
[jira] [Updated] (KAFKA-990) Fix ReassignPartitionCommand and improve usability
[ https://issues.apache.org/jira/browse/KAFKA-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriram Subramanian updated KAFKA-990:
Attachment: KAFKA-990-v1.patch

Fix ReassignPartitionCommand and improve usability
Key: KAFKA-990
URL: https://issues.apache.org/jira/browse/KAFKA-990
Project: Kafka
Issue Type: Bug
Reporter: Sriram Subramanian
Assignee: Sriram Subramanian
Attachments: KAFKA-990-v1.patch

1. The tool does not register for the IsrChangeListener on controller failover.
2. There is a race condition where the previous listener can fire on controller failover and the replicas can be in ISR. Even after re-registering the ISR listener after failover, it will never be triggered.
3. The input to the tool is a static list, which is very hard to use. To improve this, as a first step the tool needs to take a list of topics and a list of brokers to do the assignment to, and then generate the reassignment plan (see the sketch after this message).
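On point 3, a small sketch of what "take a list of topics and a list of brokers and generate the reassignment plan" could look like (purely illustrative round-robin placement, not the actual KAFKA-990 patch):
---
// Illustrative only: a naive round-robin reassignment plan generator.
// topics maps topic name -> (number of partitions, replication factor).
object ReassignmentPlanSketch {
  def generatePlan(topics: Map[String, (Int, Int)],
                   targetBrokers: Seq[Int]): Map[(String, Int), Seq[Int]] = {
    require(targetBrokers.nonEmpty, "need at least one target broker")
    topics.flatMap { case (topic, (numPartitions, replicationFactor)) =>
      require(replicationFactor <= targetBrokers.size, "not enough brokers for the replication factor")
      (0 until numPartitions).map { partition =>
        // Start each partition at a different broker so leadership spreads out,
        // then take the next replicationFactor brokers in ring order.
        val replicas = (0 until replicationFactor)
          .map(r => targetBrokers((partition + r) % targetBrokers.size))
        (topic, partition) -> replicas
      }
    }
  }

  // Example: topic "events" with 4 partitions and replication factor 2 onto brokers 1, 2, 3
  // yields ("events",0)->Seq(1,2), ("events",1)->Seq(2,3), ("events",2)->Seq(3,1), ("events",3)->Seq(1,2).
  def main(args: Array[String]): Unit =
    println(generatePlan(Map("events" -> (4, 2)), Seq(1, 2, 3)))
}
---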
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731282#comment-13731282 ]

Joel Koshy commented on KAFKA-992:

ok nm the comment about timestamp. I had forgotten that nodeexists wouldn't be thrown if the data is the same.
[jira] [Updated] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-999:
Attachment: kafka-999-v1.patch
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731337#comment-13731337 ]

Joel Koshy commented on KAFKA-992:

and nm for my comments about controller/consumers as well. For consumers, we don't regenerate the consumer id string. For controller, what can end up happening is:
- controller session expires and becomes the controller again (with the stale ephemeral node)
- another broker (whose session may not have expired) receives a watch when the stale ephemeral node is actually deleted
- so we can end up with two controllers in this scenario.
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731352#comment-13731352 ]

Neha Narkhede commented on KAFKA-992:

We just found a way to reliably reproduce the zookeeper bug and verify that the KAFKA-992 fix works. Now we can fix the controller and consumer the same way.
[jira] [Commented] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731353#comment-13731353 ]

Guozhang Wang commented on KAFKA-992:

The zookeeper bug can be reproduced as follows:

1. Check out a clean 0.8 branch and revert the KAFKA-992 fix.
2. Build and create a server connecting to a Zookeeper instance (make sure maxClientCnxns=0 in the ZK config so that one IP address can create as many connections as wanted).
3. Load the Zookeeper with dummy sessions, each of which creates and maintains a thousand ephemeral nodes.
4. Write a script that pauses and resumes the Zookeeper process continuously, for example:
---
while true
do
  kill -STOP $1
  sleep 8
  kill -CONT $1
  sleep 60
done
---
5. When the Zookeeper process resumes, it will mark all those sessions as timed out, but since there are so many ephemeral nodes to delete, the server's registration node may not have been deleted yet when the server tries to re-register itself, so the server thinks it has registered successfully.
6. Later, Zookeeper deletes the server's registration node without the server being aware of it.
7. If we re-apply the KAFKA-992 patch and redo the same test setup, under similar conditions the server will wait and retry.
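Step 3 above ("load the Zookeeper with dummy sessions") can be approximated with something along these lines (a sketch; paths, counts and timeouts are arbitrary), using the same zkclient library the broker uses, so that the ZK leader has a large backlog of ephemeral nodes to clean up when the paused sessions finally expire:
---
import org.I0Itec.zkclient.ZkClient

// Sketch: open many sessions, each owning many ephemeral nodes, so that session
// expiry after the pause takes the ZooKeeper leader a long time to process.
object DummyZkLoad {
  def main(args: Array[String]): Unit = {
    val zkConnect = if (args.nonEmpty) args(0) else "localhost:2181"
    val clients = (0 until 50).map { s =>
      val zk = new ZkClient(zkConnect, 30000, 30000)      // session timeout, connection timeout
      zk.createPersistent("/loadtest/session-" + s, true) // create the persistent parent path
      (0 until 1000).foreach(i => zk.createEphemeral("/loadtest/session-" + s + "/node-" + i, "x"))
      zk
    }
    println("Holding " + clients.size + " sessions open; kill the process to release them.")
    Thread.sleep(Long.MaxValue)
  }
}
---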
[jira] [Comment Edited] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731353#comment-13731353 ]

Guozhang Wang edited comment on KAFKA-992 at 8/6/13 10:26 PM. The edited comment repeats the reproduction steps above and adds:

Since we can now reproduce the bug and verify the fix, the same fix will be applied to Controller and Consumer.
[jira] [Commented] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731427#comment-13731427 ]

Neha Narkhede commented on KAFKA-999:

Thanks for the patch, Swapnil. Overall, well thought through. A few minor comments:
1. ControllerChannelManager: leaderIds is not used anymore.
2. LeaderAndIsrRequest: the field actually means all the brokers in the cluster, so can we rename it from leaders to allBrokers? Same for Partition.scala and ReplicaManager.scala.
[jira] [Updated] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-999:
Attachment: kafka-999-v2.patch

Thanks for pointing that out. Actually, in ControllerChannelManager we should rather pass liveOrShuttingDownBrokers.filter(b => leaderIds.contains(b.id)) as the leaders to LeaderAndIsrRequest. Attached patch v2.
[jira] [Updated] (KAFKA-992) Double Check on Broker Registration to Avoid False NodeExist Exception
[ https://issues.apache.org/jira/browse/KAFKA-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-992:
Attachment: KAFKA-992.v5.patch
[jira] [Updated] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-999:
Attachment: kafka-999-v3.patch
[jira] [Updated] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-999:
Attachment: (was: LIKAFKA-269-v3.patch)
[jira] [Created] (KAFKA-1003) ConsumerFetcherManager should pass clientId as metricsPrefix to AbstractFetcherManager
Swapnil Ghike created KAFKA-1003:

Summary: ConsumerFetcherManager should pass clientId as metricsPrefix to AbstractFetcherManager
Key: KAFKA-1003
URL: https://issues.apache.org/jira/browse/KAFKA-1003
Project: Kafka
Issue Type: Bug
Reporter: Swapnil Ghike
Assignee: Swapnil Ghike

For consistency. We use clientId in the metric names elsewhere on clients.
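A sketch of the naming convention being asked for (the class and metric names here are hypothetical, only to illustrate the prefix): the consumer passes its clientId down as the metrics prefix, so fetcher-manager metrics carry the same client-scoped prefix used by the other client metrics.
---
// Hypothetical sketch, not the actual Kafka classes: the fetcher manager simply
// prefixes its metric names with whatever prefix it is constructed with.
class FetcherManagerMetricsSketch(metricsPrefix: String) {
  def metricName(metric: String): String = metricsPrefix + "-" + metric
}

object MetricsPrefixExample extends App {
  // Before: a generic prefix such as "ConsumerFetcherManager".
  // After: the consumer passes its clientId, so the metric lines up with other client metrics.
  val metrics = new FetcherManagerMetricsSketch("my-client-0")
  println(metrics.metricName("MaxLag"))   // prints: my-client-0-MaxLag
}
---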
[jira] [Updated] (KAFKA-1003) ConsumerFetcherManager should pass clientId as metricsPrefix to AbstractFetcherManager
[ https://issues.apache.org/jira/browse/KAFKA-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Swapnil Ghike updated KAFKA-1003:
Attachment: kafka-1003.patch
[jira] [Updated] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed
[ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-999:
Resolution: Fixed
Status: Resolved (was: Patch Available)

Committed v3 to 0.8.