While updating a small Kafka client application to use Kafka 1.1.1 and the KafkaProducer (rather than the deprecated Producer), a maintenance-related unit test started failing. The scenario is out of the ordinary, but the resulting problem may still be of interest:
* Two Kafka nodes
* One KafkaProducer
* "retry.backoff.ms" = 3000
  * Unchanged from the existing code.
  * Perhaps a bad setting.
  * Plays an interesting role in the issue.
* Send two messages
* Shut down Kafka Node 0
* Send two messages
* Restart Kafka Node 0
* Shut down Kafka Node 1
* Send two messages
  * This is the step that fails.

Detailed investigation shows that:

* Setting "retry.backoff.ms" seems to also increase the minimum metadata refresh interval.
* When Node 0 goes down, a metadata refresh does not occur immediately - it is delayed about 3 seconds.
* When it does occur, Node 0 is gone and the Cluster is recreated with only Node 1.
* When Kafka Node 0 is restarted, nothing seems to trigger a refresh right away.
* When Kafka Node 1 goes down, a metadata refresh is attempted, but the only node in the current Cluster instance is Node 1, which is now unavailable.
* All metadata refresh attempts fail (forever?)
  * NetworkClient.maybeUpdate (after the call to leastLoadedNode):
  * log.debug("Give up sending metadata request since no node is available");

I noted the following:

* If "retry.backoff.ms" is not overridden the problem goes away, but for a strange reason:
  * The metadata refresh triggered by each shutdown happens so fast that it returns both nodes (even though one is in the process of going down), and the Cluster is recreated with both nodes still listed.
* With a 3-node configuration the problem is much harder to reproduce.
  * I have done so with a unit test added to the Kafka source code, based on the code in DynamicBrokerReconfigurationTest, mostly to better understand the issue.
  * It requires:
    * Shutting down two nodes at the same time
    * Waiting for a metadata update that catches this state (or overriding the "retry.backoff.ms" value)
      * The Cluster is now recreated with just one node
    * Returning both nodes to service
    * Shutting down the third node
    * Sending a message before a metadata update triggers a Cluster rebuild
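The sequence above can be illustrated with a toy model (plain Java, not the actual Kafka classes - the names clusterView and leastLoadedNode here are mine and only mirror the behavior I observed):

```java
import java.util.*;

// Illustrative sketch of how the producer can paint itself into a corner:
// the metadata "cluster" view is rebuilt only from the last successful
// refresh, so once a refresh sees only Node 1, and Node 1 then dies,
// there is no node left to ask for fresh metadata.
public class MetadataDeadEnd {
    static Set<String> liveBrokers = new HashSet<>();      // actually reachable
    static List<String> clusterView = new ArrayList<>();   // producer's stale view

    // crude stand-in for NetworkClient.leastLoadedNode(): any reachable
    // node from the current cluster view, or null if none
    static String leastLoadedNode() {
        for (String node : clusterView)
            if (liveBrokers.contains(node)) return node;
        return null;
    }

    // a refresh succeeds only if some node in the current view is reachable;
    // on success the view is replaced with the currently live brokers
    static boolean tryMetadataRefresh() {
        if (leastLoadedNode() == null) {
            System.out.println("Give up sending metadata request since no node is available");
            return false;
        }
        clusterView = new ArrayList<>(liveBrokers);
        return true;
    }

    public static void main(String[] args) {
        liveBrokers.addAll(Arrays.asList("node0", "node1"));
        clusterView = new ArrayList<>(liveBrokers);

        liveBrokers.remove("node0");   // shut down Node 0
        tryMetadataRefresh();          // delayed refresh now sees only node1
        System.out.println("view after refresh: " + clusterView); // [node1]

        liveBrokers.add("node0");      // restart Node 0: no refresh is triggered
        liveBrokers.remove("node1");   // shut down Node 1

        // refresh now fails forever: the stale view contains only node1
        System.out.println("refresh ok? " + tryMetadataRefresh()); // false
    }
}
```

The point of the toy model is only that a view-driven refresh has no way out once the view contains nothing reachable, even though node0 is back up and serving.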
I realize the unit test could be rewritten to be more realistic, but I still think the error state it triggers may be of interest. Getting into a state where the metadata can never be updated is pretty serious, and perhaps there are more realistic ways to reach it. I looked at the old Producer code and it does not seem to have this issue. It appears to keep a fixed list of metadata brokers that it polls at random, so as long as at least one is reachable it will eventually get a metadata reply. In one run of the unit test with the old code, it tried the out-of-service node 5 times in a row before it randomly selected the in-service one. Perhaps not elegant, but the unit test passed.
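For comparison, the old Producer's behavior as I understand it can be sketched like this (the class and method names below are mine, not the actual old-client code):

```java
import java.util.*;

// Sketch of the old approach: a fixed broker list polled at random until
// one answers. Unreachable picks are wasted attempts (I saw 5 misses in a
// row in one run), but a single reachable broker guarantees eventual success.
public class FixedListMetadataFetch {
    // returns the broker that answered, or null if the attempt cap is hit
    // (the cap is mine, added so the sketch cannot loop forever)
    static String fetchMetadata(List<String> fixedBrokerList,
                                Set<String> reachable,
                                Random random) {
        for (int attempt = 0; attempt < 1000; attempt++) {
            String candidate =
                fixedBrokerList.get(random.nextInt(fixedBrokerList.size()));
            if (reachable.contains(candidate)) return candidate;
            // in the real client each miss costs a connect/timeout cycle
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> brokers = Arrays.asList("node0", "node1");
        Set<String> up = new HashSet<>(Collections.singleton("node1"));
        System.out.println("answered by: "
            + fetchMetadata(brokers, up, new Random(42))); // → answered by: node1
    }
}
```

The contrast with the new client is that the poll list here is fixed at configuration time rather than rebuilt from the last metadata response, which is exactly why it cannot get stuck the same way.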