While updating a small Kafka client application to Kafka 1.1.1 and the 
KafkaProducer (rather than the deprecated Producer), a maintenance-related unit 
test started failing.  The scenario is out of the ordinary, but the resulting 
problem may still be of interest:

 *   Two Kafka nodes
 *   One KafkaProducer
    *   "retry.backoff.ms" = 3000
       *   Unchanged from the existing code.
       *   Perhaps a bad setting.
       *   Plays an interesting role in the issue.
 *   Send two messages
 *   Shut down Kafka Node 0
 *   Send two messages
 *   Restart Kafka Node 0
 *   Shut down Kafka Node 1
 *   Send two messages
    *   This is the step that fails.
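
For concreteness, here is a minimal sketch (not the actual test) of the producer 
setup and sends described above.  The topic name, serializers, and host names 
are assumptions, and the node shutdowns/restarts happen outside the client:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BackoffScenarioSketch {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Two-node cluster; both brokers listed as bootstrap servers.
        props.put("bootstrap.servers", "node0:9092,node1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Carried over from the existing code; also delays metadata refresh attempts.
        props.put("retry.backoff.ms", "3000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            sendTwo(producer);   // both nodes up
            // (externally) shut down Kafka Node 0 here
            sendTwo(producer);   // metadata eventually rebuilt with only Node 1
            // (externally) restart Node 0, then shut down Node 1
            sendTwo(producer);   // fails: no reachable node left in the cached Cluster
        }
    }

    private static void sendTwo(KafkaProducer<String, String> producer) throws Exception {
        for (int i = 0; i < 2; i++) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value")).get();
        }
    }
}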

Detailed investigation shows that:

 *   Setting "retry.backoff.ms" seems to also increase the minimum metadata 
refresh interval.
 *   When Node 0 goes down, a metadata refresh does not occur immediately - it 
is delayed by about 3 seconds.
 *   When the refresh does occur, Node 0 is gone and the Cluster is recreated 
with only Node 1.
 *   When Kafka Node 0 is restarted, nothing seems to trigger a refresh right 
away.
 *   When Kafka Node 1 goes down, a metadata refresh is attempted, but the only 
node in the current Cluster instance is Node 1, which is now unavailable.
    *   All metadata refresh attempts fail (forever?)
    *   NetworkClient.maybeUpdate (after the call to leastLoadedNode) logs:
    *   log.debug("Give up sending metadata request since no node is available");

I noted the following:

 *   If the "retry.backoff.ms" is not overridden the problem goes away, but, 
for a strange reason:
    *   The metadata refreshes triggered by the shutdowns happen so fast that 
it returns both Nodes (even though one is in the process of going down) and the 
Cluster is recreated with both nodes still available.
 *   If a 3-node configuration is used, the problem is much harder to reproduce.
    *   I have done so with a unit test added to the Kafka source code, based 
on the code in DynamicBrokerReconfigurationTest
    *   Mostly to better understand the issue.
    *   It requires (see the sketch after this list):
       *   Shutting down two nodes at the same time
       *   Waiting for a metadata update that catches this state (or overriding 
the "retry.backoff.ms" value)
          *   The Cluster is now recreated with just one node
       *   Returning both nodes to service
       *   Shutting down the third node
       *   Sending a message before a metadata update triggers a Cluster rebuild
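
A rough outline of that sequence follows; startBroker/stopBroker are 
hypothetical helpers standing in for the embedded-cluster plumbing the real 
test borrows from DynamicBrokerReconfigurationTest:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ThreeNodeReproSketch {

    // Stand-in for the embedded-cluster control in the real test harness;
    // startBroker/stopBroker are hypothetical names, not a Kafka API.
    interface BrokerControl {
        void startBroker(int id) throws Exception;
        void stopBroker(int id) throws Exception;
    }

    static void reproduce(BrokerControl cluster,
                          KafkaProducer<String, String> producer) throws Exception {
        cluster.stopBroker(0);       // take two of the three nodes down together
        cluster.stopBroker(1);
        Thread.sleep(4000);          // wait for a (backed-off) refresh that sees only node 2
        cluster.startBroker(0);      // return both nodes to service
        cluster.startBroker(1);
        cluster.stopBroker(2);       // drop the only node the producer still knows about
        // Send before any metadata update rebuilds the Cluster; in my runs this
        // never completes because no known node is available for the refresh.
        producer.send(new ProducerRecord<>("test-topic", "key", "value")).get();
    }
}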

I realize the unit test could be rewritten to be more realistic, but I still 
think the error state it triggers may be of interest.  Getting into a state 
where you can't update the metadata is pretty serious, and perhaps there are 
more realistic ways to trigger it.

I looked at the old Producer code and it does not seem to have an issue like 
this.  It keeps a fixed list of metadata nodes that it polls at random, so as 
long as at least one is reachable it will eventually get a metadata reply.  In 
one run of the unit test with the old code, I observed it try the 
out-of-service node five times in a row before randomly selecting the 
in-service one.  Perhaps not elegant, but the unit test passed.
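
A minimal sketch of that fallback strategy as I understand it (illustrative 
only, not the old Producer's actual code): keep the full bootstrap list and 
pick a broker at random for each metadata attempt, so a dead broker can be 
retried several times but a live one is eventually hit.

import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class RandomMetadataPollSketch {
    private static final Random RANDOM = new Random();

    /**
     * Keep trying randomly chosen brokers from a fixed list until one answers;
     * fetchSucceeds stands in for the actual metadata request.
     */
    static String fetchMetadataFrom(List<String> brokers, Predicate<String> fetchSucceeds) {
        while (true) {
            String broker = brokers.get(RANDOM.nextInt(brokers.size()));
            if (fetchSucceeds.test(broker)) {
                return broker;   // got a metadata reply
            }
            // An unreachable broker may be picked several times in a row, but a
            // live broker will be selected eventually as long as one exists.
        }
    }
}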