While updating a small Kafka client application to use Kafka 1.1.1 and the
KafkaProducer (rather than the deprecated Producer), a maintenance-related unit
test started failing. The scenario is out of the ordinary, but the resulting
problem may still be of interest:
* Two Kafka nodes
* One KafkaProducer
* "retry.backoff.ms" = 3000
  * Unchanged from the existing code
  * Perhaps a bad setting
  * Plays an interesting role in the issue
* Send two messages
* Shut down Kafka Node 0
* Send two messages
* Restart Kafka Node 0
* Shut down Kafka Node 1
* Send two messages
  * This is the step that fails
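For reference, the configuration above amounts to roughly the following. This is
a sketch under my assumptions: the broker addresses and serializer classes are
placeholders, not taken from the actual application.

```java
import java.util.Properties;

public class ProducerConfigSketch {
    // Builds the producer configuration described above.
    // "node0:9092,node1:9092" is a placeholder broker list.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node0:9092,node1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // The setting carried over from the existing code; it also appears
        // to act as a minimum metadata refresh backoff in the new client.
        props.put("retry.backoff.ms", "3000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        System.out.println("retry.backoff.ms = "
                + props.getProperty("retry.backoff.ms"));
    }
}
```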
Detailed investigation shows that:
* Setting "retry.backoff.ms" also seems to increase the minimum metadata
refresh interval.
* When Node 0 goes down, a metadata refresh does not occur immediately - it
is delayed about 3 seconds.
* When it does occur, Node 0 is gone and the Cluster is recreated with only
Node 1.
* When Kafka Node 0 is restarted, nothing seems to trigger a refresh right
away.
* When Kafka Node 1 goes down, a metadata refresh is attempted, but the only
node in the current Cluster instance is Node 1, which is now unavailable.
* All metadata refresh attempts fail (forever?)
  * NetworkClient.maybeUpdate (after the call to leastLoadedNode) logs:
    log.debug("Give up sending metadata request since no node is available");
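A minimal sketch of how the client can wedge. This is my model of the behaviour,
not the actual NetworkClient code, and the node names are placeholders: the
Cluster's node list has shrunk to the one surviving node, so a
leastLoadedNode-style selection over the current view finds nothing reachable
and the refresh request is never sent, even though the restarted node is back.

```java
import java.util.List;
import java.util.Set;

public class MetadataWedgeSketch {
    // Hypothetical stand-in for NetworkClient.leastLoadedNode: pick any
    // node from the *current* cluster view that is reachable, else null.
    static String leastLoadedNode(List<String> clusterView, Set<String> reachable) {
        for (String node : clusterView) {
            if (reachable.contains(node)) return node;
        }
        // "Give up sending metadata request since no node is available"
        return null;
    }

    public static void main(String[] args) {
        // After the delayed refresh, the cluster view holds only node1.
        List<String> clusterView = List.of("node1");
        // node0 has been restarted and is reachable again, node1 is down,
        // but node0 is no longer in the cluster view, so it is never tried.
        Set<String> reachable = Set.of("node0");
        System.out.println(leastLoadedNode(clusterView, reachable)); // prints "null"
    }
}
```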
I noted the following:
* If "retry.backoff.ms" is not overridden, the problem goes away, but for a
strange reason:
  * The metadata refreshes triggered by the shutdowns happen so fast that
they return both nodes (even though one is in the process of going down), and
the Cluster is recreated with both nodes still available.
* If a three-node configuration is used, the problem is much harder to create.
  * I have done so with a unit test added to the Kafka source code, based on
the code in DynamicBrokerReconfigurationTest
    * Mostly to better understand the issue.
  * It requires:
    * Shutting down two nodes at the same time
    * Waiting for a metadata update that catches this state (or overriding
the "retry.backoff.ms" value)
      * The Cluster is now recreated with just one node
    * Returning both nodes to service
    * Shutting down the third node
    * Sending a message before a metadata update triggers a Cluster rebuild
I realize the unit test could be rewritten to be more realistic, but I still
think the error state it triggers may be of interest. Getting into a state
where you can't update the metadata is pretty serious, and perhaps there are
more realistic ways to trigger it.
I looked at the old Producer code, and it does not seem to have an issue like
this. It appears to keep a fixed list of metadata nodes that it polls at
random, so as long as at least one is reachable it will eventually get a
metadata reply. In one run of the unit test with the old code, I observed that
it tried the out-of-service node five times in a row before randomly selecting
the in-service one. Perhaps not elegant, but the unit test passed.
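The old client's recovery, as I understand it, can be sketched like this. It is
a toy simulation under my assumptions, not the actual Producer code: the client
keeps the full, fixed bootstrap list and picks a broker at random on each
attempt, so it eventually lands on a live broker even after a run of misses.

```java
import java.util.List;
import java.util.Random;
import java.util.Set;

public class OldProducerPollSketch {
    // Hypothetical model of the old Producer's metadata polling: pick a
    // broker at random from the fixed bootstrap list until a live one is
    // found, and report how many attempts it took.
    static int attemptsUntilLiveBroker(List<String> bootstrapList,
                                       Set<String> live, Random rng) {
        int attempts = 0;
        while (true) {
            attempts++;
            String pick = bootstrapList.get(rng.nextInt(bootstrapList.size()));
            if (live.contains(pick)) return attempts;
        }
    }

    public static void main(String[] args) {
        List<String> bootstrap = List.of("node0", "node1");
        Set<String> live = Set.of("node1"); // node0 is out of service
        int n = attemptsUntilLiveBroker(bootstrap, live, new Random());
        System.out.println("Reached a live broker after " + n + " attempt(s)");
    }
}
```

In the run described above, the random choice hit the dead broker five times
before finding the live one, which is roughly what this model predicts.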