Ewen Cheslack-Postava created KAFKA-2459:
--------------------------------------------
Summary: Connection backoff/blackout period should start when a
connection is disconnected, not when the connection attempt was initiated
Key: KAFKA-2459
URL: https://issues.apache.org/jira/browse/KAFKA-2459
Project: Kafka
Issue Type: Bug
Components: clients, consumer, producer
Affects Versions: 0.8.2.1
Reporter: Ewen Cheslack-Postava
Assignee: Neha Narkhede
Currently the connection code for new clients marks the time when a connection
was initiated (NodeConnectionState.lastConnectMs) and then uses this to compute
blackout periods for nodes, during which connections will not be attempted and
the node is not considered a candidate for leastLoadedNode.
However, when a connection attempt takes longer than the blackout/backoff
period (default 10ms), this results in incorrect behavior. If a broker is not
available and, for example, does not explicitly reject the connection but
instead leaves the client waiting for a connection timeout (e.g. due to
firewall settings), then the backoff period will already have elapsed by the
time the attempt fails. The node is then immediately considered ready for a new
connection attempt and is again eligible to be selected by leastLoadedNode for
metadata updates. I think it should be easy to reproduce and verify this
problem manually by using tc to introduce enough latency to make connection
failures take > 10ms.
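To make the timing concrete, here is a minimal, simplified model of the check
described above. It is not the actual Kafka client code: the class name, the
method names and the 50ms failure time are assumptions chosen for illustration.
Only lastConnectMs and the 10ms default come from the description.

```java
// Simplified model of a blackout check keyed on the connection *start* time.
// Hypothetical sketch; this is not the real NodeConnectionState implementation.
final class BackoffFromAttemptStart {
    private final long reconnectBackoffMs;
    private long lastConnectMs;   // time the last connection attempt was initiated
    private boolean failed;       // true once the attempt has ended in a disconnect

    BackoffFromAttemptStart(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    void connecting(long nowMs) {
        lastConnectMs = nowMs;
        failed = false;
    }

    void disconnected(long nowMs) {
        // nowMs is ignored here: the disconnect time is never recorded.
        failed = true;
    }

    boolean isBlackedOut(long nowMs) {
        // The backoff window is measured from when the attempt started.
        return failed && nowMs - lastConnectMs < reconnectBackoffMs;
    }

    public static void main(String[] args) {
        BackoffFromAttemptStart node = new BackoffFromAttemptStart(10);
        node.connecting(0);      // attempt starts at t = 0ms
        node.disconnected(50);   // attempt times out at t = 50ms (assumed value)
        // 50 - 0 = 50 >= 10, so the node is not blacked out at all.
        System.out.println(node.isBlackedOut(50)); // prints "false"
    }
}
```

In this model the node never spends any time in the blackout state whenever the
failed attempt itself outlasts the backoff period.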
The correct behavior would use the disconnection event to mark the end of the
last connection attempt and then wait for the backoff period to elapse after
that.
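A minimal sketch of that behavior, under the same simplifying assumptions: the
class name, the method names and the lastDisconnectMs field are hypothetical,
and this is not a proposed patch.

```java
// Simplified model of a blackout check keyed on the *disconnect* time.
// Hypothetical sketch; names such as lastDisconnectMs are illustrative only.
final class BackoffFromDisconnect {
    private final long reconnectBackoffMs;
    private long lastDisconnectMs;  // time the last connection attempt ended
    private boolean failed;         // true once the attempt has ended in a disconnect

    BackoffFromDisconnect(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    void connecting(long nowMs) {
        failed = false;
    }

    void disconnected(long nowMs) {
        // Record when the attempt ended; the backoff window starts here.
        lastDisconnectMs = nowMs;
        failed = true;
    }

    boolean isBlackedOut(long nowMs) {
        return failed && nowMs - lastDisconnectMs < reconnectBackoffMs;
    }

    public static void main(String[] args) {
        BackoffFromDisconnect node = new BackoffFromDisconnect(10);
        node.connecting(0);      // attempt starts at t = 0ms
        node.disconnected(50);   // attempt times out at t = 50ms (assumed value)
        System.out.println(node.isBlackedOut(50)); // prints "true"
        System.out.println(node.isBlackedOut(59)); // prints "true"
        System.out.println(node.isBlackedOut(60)); // prints "false"
    }
}
```

With the window anchored at the disconnect, the node stays blacked out for the
full backoff period no matter how long the failed attempt took.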
See
http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EofpeU4%2BAJ%3Dw91HDUx2RabjkWoU00Z%3DcQ2wHcQSrbPT4HA%40mail.gmail.com%3E
for the original description of the problem.
This is related to KAFKA-1843, because leastLoadedNode will currently keep
choosing the same node if this blackout period is not handled correctly, but it
is a much smaller issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)