Ivan Babrou created KAFKA-6013:
----------------------------------

             Summary: Controller getting stuck
                 Key: KAFKA-6013
                 URL: https://issues.apache.org/jira/browse/KAFKA-6013
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.11.0.0, 0.11.0.1
            Reporter: Ivan Babrou


It looks like a new issue in 0.11.0.0 and 0.11.0.1 still has it.

We upgraded one of the clusters from 0.11.0.0 to 0.11.0.1 by shutting down 28 
machines at once (single rack). When nodes came up none of them progressed 
after these log lines:

{noformat}
Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka version : 0.11.0.1 
(org.apache.kafka.common.utils.AppInfoParser)
Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka commitId : c2a0d5f9b1f45bf5 
(org.apache.kafka.common.utils.AppInfoParser)
Oct 05 02:17:42 mybroker14 kafka[32940]: INFO [Kafka Server 10014], started 
(kafka.server.KafkaServer)
{noformat}

There was no indication in controller node logs that it picked up rebooted 
nodes. This happened multiple times during the upgrade: once per rack plus some 
on top of that.

Reboot took ~20m, all nodes in a single rack rebooted in parallel.

The fix was to restart controller node, but that did not go cleanly too:

{noformat}
ivan@mybroker26:~$ sudo journalctl --since 01:00 -u kafka | fgrep 'Error during 
controlled shutdown' -A1
Oct 05 01:57:41 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error 
during controlled shutdown, possibly because leader movement took longer than 
the configured controller.socket.timeout.ms and/or request.timeout.ms: 
Connection to 10026 was disconnected before the response was read 
(kafka.server.KafkaServer)
Oct 05 01:57:46 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying 
controlled shutdown after the previous attempt failed... 
(kafka.server.KafkaServer)
--
Oct 05 01:58:16 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error 
during controlled shutdown, possibly because leader movement took longer than 
the configured controller.socket.timeout.ms and/or request.timeout.ms: 
Connection to 10026 was disconnected before the response was read 
(kafka.server.KafkaServer)
Oct 05 01:58:18 mybroker26 kafka[37409]: INFO Rolled new log segment for 
'requests-40' in 3 ms. (kafka.log.Log)
--
Oct 05 01:58:51 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error 
during controlled shutdown, possibly because leader movement took longer than 
the configured controller.socket.timeout.ms and/or request.timeout.ms: 
Connection to 10026 was disconnected before the response was read 
(kafka.server.KafkaServer)
Oct 05 01:58:56 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying 
controlled shutdown after the previous attempt failed... 
(kafka.server.KafkaServer)
{noformat}

I'm unable to reproduce the issue by just restarting or even rebooting one 
broker, controller picks it up:

{noformat}
Oct 05 03:18:18 mybroker83 kafka[37402]: INFO [Controller 10083]: Newly added 
brokers: 10001, deleted brokers: , all live brokers: ...
{noformat}

KAFKA-5028 happened in 0.11.0.0, so it's likely related.

cc [~ijuma]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to