[ 
https://issues.apache.org/jira/browse/KAFKA-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Eisele updated KAFKA-6714:
------------------------------
    Description: 
In our Kafka cluster we experienced a situation in which the Kafka controller 
has all brokers marked as "Shutting down", although only one broker has 
actually been shut down.

The last broker-state log entry before the one reporting that all brokers are 
shutting down says that no brokers are shutting down.

As a consequence of this inconsistent state, the Kafka controller is not able 
to elect any partition leader.
{code:java}
kafka.controller Log (Level TRACE):
[2018-03-15 16:28:24,288] INFO [Controller 5]: Shutting down broker 5 (kafka.controller.KafkaController)
[2018-03-15 16:28:24,288] DEBUG [Controller 5]: All shutting down brokers: 5 (kafka.controller.KafkaController)
[2018-03-15 16:28:24,288] DEBUG [Controller 5]: Live brokers: 1,2,3,4 (kafka.controller.KafkaController)
...
[2018-03-15 16:28:36,846] INFO [Controller 3]: Currently active brokers in the cluster: Set(1, 2, 3, 4) (kafka.controller.KafkaController)
[2018-03-15 16:28:36,846] INFO [Controller 3]: Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)
...
[2018-03-19 17:57:22,273] INFO [Controller 3]: Shutting down broker 1 (kafka.controller.KafkaController)
[2018-03-19 17:57:22,273] DEBUG [Controller 3]: All shutting down brokers: 1,5,2,3,4 (kafka.controller.KafkaController)
[2018-03-19 17:57:22,273] DEBUG [Controller 3]: Live brokers:  (kafka.controller.KafkaController)
{code}
{code:java}
state.change.logger Log (Level TRACE):
[2018-03-19 17:57:22,275] ERROR Controller 3 epoch 83 encountered error while electing leader for partition [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] due to: No other replicas in ISR 1,3,5 for [zughaltphase_v3_intern_intern_partitioned_by_evanummer,6] besides shutting down brokers 1,5,2,3,4. (state.change.logger)
{code}
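The error above follows from simple set arithmetic: every member of the ISR is also in the shutting-down set, so no candidate is left. The following is a simplified illustration of that reasoning, not Kafka's actual leader-selection code; {{IsrCandidates}} and {{electableReplicas}} are hypothetical names introduced only for this sketch.

```scala
object IsrCandidates {
  // Hypothetical helper: ISR members that are not marked as shutting down.
  def electableReplicas(isr: Seq[Int], shuttingDown: Set[Int]): Seq[Int] =
    isr.filterNot(shuttingDown.contains)

  def main(args: Array[String]): Unit = {
    val isr = Seq(1, 3, 5)                  // ISR taken from the ERROR entry
    val shuttingDown = Set(1, 5, 2, 3, 4)   // shutting-down set from the same entry

    // State observed in the log: every ISR member is excluded.
    println(electableReplicas(isr, shuttingDown))

    // State we would expect: only broker 1 is shutting down.
    println(electableReplicas(isr, Set(1)))
  }
}
```

With only broker 1 in the shutting-down set, brokers 3 and 5 would remain electable, which is why the corrupted set blocks leader election for every partition.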
The question is why the Kafka controller assumes that all brokers are shutting 
down.

The only place we found in the Kafka code (0.11.0.2) where the set of shutting 
down brokers is changed is in the class _kafka.controller.KafkaController_, at 
line 1407 in the method _doControlledShutdown_.
{code:java}
info("Shutting down broker " + id)

if (!controllerContext.liveOrShuttingDownBrokerIds.contains(id))
  throw new BrokerNotAvailableException("Broker id %d does not exist.".format(id))

controllerContext.shuttingDownBrokerIds.add(id)
{code}
However, if this were the only code path, we would expect to see the log entry 
"Shutting down broker n" for every broker in the log file, but it is not there.
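One way to check this observation mechanically is to compare the brokers listed in "All shutting down brokers" entries with the brokers that have their own "Shutting down broker n" entry. The following is a minimal sketch, assuming the controller log format shown above with each entry on a single line (not wrapped as in this mail); {{ShutdownLogCheck}} and {{unexplainedBrokers}} are hypothetical names, not part of Kafka.

```scala
object ShutdownLogCheck {
  // Matches the INFO entry written when a controlled shutdown is requested.
  private val Requested = """Shutting down broker (\d+)""".r
  // Matches the DEBUG entry that lists the whole shutting-down set.
  private val AllShuttingDown = """All shutting down brokers: ([\d,]+)""".r

  // Brokers that appear in an "All shutting down brokers" entry without
  // a "Shutting down broker n" entry of their own in the given lines.
  def unexplainedBrokers(lines: Seq[String]): Set[Int] = {
    val requested = lines.flatMap { l =>
      Requested.findFirstMatchIn(l).map(_.group(1).toInt).toSeq
    }.toSet
    val listed = lines.flatMap { l =>
      AllShuttingDown.findFirstMatchIn(l).toSeq
        .flatMap(_.group(1).split(",").map(_.toInt).toSeq)
    }.toSet
    listed -- requested
  }
}
```

The input should probably be limited to entries written by the current controller (here Controller 3), because the log above shows an empty shutting-down set right after the controller changed. Any broker id the function returns was added to the set without a logged shutdown request.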

This is a recurring problem, but we have not been able to reproduce it deliberately.


> KafkaController marks all Brokers as "Shutting down", though only one broker 
> has been shut down
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6714
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6714
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core
>    Affects Versions: 0.11.0.2
>         Environment: Kafka cluster on Amazon AWS EC2 r4.2xlarge instances 
> with 5 nodes and a Zookeeper cluster on r4.2xlarge instances with 3 nodes. 
> The cluster is distributed across 2 availability zones.
>            Reporter: Uwe Eisele
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
