[ 
https://issues.apache.org/jira/browse/KAFKA-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-3944:
-------------------------------
    Labels: reliability  (was: )

> After the broker restart, fetchers stopped due to a delayed controlled 
> shutdown message
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3944
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3944
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Maysam Yabandeh
>            Priority: Minor
>              Labels: reliability
>
> The symptom is that cluster reports under-replicated blocks and some replicas 
> do not seem to catch up ever. It turns out that the corresponding fetchers in 
> those brokers were stopped shortly after the broker's restart. The broker had 
> stopped the fetcher upon receiving stop-replica request from the controller. 
> The controller had issued those request upon processing controlled shutdown 
> request form the same broker. However those requests were all sent before the 
> broker restart but the controller processed them after. Here is the timeline:
> # broker sends controlled shutdown message to controller
> # the process fails and the broker proceeds with an unclean shutdown
> # the broker is restated
> # the controller processes the perviously sent controlled shutdown messages
> # the controller sends stop replica messages to the broker
> # the broker shuts down the fetchers, while it has no intent to shut down 
> again
> # this leads to under-replicated blocks
> Example from logs:
> {code}
> broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean 
> shutdown" server.log.2016-07-07.2 
> 2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying 
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying 
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying 
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], 
> Proceeding to do an unclean shutdown as all the controlled shutdown attempts 
> failed
> broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
> 2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values: 
> {code}
> {code}
> broker13.com:/var/log/kafka$ grep "Shutting down broker 19" 
> controller.log.2016-07-07.1 
> 2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: 
> Shutting down broker 19
> 2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: 
> Shutting down broker 19
> 2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: 
> Shutting down broker 19
> {code}
> which resulted into many stop replica request to broker 19:
> {code}
> broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) 
> sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
> 2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The 
> stop replica request (delete = false) sent to broker 19 is 
> [Topic=topic-xyz,Partition=6,Replica=19]
> {code}
> broker 19 processes them AFTER its restart:
> {code}
> broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for 
> partition .topic-xzy,3." state-change.log.2016-07-07.2 
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica 
> (delete=false) for partition [topic-xzy,3]
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop 
> replica (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica 
> (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop 
> replica (delete=false) for partition [topic-xyz,3]
> {code}
> and removes the fetchers:
> {code}
> broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" 
> server.log.2016-07-07.3 | tail -2
> 2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: 
> [ReplicaFetcherManager on broker 19] Removed fetcher for partitions 
> [topic-xyz,3]
> 2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: 
> [ReplicaFetcherManager on broker 19] Removed fetcher for partitions 
> [topic-xyz,3]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to