[
https://issues.apache.org/jira/browse/KAFKA-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847898#comment-15847898
]
James Cheng commented on KAFKA-3944:
------------------------------------
Is this related to KAFKA-1120?
> After the broker restart, fetchers stopped due to a delayed controlled
> shutdown message
> ---------------------------------------------------------------------------------------
>
> Key: KAFKA-3944
> URL: https://issues.apache.org/jira/browse/KAFKA-3944
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: Maysam Yabandeh
> Priority: Minor
> Labels: reliability
>
> The symptom is that cluster reports under-replicated blocks and some replicas
> do not seem to catch up ever. It turns out that the corresponding fetchers in
> those brokers were stopped shortly after the broker's restart. The broker had
> stopped the fetcher upon receiving stop-replica request from the controller.
> The controller had issued those request upon processing controlled shutdown
> request form the same broker. However those requests were all sent before the
> broker restart but the controller processed them after. Here is the timeline:
> # broker sends controlled shutdown message to controller
> # the process fails and the broker proceeds with an unclean shutdown
> # the broker is restated
> # the controller processes the perviously sent controlled shutdown messages
> # the controller sends stop replica messages to the broker
> # the broker shuts down the fetchers, while it has no intent to shut down
> again
> # this leads to under-replicated blocks
> Example from logs:
> {code}
> broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean
> shutdown" server.log.2016-07-07.2
> 2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying
> controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19],
> Proceeding to do an unclean shutdown as all the controlled shutdown attempts
> failed
> broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
> 2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values:
> {code}
> {code}
> broker13.com:/var/log/kafka$ grep "Shutting down broker 19"
> controller.log.2016-07-07.1
> 2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]:
> Shutting down broker 19
> 2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]:
> Shutting down broker 19
> 2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]:
> Shutting down broker 19
> {code}
> which resulted into many stop replica request to broker 19:
> {code}
> broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false)
> sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
> 2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The
> stop replica request (delete = false) sent to broker 19 is
> [Topic=topic-xyz,Partition=6,Replica=19]
> {code}
> broker 19 processes them AFTER its restart:
> {code}
> broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for
> partition .topic-xzy,3." state-change.log.2016-07-07.2
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica
> (delete=false) for partition [topic-xzy,3]
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop
> replica (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica
> (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop
> replica (delete=false) for partition [topic-xyz,3]
> {code}
> and removes the fetchers:
> {code}
> broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3"
> server.log.2016-07-07.3 | tail -2
> 2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager:
> [ReplicaFetcherManager on broker 19] Removed fetcher for partitions
> [topic-xyz,3]
> 2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager:
> [ReplicaFetcherManager on broker 19] Removed fetcher for partitions
> [topic-xyz,3]
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)