[ https://issues.apache.org/jira/browse/KAFKA-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847898#comment-15847898 ]
James Cheng commented on KAFKA-3944: ------------------------------------ Is this related to KAFKA-1120? > After the broker restart, fetchers stopped due to a delayed controlled > shutdown message > --------------------------------------------------------------------------------------- > > Key: KAFKA-3944 > URL: https://issues.apache.org/jira/browse/KAFKA-3944 > Project: Kafka > Issue Type: Bug > Components: core > Reporter: Maysam Yabandeh > Priority: Minor > Labels: reliability > > The symptom is that cluster reports under-replicated blocks and some replicas > do not seem to catch up ever. It turns out that the corresponding fetchers in > those brokers were stopped shortly after the broker's restart. The broker had > stopped the fetcher upon receiving stop-replica request from the controller. > The controller had issued those request upon processing controlled shutdown > request form the same broker. However those requests were all sent before the > broker restart but the controller processed them after. Here is the timeline: > # broker sends controlled shutdown message to controller > # the process fails and the broker proceeds with an unclean shutdown > # the broker is restated > # the controller processes the perviously sent controlled shutdown messages > # the controller sends stop replica messages to the broker > # the broker shuts down the fetchers, while it has no intent to shut down > again > # this leads to under-replicated blocks > Example from logs: > {code} > broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean > shutdown" server.log.2016-07-07.2 > 2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying > controlled shutdown after the previous attempt failed... > 2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying > controlled shutdown after the previous attempt failed... > 2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying > controlled shutdown after the previous attempt failed... > 2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], > Proceeding to do an unclean shutdown as all the controlled shutdown attempts > failed > broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3 > 2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values: > {code} > {code} > broker13.com:/var/log/kafka$ grep "Shutting down broker 19" > controller.log.2016-07-07.1 > 2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: > Shutting down broker 19 > 2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: > Shutting down broker 19 > 2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: > Shutting down broker 19 > {code} > which resulted into many stop replica request to broker 19: > {code} > broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) > sent to broker 19 is" controller.log.2016-07-07.1 | tail -1 > 2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The > stop replica request (delete = false) sent to broker 19 is > [Topic=topic-xyz,Partition=6,Replica=19] > {code} > broker 19 processes them AFTER its restart: > {code} > broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for > partition .topic-xzy,3." state-change.log.2016-07-07.2 > 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica > (delete=false) for partition [topic-xzy,3] > 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop > replica (delete=false) for partition [topic-xyz,3] > 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica > (delete=false) for partition [topic-xyz,3] > 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop > replica (delete=false) for partition [topic-xyz,3] > {code} > and removes the fetchers: > {code} > broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" > server.log.2016-07-07.3 | tail -2 > 2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: > [ReplicaFetcherManager on broker 19] Removed fetcher for partitions > [topic-xyz,3] > 2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: > [ReplicaFetcherManager on broker 19] Removed fetcher for partitions > [topic-xyz,3] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)