[ 
https://issues.apache.org/jira/browse/KAFKA-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319273#comment-15319273
 ] 

Maysam Yabandeh edited comment on KAFKA-3693 at 6/10/16 4:00 AM:
-----------------------------------------------------------------

*update:* the statement "controller is started AFTER the broker 16 has started 
shutting down" in this comment is incorrect. Refer to this 
[comment|https://issues.apache.org/jira/browse/KAFKA-3693?focusedCommentId=15315055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15315055]
 for the updated shutdown time.

Thanks for the grep hints, [~junrao]. According to the controller logs, the 
controller did not label broker 16 as dead:
{code}
$ grep "Newly added brokers" controller.log.2016-05-10.2 
2016-05-10 05:58:30,633 INFO 
controller.ReplicaStateMachine$BrokerChangeListener: [BrokerChangeListener on 
Controller 17]: Newly added brokers: 18, deleted brokers: , all live brokers: 
14,20,21,13,17,22,18,16,19,15
2016-05-10 06:17:48,512 INFO 
controller.ReplicaStateMachine$BrokerChangeListener: [BrokerChangeListener on 
Controller 17]: Newly added brokers: , deleted brokers: 14,21,13,22,18,19, all 
live brokers: 15,16
2016-05-10 06:29:57,981 INFO 
controller.ReplicaStateMachine$BrokerChangeListener: [BrokerChangeListener on 
Controller 17]: Newly added brokers: 14,21,13,22,18,19, deleted brokers: , all 
live brokers: 14,21,13,22,18,16,19,15
{code}
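For reference, here is a minimal sketch (my assumption based on the log format 
above, not the actual BrokerChangeListener code) of how the added/deleted sets 
in these lines can be derived: diff the broker ids currently registered under 
/brokers/ids in ZooKeeper against the controller's current live set.
{code}
object BrokerSetDiff {
  // Returns (newly added, deleted) broker ids given the controller's live set
  // and the ids currently registered in ZooKeeper.
  def diff(liveBrokers: Set[Int], registeredInZk: Set[Int]): (Set[Int], Set[Int]) =
    (registeredInZk -- liveBrokers, liveBrokers -- registeredInZk)

  def main(args: Array[String]): Unit = {
    // Shaped after the 06:29:57 line: the live set was 15,16 and the registered
    // ids are the new full set, so everything else shows up as "newly added".
    val (added, deleted) = diff(Set(15, 16), Set(14, 21, 13, 22, 18, 16, 19, 15))
    println(s"Newly added brokers: ${added.mkString(",")}, deleted brokers: ${deleted.mkString(",")}")
  }
}
{code}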

I am still not sure by what mechanism the controller is supposed to mark the 
broker dead, but it seems that the controller is started AFTER the broker 16 
has started shutting down, so if this mechanism depends on a message from the 
broker to the controller, that message might not have been received by the new 
controller:
{code}
$ grep "Controller starting up\|Controller-17-to-broker-16-send-thread" 
controller.log.2016-05-10.2 
2016-05-10 05:50:20,645 INFO controller.KafkaController: [Controller 17]: 
Controller starting up
2016-05-10 05:57:27,440 INFO controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Starting 
2016-05-10 05:57:28,309 INFO controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Controller 17 connected to Node(16, 
node16.com, 9092) for sending state change requests
2016-05-10 06:17:01,771 WARN controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Controller 17 epoch 269 fails to send 
request ... to broker Node(16, node16.com, 9092). Reconnecting to broker.
2016-05-10 06:17:02,076 WARN controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Controller 17's connection to broker 
Node(16, node16.com, 9092) was unsuccessful
...
2016-05-10 06:17:33,043 WARN controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Controller 17's connection to broker 
Node(16, node16.com, 9092) was unsuccessful
2016-05-10 06:17:33,344 INFO controller.RequestSendThread: 
[Controller-17-to-broker-16-send-thread], Controller 17 connected to Node(16, 
node16.com, 9092) for sending state change requests
{code}
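As a side note, a simplified sketch (again an assumption, not the actual 
controller.RequestSendThread code) of the retry behavior these WARN lines 
suggest: after a failed send the thread drops the connection and keeps retrying 
the same broker with a short backoff until a connect succeeds again, which here 
only happens at 06:17:33.
{code}
object RetryLoopSketch {
  // Keep trying to connect and send until the send goes through.
  def deliver(tryConnect: () => Boolean, trySend: () => Boolean, backoffMs: Long): Unit = {
    var sent = false
    while (!sent) {
      if (!tryConnect()) {
        // corresponds to "connection to broker ... was unsuccessful"
        Thread.sleep(backoffMs)
      } else if (!trySend()) {
        // corresponds to "fails to send request ... Reconnecting to broker."
      } else {
        sent = true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    var attempts = 0
    // Toy run: the first two connect attempts fail, the third succeeds.
    deliver(() => { attempts += 1; attempts >= 3 }, () => true, backoffMs = 1L)
    println(s"connected after $attempts attempts")
  }
}
{code}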
Regarding the slow shutdown, that is also an open question for us, but 
investigating it had lower priority than the loss of the highwatermark 
checkpoints.



> Race condition between highwatermark-checkpoint thread and 
> handleLeaderAndIsrRequest at broker start-up
> -------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3693
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.1
>            Reporter: Maysam Yabandeh
>
> Upon broker start-up, a race between the highwatermark-checkpoint thread 
> (which writes the replication-offset-checkpoint file) and the 
> handleLeaderAndIsrRequest thread (which reads from it) causes the 
> highwatermark for some partitions to be reset to 0. In the good case, this 
> results in the replica truncating its entire log to 0 and hence fetching 
> terabytes of data from the leader broker, which sometimes leads to hours of 
> downtime. In the bad cases we observed, the reset offset can propagate to the 
> recovery-point-offset-checkpoint file, making a leader broker truncate the 
> file. This seems to have the potential to lead to data loss if the truncation 
> happens at both the follower and the leader brokers.
> This is the particular faulty scenario that manifested in our tests:
> # The broker restarts and receives LeaderAndIsr requests from the controller
> # The LeaderAndIsr message, however, does not contain all the partitions 
> (probably because other brokers were churning at the same time)
> # becomeLeaderOrFollower calls getOrCreatePartition and updates allPartitions 
> with the partitions included in the LeaderAndIsr message: {code}
>   def getOrCreatePartition(topic: String, partitionId: Int): Partition = {
>     var partition = allPartitions.get((topic, partitionId))
>     if (partition == null) {
>       allPartitions.putIfNotExists((topic, partitionId), new Partition(topic, partitionId, time, this))
> {code}
> # The highwatermark-checkpoint thread jumps in, taking a snapshot of the 
> (partial) allReplicas' high watermarks into the replication-offset-checkpoint 
> file, hence overwriting the previous highwatermarks: {code}
>   def checkpointHighWatermarks() {
>     val replicas = allPartitions.values.map(_.getReplica(config.brokerId)).collect{case Some(replica) => replica}
> {code}
> # Later becomeLeaderOrFollower calls makeLeaders and makeFollowers, which read 
> the (now partial) file through Partition::getOrCreateReplica (see the sketch 
> after this list): {code}
>           val checkpoint = replicaManager.highWatermarkCheckpoints(log.dir.getParentFile.getAbsolutePath)
>           val offsetMap = checkpoint.read
>           if (!offsetMap.contains(TopicAndPartition(topic, partitionId)))
>             info("No checkpointed highwatermark is found for partition [%s,%d]".format(topic, partitionId))
> {code}
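> A condensed, self-contained sketch of steps 4-5 (illustrative Scala, not Kafka 
> code; names are simplified): writing a checkpoint from a partially populated 
> allPartitions replaces the whole file, so a partition that is not yet 
> registered loses its stored highwatermark and a later read falls back to 0. {code}
> object CheckpointRaceSketch {
>   type TopicPartition = (String, Int)
>
>   // The on-disk replication-offset-checkpoint, modeled as a map that every
>   // checkpoint pass rewrites wholesale.
>   var checkpointFile: Map[TopicPartition, Long] =
>     Map(("topicA", 0) -> 123456L, ("topicB", 0) -> 987654L)
>
>   // allPartitions so far only holds what the partial LeaderAndIsr carried.
>   val allPartitions: Map[TopicPartition, Long] = Map(("topicA", 0) -> 123456L)
>
>   // Step 4: the checkpoint thread snapshots the partial map, dropping topicB-0.
>   def checkpointHighWatermarks(): Unit = { checkpointFile = allPartitions }
>
>   // Step 5: no checkpointed highwatermark is found, so it falls back to 0.
>   def readHighWatermark(tp: TopicPartition): Long = checkpointFile.getOrElse(tp, 0L)
>
>   def main(args: Array[String]): Unit = {
>     checkpointHighWatermarks()
>     println(readHighWatermark(("topicB", 0))) // prints 0 instead of 987654
>   }
> }
> {code}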
> We are not entirely sure whether the initial LeaderAndIsr message including 
> only a subset of the partitions is critical to making this race condition 
> manifest. But it is an important detail, since it clarifies that a solution 
> based on not letting the highwatermark-checkpoint thread jump into the middle 
> of processing a LeaderAndIsr message would not suffice.
> The solution we are thinking of is to force-initialize allPartitions with the 
> partitions listed in the replication-offset-checkpoint file (and perhaps the 
> recovery-point-offset-checkpoint file too) when a server starts.
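> As a rough illustration of this idea (hypothetical names, not a patch), the 
> in-memory map would be pre-seeded from whatever the checkpoint file already 
> lists before any LeaderAndIsr is processed, so a checkpoint pass over a 
> partially populated map could no longer shrink the file: {code}
> object PreSeedFromCheckpoint {
>   type TopicPartition = (String, Int)
>
>   // Hypothetical helper: carry every already-checkpointed highwatermark over
>   // into the in-memory map at startup, before handleLeaderAndIsrRequest
>   // populates it.
>   def preSeed(checkpointed: Map[TopicPartition, Long],
>               allPartitions: scala.collection.mutable.Map[TopicPartition, Long]): Unit =
>     checkpointed.foreach { case (tp, hw) =>
>       if (!allPartitions.contains(tp))
>         allPartitions.put(tp, hw) // keep the previously checkpointed value
>     }
> }
> {code}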
> Thoughts?


