[ 
https://issues.apache.org/jira/browse/KAFKA-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-1647:
---------------------------------
    Priority: Critical  (was: Major)

> Replication offset checkpoints (high water marks) can be lost on hard kills 
> and restarts
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1647
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1647
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>            Priority: Critical
>
> We ran into this scenario recently in a production environment. This can
> happen when enough brokers in a cluster are taken down that all replicas
> for some partition are offline; a rolling bounce done properly should not
> cause this issue.
> Here is a sample scenario:
> * Cluster of three brokers: b0, b1, b2
> * Two partitions (of some topic) with replication factor two: p0, p1
> * Initial state:
> ** p0: leader = b0, ISR = {b0, b1}
> ** p1: leader = b1, ISR = {b0, b1}
> * Do a parallel hard-kill of all brokers
> * Bring up b2, so it is the new controller
> * b2 initializes its controller context and populates its leader/ISR cache
> (i.e., controllerContext.partitionLeadershipInfo) from ZooKeeper. The last
> known leaders are b0 (for p0) and b1 (for p1)
> * Bring up b1
> * The controller's onBrokerStartup procedure initiates a replica state change
> for all replicas on b1 to become online. As part of this replica state change
> it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1
> (for p0 and p1). This LeaderAndIsrRequest contains: {{p0: leader=b0; p1:
> leader=b1;} leaders=b1}. b0 is indicated as the leader of p0 but it is not
> included in the leaders field because b0 is down.
> * On receiving the LeaderAndIsrRequest, b1's replica manager will
> successfully make itself (b1) the leader for p1 (and create the local replica
> object corresponding to p1). It will however abort the become-follower
> transition for p0 because the designated leader b0 is offline. So it will not
> create the local replica object for p0.
> * It will then start the high water mark checkpoint thread. Since only p1 has
> a local replica object, only p1's high water mark will be checkpointed to
> disk. p0's previously written checkpoint, if any, will be lost.
> So in summary, it seems we should always create the local replica object even
> if the online transition does not happen.
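
The failure described above can be reproduced in a toy model. The sketch below is not Kafka code: the function names, the `create_on_abort` flag, and the high water mark values (1000 and 2000) are made up for illustration, and it assumes the checkpoint is rewritten as a whole from the in-memory replica objects, which is what the description implies.

```python
def handle_leader_and_isr(broker_id, request, live_leaders, checkpoint,
                          create_on_abort=False):
    """Return the local replica map (partition -> high water mark).

    request:      partition -> designated leader, as sent by the controller
    live_leaders: the 'leaders' field of the request (only brokers that are up)
    checkpoint:   partition -> high water mark previously written to disk
    """
    local_replicas = {}
    for partition, leader in request.items():
        # Become leader, or become follower of a leader that is online.
        can_transition = leader == broker_id or leader in live_leaders
        # create_on_abort models the proposed fix: keep the replica object
        # even when the become-follower transition is aborted.
        if can_transition or create_on_abort:
            local_replicas[partition] = checkpoint.get(partition, 0)
    return local_replicas


def write_checkpoint(local_replicas):
    """Rewrite the whole checkpoint from the in-memory replicas only."""
    return dict(local_replicas)


# State on b1 after the hard kill (illustrative values).
on_disk = {"p0": 1000, "p1": 2000}
request = {"p0": "b0", "p1": "b1"}   # b0 is named leader of p0 but is down
live_leaders = {"b1"}

buggy = write_checkpoint(
    handle_leader_and_isr("b1", request, live_leaders, on_disk))
fixed = write_checkpoint(
    handle_leader_and_isr("b1", request, live_leaders, on_disk,
                          create_on_abort=True))

print(buggy)  # {'p1': 2000}
print(fixed)  # {'p0': 1000, 'p1': 2000}
```

In the first run p0 has no local replica object, so its entry disappears from the rewritten checkpoint. In the second run the entry survives even though b1 never became a follower of b0.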
> Possible symptoms of the above bug could be one or more of the following (we 
> saw 2 and 3):
> # Data loss. Some data loss is expected on a hard kill, but this can actually
> cause loss of nearly all data if the broker becomes follower, truncates, and
> soon after happens to become leader.
> # High IO on brokers that lose their high water mark and subsequently (on a
> successful become-follower transition) truncate their log to zero and start
> catching up from the beginning.
> # If the offsets topic is affected, then offsets can get reset. This is
> because during an offset load we don't read past the high water mark. So if
> the high water mark is missing then we don't load anything (even if the
> offsets are there in the log).
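
Symptoms 2 and 3 both follow from treating a lost high water mark as zero. A minimal sketch, assuming a missing checkpoint entry is read as a high water mark of 0 and modeling the log as a plain Python list; the function names and sample offsets are hypothetical and do not correspond to Kafka's actual classes:

```python
def truncate_to_high_watermark(log, high_watermark):
    """On becoming follower, keep only entries below the high water mark."""
    return log[:high_watermark]


def load_offsets(log, high_watermark):
    """Build the offsets cache, never reading past the high water mark."""
    cache = {}
    for group, offset in log[:high_watermark]:
        cache[group] = offset  # later entries overwrite earlier ones
    return cache


# Illustrative offsets-topic partition: (consumer group, committed offset).
offsets_log = [("group-a", 10), ("group-b", 7), ("group-a", 42)]

print(load_offsets(offsets_log, high_watermark=3))  # {'group-a': 42, 'group-b': 7}
print(load_offsets(offsets_log, high_watermark=0))  # {}
print(truncate_to_high_watermark(offsets_log, 0))   # []
```

With the correct high water mark the cache holds the latest commit for each group. With a high water mark of 0 the cache is empty even though the entries are still in the log, and a follower truncating to 0 discards the whole log and has to fetch it again from the leader.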



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
