[
https://issues.apache.org/jira/browse/KAFKA-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151955#comment-14151955
]
Joel Koshy commented on KAFKA-1647:
-----------------------------------
Yes, that is exactly why it happens, and the fix is relatively straightforward.
> Replication offset checkpoints (high water marks) can be lost on hard kills
> and restarts
> ----------------------------------------------------------------------------------------
>
> Key: KAFKA-1647
> URL: https://issues.apache.org/jira/browse/KAFKA-1647
> Project: Kafka
> Issue Type: Bug
> Reporter: Joel Koshy
> Priority: Critical
> Labels: newbie++
>
> We ran into this scenario recently in a production environment. It can
> happen when enough brokers in a cluster are taken down; that is, a rolling
> bounce done properly should not cause this issue, but it can occur if all
> replicas of any partition are taken down at once.
> Here is a sample scenario:
> * Cluster of three brokers: b0, b1, b2
> * Two partitions (of some topic) with replication factor two: p0, p1
> * Initial state:
> ** p0: leader = b0, ISR = {b0, b1}
> ** p1: leader = b1, ISR = {b0, b1}
> * Do a parallel hard-kill of all brokers
> * Bring up b2, so it is the new controller
> * b2 initializes its controller context and populates its leader/ISR cache
> (i.e., controllerContext.partitionLeadershipInfo) from ZooKeeper. The last
> known leaders are b0 (for p0) and b1 (for p1).
> * Bring up b1
> * The controller's onBrokerStartup procedure initiates a replica state change
> for all replicas on b1 to become online. As part of this replica state change
> it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1
> (for p0 and p1). This LeaderAndIsrRequest contains: {{p0: leader=b0; p1:
> leader=b1; leaders=[b1]}}. b0 is indicated as the leader of p0, but it is not
> included in the leaders field because b0 is down.
> * On receiving the LeaderAndIsrRequest, b1's replica manager will
> successfully make itself (b1) the leader for p1 (and create the local replica
> object corresponding to p1). It will, however, abort the become-follower
> transition for p0 because the designated leader b0 is offline, and so it will
> not create the local replica object for p0.
> * It will then start the high water mark checkpoint thread. Since only p1 has
> a local replica object, only p1's high water mark will be checkpointed to
> disk; p0's previously written checkpoint, if any, will be lost (see the
> sketch below).
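> Here is a minimal, hedged sketch of that checkpointing gap (the names only
> approximate the real ReplicaManager code and are not the actual
> implementation): a partition with no local replica object simply drops out
> of the rewritten checkpoint file.
> {code:scala}
> case class TopicAndPartition(topic: String, partition: Int)
>
> class Partition(val tp: TopicAndPartition) {
>   // In the real broker this exists only once a local replica object is created.
>   var localHighWatermark: Option[Long] = None
> }
>
> object HwmCheckpointSketch {
>   val allPartitions = scala.collection.mutable.Map.empty[TopicAndPartition, Partition]
>
>   // The checkpoint thread collects HWMs from local replicas only and
>   // overwrites the checkpoint file with exactly that set each interval.
>   def checkpointHighWatermarks(): Map[TopicAndPartition, Long] =
>     allPartitions.values
>       .flatMap(p => p.localHighWatermark.map(p.tp -> _))
>       .toMap
>
>   def main(args: Array[String]): Unit = {
>     // b1 made itself leader for p1, so a local replica object exists for p1.
>     val p1 = new Partition(TopicAndPartition("t", 1))
>     p1.localHighWatermark = Some(42L)
>     allPartitions(p1.tp) = p1
>     // The become-follower transition for p0 aborted (leader b0 is down),
>     // so no replica object was ever created for p0.
>     println(checkpointHighWatermarks()) // Map(TopicAndPartition(t,1) -> 42)
>     // p0 is absent, so the rewritten checkpoint file loses p0's old HWM.
>   }
> }
> {code}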
> So, in summary, it seems we should always create the local replica object
> even if the online transition does not happen.
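> A hedged sketch of that direction (illustrative only, not an actual patch;
> PartitionSketch and getOrCreateLocalReplica are made-up names): create the
> replica object before deciding whether the transition can proceed, so the
> recovered high water mark always survives to the next checkpoint.
> {code:scala}
> class PartitionSketch(val topic: String, val partitionId: Int) {
>   var highWatermark: Option[Long] = None
>   // Create the local replica if absent, recovering the HWM from the
>   // checkpoint file read at startup.
>   def getOrCreateLocalReplica(checkpointedHwm: Long): Unit =
>     if (highWatermark.isEmpty) highWatermark = Some(checkpointedHwm)
> }
>
> object MakeFollowerSketch {
>   def makeFollower(p: PartitionSketch, checkpointedHwm: Long,
>                    newLeader: Int, liveBrokers: Set[Int]): Boolean = {
>     p.getOrCreateLocalReplica(checkpointedHwm) // always create, before any abort
>     if (!liveBrokers.contains(newLeader))
>       false // leader offline: skip the transition, but the HWM now survives
>     else
>       true  // ... truncate to HWM and start fetching from newLeader ...
>   }
>
>   def main(args: Array[String]): Unit = {
>     val p0 = new PartitionSketch("t", 0)
>     // Leader b0 is down: the transition aborts, yet p0's HWM is retained
>     // and will be rewritten by the checkpoint thread.
>     makeFollower(p0, checkpointedHwm = 1000L, newLeader = 0, liveBrokers = Set(1, 2))
>     println(p0.highWatermark) // Some(1000)
>   }
> }
> {code}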
> Possible symptoms of the above bug could be one or more of the following (we
> saw 2 and 3):
> # Data loss: yes, some data loss is expected on a hard kill, but this can
> actually cause loss of nearly all data if the broker becomes a follower,
> truncates its log, and soon after happens to become the leader.
> # High IO on brokers that lose their high water mark and then (on a
> subsequent successful become-follower transition) truncate their log to zero
> and start catching up from the beginning.
> # If the offsets topic is affected, then offsets can get reset. This is
> because an offset load does not read past the high water mark, so if the
> high water mark is missing then nothing is loaded (even if the offsets are
> there in the log). See the sketch below.
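> To make symptom 3 concrete, here is a hedged sketch (approximate logic only;
> the real offset manager reads log segments, not an in-memory sequence) of
> how a missing high water mark makes the loader skip everything:
> {code:scala}
> object OffsetLoadSketch {
>   case class OffsetCommit(group: String, topic: String, partition: Int, offset: Long)
>
>   // The loader scans the offsets-topic partition only up to its high water
>   // mark; a lost checkpoint entry defaults the HWM to 0, so nothing loads.
>   def loadOffsets(log: IndexedSeq[OffsetCommit], hwm: Option[Long]): Seq[OffsetCommit] =
>     log.take(hwm.getOrElse(0L).toInt)
>
>   def main(args: Array[String]): Unit = {
>     val log = IndexedSeq(
>       OffsetCommit("g1", "t", 0, 100L),
>       OffsetCommit("g1", "t", 0, 250L))
>     println(loadOffsets(log, Some(2L))) // both commits load normally
>     println(loadOffsets(log, None))     // empty: consumer offsets look "reset"
>   }
> }
> {code}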