[ https://issues.apache.org/jira/browse/KAFKA-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-1647:
---------------------------------
    Priority: Critical  (was: Major)

> Replication offset checkpoints (high water marks) can be lost on hard kills
> and restarts
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1647
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1647
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>            Priority: Critical
>
> We ran into this scenario recently in a production environment. This can
> happen when enough brokers in a cluster are taken down; a rolling bounce
> done properly should not cause this issue. It can occur if all replicas
> for any partition are taken down.
> Here is a sample scenario:
> * Cluster of three brokers: b0, b1, b2
> * Two partitions (of some topic) with replication factor two: p0, p1
> * Initial state:
> ** p0: leader = b0, ISR = {b0, b1}
> ** p1: leader = b1, ISR = {b0, b1}
> * Do a parallel hard-kill of all brokers
> * Bring up b2, so it is the new controller
> * b2 initializes its controller context and populates its leader/ISR cache
> (i.e., controllerContext.partitionLeadershipInfo) from ZooKeeper. The last
> known leaders are b0 (for p0) and b1 (for p1)
> * Bring up b1
> * The controller's onBrokerStartup procedure initiates a replica state change
> for all replicas on b1 to become online. As part of this replica state change
> it gets the last known leader and ISR and sends a LeaderAndIsrRequest to b1
> (for p0 and p1). This LeaderAndIsr request contains: {{p0: leader=b0; p1:
> leader=b1;} leaders=b1}. b0 is indicated as the leader of p0 but it is not
> included in the leaders field because b0 is down.
> * On receiving the LeaderAndIsrRequest, b1's replica manager will
> successfully make b1 (itself) the leader for p1 (and create the local replica
> object corresponding to p1). It will however abort the become follower
> transition for p0 because the designated leader b0 is offline.
> So it will not create the local replica object for p0.
> * It will then start the high water mark checkpoint thread. Since only p1 has
> a local replica object, only p1's high water mark will be checkpointed to
> disk. p0's previously written checkpoint if any will be lost.
> So in summary it seems we should always create the local replica object even
> if the online transition does not happen.
> Possible symptoms of the above bug could be one or more of the following (we
> saw 2 and 3):
> # Data loss; yes on a hard-kill data loss is expected, but this can actually
> cause loss of nearly all data if the broker becomes follower, truncates, and
> soon after happens to become leader.
> # High IO on brokers that lose their high water mark then subsequently (on a
> successful become follower transition) truncate their log to zero and start
> catching up from the beginning.
> # If the offsets topic is affected, then offsets can get reset. This is
> because during an offset load we don't read past the high water mark. So if a
> water mark is missing then we don't load anything (even if the offsets are
> there in the log).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
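
The checkpoint loss in the scenario above can be illustrated with a small
standalone model. This is only a sketch and not Kafka's actual code: the class
HwCheckpointSketch, the method restartAndCheckpoint, and the high water mark
values 1000 and 2000 are all hypothetical. It assumes that the checkpoint
thread rewrites the whole checkpoint file from the set of local replica
objects, and that creating a local replica object picks up the previously
checkpointed high water mark.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of the scenario; none of these names exist in Kafka.
public class HwCheckpointSketch {

    /**
     * Models b1 handling a LeaderAndIsrRequest after a restart and then
     * running one high water mark checkpoint.
     *
     * @param onDisk            checkpoint file contents before the restart
     * @param assigned          partitions named in the request
     * @param leaderOnline      partitions whose designated leader is online
     * @param alwaysCreateLocal true models the fix suggested in the report
     * @return checkpoint file contents after the checkpoint runs
     */
    public static Map<String, Long> restartAndCheckpoint(
            Map<String, Long> onDisk,
            Set<String> assigned,
            Set<String> leaderOnline,
            boolean alwaysCreateLocal) {
        Map<String, Long> localReplicas = new TreeMap<>();
        for (String partition : assigned) {
            boolean transitionSucceeds = leaderOnline.contains(partition);
            if (transitionSucceeds || alwaysCreateLocal) {
                // Assumption: a new local replica object starts from the
                // previously checkpointed high water mark (0 if none).
                localReplicas.put(partition, onDisk.getOrDefault(partition, 0L));
            }
        }
        // Assumption: the checkpoint replaces the file wholesale, so any
        // partition without a local replica object drops out of it.
        return localReplicas;
    }

    public static void main(String[] args) {
        Map<String, Long> onDisk = new TreeMap<>();
        onDisk.put("p0", 1000L); // illustrative values only
        onDisk.put("p1", 2000L);
        Set<String> assigned = new HashSet<>(Arrays.asList("p0", "p1"));
        // p1's leader (b1) is up; p0's leader (b0) is still down.
        Set<String> leaderOnline = Collections.singleton("p1");

        // Behaviour described in the report: p0's entry disappears.
        System.out.println(restartAndCheckpoint(onDisk, assigned, leaderOnline, false));
        // With the suggested fix: p0's entry survives.
        System.out.println(restartAndCheckpoint(onDisk, assigned, leaderOnline, true));
    }
}
```

With alwaysCreateLocal set to false the model drops p0 from the checkpoint,
matching the lost high water mark described in the report. With it set to
true, p0 keeps its earlier value even though its become-follower transition
was aborted.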