[ https://issues.apache.org/jira/browse/KAFKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402622#comment-15402622 ]

Jun Rao commented on KAFKA-1211:
--------------------------------

The following is a draft proposal. [~fpj], does that look reasonable to you?

1. In every log directory, we create a new leader-generation-checkpoint file, 
where we store the sequence (LGS) of leader generations together with the start 
offset of the messages produced in each generation.
2. When a replica becomes a leader, it first adds the leader generation and the 
log end offset of the replica to the end of the leader-generation-checkpoint 
file and flushes the file. It then remembers its last leader generation (LLG) 
and becomes the leader.
3. When a replica becomes a follower, it takes the following steps.
  3.1 Send a new RetrieveLeaderGeneration request for the partition to the 
leader.
  3.2 The leader responds with its LGS in the RetrieveLeaderGeneration response.
  3.3 The follower finds the first leader generation whose start offset differs 
between its local LGS and the leader's LGS. It then truncates its local log to 
the smaller of the two start offsets for that generation, if needed (a sketch 
of this follows the list below).
  3.4 The follower flushes the LGS from the leader to its local 
leader-generation-checkpoint file and also remembers the expected LLG from the 
leader's LGS.
  3.5 The follower starts fetching from the leader, starting from its own log 
end offset.
  3.5.1 During fetching, we extend the FetchResponse to add a new field per 
partition for the LLG in the leader.
  3.5.2 If the returned LLG in the FetchResponse does not match its expected 
LLG, the follower goes back to 3.1. (This can only happen if the leader changes 
more than once between two consecutive fetch requests and should be rare. We 
could also just stop the follower and wait for the next become-follower 
request from the controller.)
  3.5.3 Otherwise, the follower proceeds to append the fetched data to its 
local log in the normal way.
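
To make this concrete, here is a rough sketch in Scala (not actual Kafka code; 
LeaderGenerationEntry, onBecomeLeader and truncationOffset are just 
illustrative names) of the checkpoint entries from step 1, the append in step 
2, and the truncation rule in step 3.3:

// One line in the leader-generation-checkpoint file: a leader generation and
// the start offset of the messages produced in that generation.
case class LeaderGenerationEntry(generation: Int, startOffset: Long)

object LeaderGenerationCheckpoint {

  // Step 2: when a replica becomes a leader, it appends its new generation
  // with its current log end offset to the LGS (and then flushes the file).
  def onBecomeLeader(lgs: Seq[LeaderGenerationEntry],
                     newGeneration: Int,
                     logEndOffset: Long): Seq[LeaderGenerationEntry] =
    lgs :+ LeaderGenerationEntry(newGeneration, logEndOffset)

  // Step 3.3: find the first generation (in generation order) whose start
  // offset differs between the follower's LGS and the leader's LGS, and return
  // the smaller of the two start offsets as the truncation point. None means
  // no truncation is needed. Generations present on only one side are ignored
  // in this sketch.
  def truncationOffset(localLgs: Seq[LeaderGenerationEntry],
                       leaderLgs: Seq[LeaderGenerationEntry]): Option[Long] = {
    val leaderStartOffsets = leaderLgs.map(e => e.generation -> e.startOffset).toMap
    localLgs.sortBy(_.generation).collectFirst {
      case local if leaderStartOffsets.get(local.generation).exists(_ != local.startOffset) =>
        math.min(local.startOffset, leaderStartOffsets(local.generation))
    }
  }
}

A None result means the two sequences agree, so the follower can start fetching 
from its log end offset without truncating.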

Implementation-wise, we probably need to extend ReplicaFetchThread to maintain 
additional state per partition. When a partition is added to a 
ReplicaFetchThread, it needs to go through steps 3.1 to 3.4 before it starts 
fetching data.
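
As a rough illustration (hypothetical names only, not the actual 
ReplicaFetchThread code), the per-partition state could look something like 
this:

object ReplicaFetchStateSketch {

  sealed trait PartitionFetchState

  // Steps 3.1-3.4: the partition still has to fetch the leader's LGS, truncate
  // its log and record the expected LLG before any data is fetched.
  case object TruncatingToLeaderGeneration extends PartitionFetchState

  // Step 3.5: normal fetching; expectedLlg is the last leader generation taken
  // from the leader's LGS, checked against the LLG field added to the
  // FetchResponse.
  case class Fetching(expectedLlg: Int, fetchOffset: Long) extends PartitionFetchState

  // Step 3.5.2: if the LLG returned in the FetchResponse does not match the
  // expected one, fall back to the truncation phase (i.e. go back to step 3.1).
  def onFetchResponse(state: Fetching, returnedLlg: Int): PartitionFetchState =
    if (returnedLlg != state.expectedLlg) TruncatingToLeaderGeneration
    else state
}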

> Hold the produce request with ack > 1 in purgatory until replicas' HW has 
> larger than the produce offset
> --------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1211
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1211
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Assignee: Guozhang Wang
>             Fix For: 0.11.0.0
>
>
> Today during leader failover we have a window of weakness while the 
> followers truncate their data before fetching from the new leader, i.e., the 
> number of in-sync replicas is just 1. If the leader also fails during this 
> time, then produce requests with ack > 1 that have already been responded to 
> will still be lost. To avoid this scenario we would prefer to hold the 
> produce request in purgatory until the replicas' HW is larger than the 
> produce offset, instead of just their log end offsets.



