dengziming commented on a change in pull request #10481:
URL: https://github.com/apache/kafka/pull/10481#discussion_r609589579



##########
File path: raft/src/main/java/org/apache/kafka/raft/LeaderState.java
##########
@@ -113,13 +113,18 @@ private boolean updateHighWatermark() {
         Optional<LogOffsetMetadata> highWatermarkUpdateOpt = 
followersByDescendingFetchOffset.get(indexOfHw).endOffset;
 
         if (highWatermarkUpdateOpt.isPresent()) {
-            // When a leader is first elected, it cannot know the high 
watermark of the previous
-            // leader. In order to avoid exposing a non-monotonically 
increasing value, we have
-            // to wait for followers to catch up to the start of the leader's 
epoch.
+
+            // The KRaft protocol requires an extra condition on commitment 
after a leader
+            // election. The leader must commit one record from its own epoch 
before it is
+            // allowed to expose records from any previous epoch. This 
guarantees that its
+            // log will contain the largest record (in terms of epoch/offset) 
in any log
+            // which ensures that any future leader will have replicated this 
record as well
+            // as all records from previous epochs that the current leader has 
committed.
+
             LogOffsetMetadata highWatermarkUpdateMetadata = 
highWatermarkUpdateOpt.get();
             long highWatermarkUpdateOffset = 
highWatermarkUpdateMetadata.offset;
 
-            if (highWatermarkUpdateOffset >= epochStartOffset) {
+            if (highWatermarkUpdateOffset > epochStartOffset) {

Review comment:
       I was also confused by this logic for a while and I got an optimization 
idea from the [Raft 
dissertation](http://wcl.cs.rpi.edu/pilots/library/papers/consensus/RAFTOngaroPhD.pdf)
 about this, The sentence below is taken from section "5.4.2 Committing entries 
from previous terms":
   ```
   There are some situations where a leader could safely conclude that an older 
log entry is committed (for example, if that entry is stored on every server), 
but Raft takes a more conservative approach for simplicity.
   ```
   so we can also update commitIndex(highWatermark) if logEndOffset of all 
followers have passed the highWatermark.
   
   I don't think this is a good idea since it makes the logic opaque but will 
not necessarily really optimize any performance, so I just mention it here and 
feel free to ignore it 🙈.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to