rondagostino commented on code in PR #13759:
URL: https://github.com/apache/kafka/pull/13759#discussion_r1207013139


##########
metadata/src/main/java/org/apache/kafka/controller/BrokerHeartbeatManager.java:
##########
@@ -223,6 +283,21 @@ public BrokerHeartbeatState next() {
         }
     }
 
+    /**
+     * The maximum number of timed out heartbeats to count.
+     */
+    static final int DEFAULT_TIMED_OUT_HEARTBEAT_COUNT_MAX = 1000;
+
+    /**
+     * The time period over which to track timed out heartbeats.
+     */
+    static final long DEFAULT_TIMED_OUT_HEARTBEAT_COUNT_WINDOW_NS = TimeUnit.MINUTES.toNanos(5);
+
+    /**
+     * The number of heartbeats to notice missing before we go into overload.
+     */
+    static final int DEFAULT_TIMED_OUT_HEARTBEAT_OVERLOAD_THRESHOLD = 3;
+

Review Comment:
   Thinking about this some more, I believe the key requirement is to 
accurately understand when we do not have a correct view of the cluster.  This 
allows us to handle 2 important cases: not fencing a broker when its session 
times out but we think we may have missed enough heartbeats that the decision 
to fence could be wrong; and fencing a broker when its session times out and 
we believe our information is accurate.  I think I addressed the second part 
(detect and fence a crashed broker as quickly as possible) in the comment 
above: I believe we have accurate information if we see a contiguous series of 
N successfully-processed heartbeats with no intervening timed-out heartbeats, 
where N is perhaps the cluster size.  For the first part (don't fence if we 
think the decision to do so could be wrong), assume the broker session is 18 
seconds and the heartbeat interval is 2 seconds.  That means we would need to 
miss 9 heartbeats for a broker in order to incorrectly fence it.  Maybe we 
keep track of the last time we had enough contiguous successful heartbeats 
(which, if we aren't missing any, would always be very recent).  As soon as we 
miss one, we increment the missed count and reset the contiguous count to 0.  
When we successfully process a heartbeat we increment the contiguous count 
and, once it reaches the necessary threshold N (which is on the order of the 
cluster size), we reset the missed count to 0.  We stop fencing brokers once 
the missed count approaches the session/interval ratio (i.e. 18/2 = 9 by 
default).
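   A minimal sketch of the bookkeeping described above, assuming a dedicated 
tracker class (the class and method names here are illustrative, not actual 
Kafka code):

   ```java
   // Hypothetical sketch: track contiguous successful heartbeats and
   // timed-out heartbeats to decide whether fencing is currently safe.
   class HeartbeatVisibilityTracker {
       private final int contiguousThreshold;  // N, on the order of the cluster size
       private final int missedFenceThreshold; // ~session/interval ratio (e.g. 18/2 = 9)
       private int contiguousCount = 0;
       private int missedCount = 0;

       HeartbeatVisibilityTracker(int contiguousThreshold, int missedFenceThreshold) {
           this.contiguousThreshold = contiguousThreshold;
           this.missedFenceThreshold = missedFenceThreshold;
       }

       // A heartbeat timed out: count the miss and reset the contiguous run.
       void onTimedOutHeartbeat() {
           missedCount++;
           contiguousCount = 0;
       }

       // A heartbeat was processed: extend the contiguous run; once it
       // reaches N, we trust our view again and forget the misses.
       void onProcessedHeartbeat() {
           contiguousCount++;
           if (contiguousCount >= contiguousThreshold) {
               missedCount = 0;
           }
       }

       // Fencing is safe only while the missed count stays below the threshold.
       boolean maySafelyFence() {
           return missedCount < missedFenceThreshold;
       }
   }
   ```

   With N = 3 and a fence threshold of 9, nine timed-out heartbeats would 
block fencing until three heartbeats are processed in a row.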
   
   We can tweak this to be a bit more conservative.  Maybe we need N (the 
number of contiguous heartbeats seen that assures us we have good visibility) 
to be 1.5 or 2 times the broker count instead of just the broker count.  Maybe 
the missed count only has to exceed half the session/interval ratio (so 
missing only 5 heartbeats without seeing N successfully-processed ones in a 
row would prevent fencing, instead of 9 by default).
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org