Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-11 Thread via GitHub


CalvinConfluent commented on PR #14053:
URL: https://github.com/apache/kafka/pull/14053#issuecomment-1758611548

   @splett2 @hachikuji Ticket created 
https://issues.apache.org/jira/browse/KAFKA-15590


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-11 Thread via GitHub


hachikuji merged PR #14053:
URL: https://github.com/apache/kafka/pull/14053





Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-10 Thread via GitHub


CalvinConfluent commented on PR #14053:
URL: https://github.com/apache/kafka/pull/14053#issuecomment-1756411287

   Thanks @hachikuji, I verified that the failed unit test passes locally.





Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-10 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1352848836


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
   Allowing updates with higher broker epochs may have one problem: could a 
malfunctioning or buggy broker fetch with a much higher broker epoch, so that 
the leader has to restart to get out of that state? Other than this, I don't 
see a problem with allowing a higher broker epoch.






Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-09 Thread via GitHub


hachikuji commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1351005643


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
   I was actually debating it. Do we create a race on the leader for a 
restarted broker? A restarted broker will typically not be in the ISR, so 
perhaps a delay for propagation of the registration state would not have any 
adverse effects. What do you think?






Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-09 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350811556


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
   Makes sense. A fetch request with a higher epoch should also be fenced.
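
A minimal sketch of what fencing on any epoch mismatch could look like inside an atomic update. It assumes the replica state lives in a `java.util.concurrent.atomic.AtomicReference`; `FetchState`, `FencedFetchException`, and `ReplicaSketch` are hypothetical names invented for this illustration and are not taken from the PR.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical, simplified replica state; the real class tracks more fields.
final case class FetchState(logEndOffset: Long, brokerEpoch: Long)

// Hypothetical exception type used only for this sketch.
final class FencedFetchException(msg: String) extends RuntimeException(msg)

// cachedEpoch stands in for the metadata cache lookup of the follower's epoch.
final class ReplicaSketch(cachedEpoch: () => Option[Long]) {
  private val state = new AtomicReference(FetchState(0L, -1L))

  def current: FetchState = state.get()

  // Throws from inside the update function when the epochs disagree, so the
  // stored state is left untouched for a fenced request.
  def updateFetchStateOrThrow(fetchOffset: Long, brokerEpoch: Long): Unit = {
    state.updateAndGet { cur =>
      if (brokerEpoch != -1L && cachedEpoch().exists(_ != brokerEpoch))
        throw new FencedFetchException(
          s"request epoch $brokerEpoch does not match cached epoch ${cachedEpoch()}")
      cur.copy(logEndOffset = fetchOffset, brokerEpoch = brokerEpoch)
    }
    ()
  }
}
```

Because the exception escapes the update function before a new value is returned, a fenced fetch never replaces the previously stored state.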






Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-09 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350811307


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])

Review Comment:
   Thanks!






Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]

2023-10-09 Thread via GitHub


hachikuji commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350736079


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])

Review Comment:
   nit: in Scala, it's usually cleaner to use a match instead of `isInstanceOf`.
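
For illustration, a minimal sketch of the suggested match. The traits and classes below are hypothetical stand-ins for the real metadata cache types (only `KRaftMetadataCache` and `getAliveBrokerEpoch` appear in the diff; `OtherMetadataCache` and `EpochLookup` are invented here), so this shows the shape of the refactor rather than the merged code.

```scala
// Hypothetical stand-ins for the real metadata cache types; only the shape matters here.
trait MetadataCache

final class KRaftMetadataCache(aliveBrokerEpochs: Map[Int, Long]) extends MetadataCache {
  def getAliveBrokerEpoch(brokerId: Int): Option[Long] = aliveBrokerEpochs.get(brokerId)
}

// Hypothetical non-KRaft cache, used to exercise the fallback branch.
final class OtherMetadataCache extends MetadataCache

object EpochLookup {
  // A single pattern match replaces the isInstanceOf check plus the asInstanceOf cast.
  def cachedBrokerEpoch(metadataCache: MetadataCache, brokerId: Int): Option[Long] =
    metadataCache match {
      case kraft: KRaftMetadataCache => kraft.getAliveBrokerEpoch(brokerId)
      case _ => Option(-1L) // same fallback as the diff for non-KRaft caches
    }
}
```

The match binds the narrowed type to `kraft` in one step, so the type test and the cast cannot drift apart.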



##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
   Should we check for equality? I guess the basic question is whether we allow 
fetches from a higher epoch than what is in the cache?
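
To make the question concrete, here is a small sketch that puts the two candidate checks side by side. `EpochFencing`, `fenceIfStale`, and `fenceIfNotEqual` are hypothetical names chosen for this illustration; only the `-1` sentinel and the `exists(_ > brokerEpoch)` condition come from the diff.

```scala
object EpochFencing {
  // -1 means the fetch request carries no broker epoch (value taken from the diff).
  val NoEpoch: Long = -1L

  // Behaviour in the diff: fence only when the cache holds a newer epoch than the request.
  def fenceIfStale(requestEpoch: Long, cachedEpoch: Option[Long]): Boolean =
    requestEpoch != NoEpoch && cachedEpoch.exists(_ > requestEpoch)

  // Alternative raised in review: fence on any mismatch, including a higher request epoch.
  def fenceIfNotEqual(requestEpoch: Long, cachedEpoch: Option[Long]): Boolean =
    requestEpoch != NoEpoch && cachedEpoch.exists(_ != requestEpoch)
}
```

With a cached epoch of 7, a request carrying epoch 9 passes the first check but is fenced by the second. One caveat under these assumptions: the diff falls back to `Option(-1L)` for non-KRaft caches, which the equality variant would treat as a mismatch, so that fallback would need to become `None` or be guarded separately.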


