[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-09-21 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1333490449


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +103,20 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
* fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
* high frequency.
*/
-  def updateFetchState(
+  def updateFetchStateOrThrow(
 followerFetchOffsetMetadata: LogOffsetMetadata,
 followerStartOffset: Long,
 followerFetchTimeMs: Long,
 leaderEndOffset: Long,
 brokerEpoch: Long
   ): Unit = {
 replicaState.updateAndGet { currentReplicaState =>
+  // Fence the update if it provides a stale broker epoch.
+  if (verifyBrokerEpoch && brokerEpoch != -1 && currentReplicaState.brokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
   @hachikuji Thanks for the advice. Adjusted to use the metadata cache for the epoch validation.
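
A minimal sketch of what validating against a cache could look like, assuming the cache can be modeled as a lookup from broker id to the currently registered broker epoch. The names `EpochLookup` and `isStale` are hypothetical and are not taken from the PR; the comparison mirrors the condition in the hunk above.

```scala
object CacheEpochFenceSketch {
  // Hypothetical stand-in for the metadata cache: broker id -> registered epoch, if known.
  type EpochLookup = Int => Option[Long]

  // A request is stale only when it carries an epoch (not -1) and the cache
  // knows a strictly newer epoch for the same broker.
  def isStale(brokerId: Int, requestEpoch: Long, lookup: EpochLookup): Boolean =
    requestEpoch != -1L && lookup(brokerId).exists(_ > requestEpoch)
}
```

With this shape, a broker that is unknown to the cache is never fenced, because `exists` on `None` is false.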



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-09-14 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1326544485


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +103,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
* fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
* high frequency.
*/
-  def updateFetchState(
+  def updateFetchStateOrThrow(
 followerFetchOffsetMetadata: LogOffsetMetadata,
 followerStartOffset: Long,
 followerFetchTimeMs: Long,
 leaderEndOffset: Long,
-brokerEpoch: Long
+brokerEpoch: Long,
+verifyBrokerEpoch: Boolean = false

Review Comment:
   Done.






[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-09-14 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1326544274


##
core/src/main/scala/kafka/cluster/Partition.scala:
##
@@ -137,7 +137,8 @@ object Partition {
   delayedOperations = delayedOperations,
   metadataCache = replicaManager.metadataCache,
   logManager = replicaManager.logManager,
-  alterIsrManager = replicaManager.alterPartitionManager)
+  alterIsrManager = replicaManager.alterPartitionManager,
+  zkMigrationEnabled = () => replicaManager.config.migrationEnabled)

Review Comment:
   Done.






[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-09-12 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1323907175


##
core/src/main/scala/kafka/cluster/Partition.scala:
##
@@ -858,7 +859,7 @@ class Partition(val topicPartition: TopicPartition,
 // No need to calculate low watermark if there is no delayed DeleteRecordsRequest
 val oldLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
 val prevFollowerEndOffset = replica.stateSnapshot.logEndOffset
-replica.updateFetchState(
+replica.updateFetchStateOrThrow(

Review Comment:
   Thanks for the advice. I am changing this to use the read lock.



##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,14 +103,21 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
* fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
* high frequency.
*/
-  def updateFetchState(
+  def updateFetchStateOrThrow(
 followerFetchOffsetMetadata: LogOffsetMetadata,
 followerStartOffset: Long,
 followerFetchTimeMs: Long,
 leaderEndOffset: Long,
 brokerEpoch: Long
   ): Unit = {
 replicaState.updateAndGet { currentReplicaState =>
+  // Fence the update if it provides a stale broker epoch.
+  val expectedBrokerEpoch = currentReplicaState.brokerEpoch.getOrElse(-1L)
+  if (brokerEpoch != -1 && brokerEpoch < expectedBrokerEpoch) {
+throw Errors.NOT_LEADER_OR_FOLLOWER.exception(s"Received stale fetch state update. broker epoch=$brokerEpoch " +
+  s"vs expected=$expectedBrokerEpoch")
+  }

Review Comment:
   Thanks, updated
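
A minimal, self-contained sketch of the fence-or-throw pattern in the hunk above, assuming a simplified state with only two fields. `State`, `StaleEpochException`, and this `Replica` are hypothetical stand-ins; the real code throws via `Errors.NOT_LEADER_OR_FOLLOWER.exception(...)`. Keeping the previously known epoch when the request carries `-1` is a choice made for this sketch only.

```scala
import java.util.concurrent.atomic.AtomicReference

object FenceOrThrowSketch {
  // Hypothetical, reduced replica state: just the offset and the last seen broker epoch.
  final case class State(logEndOffset: Long, brokerEpoch: Option[Long])

  // Hypothetical exception type used only in this sketch.
  final class StaleEpochException(msg: String) extends RuntimeException(msg)

  final class Replica {
    private val state = new AtomicReference[State](State(0L, None))

    def snapshot: State = state.get()

    def updateFetchStateOrThrow(fetchOffset: Long, brokerEpoch: Long): Unit = {
      state.updateAndGet { current =>
        val expected = current.brokerEpoch.getOrElse(-1L)
        // -1 means the request carries no epoch, so it is never fenced.
        if (brokerEpoch != -1L && brokerEpoch < expected)
          throw new StaleEpochException(
            s"Received stale fetch state update. broker epoch=$brokerEpoch vs expected=$expected")
        // Sketch choice: keep the known epoch when the request has none.
        val nextEpoch = if (brokerEpoch == -1L) current.brokerEpoch else Some(brokerEpoch)
        State(fetchOffset, nextEpoch)
      }
      ()
    }
  }
}
```

Because the exception is thrown inside the update function, it propagates before any new value is stored, so the stored state is left unchanged when an update is fenced.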






[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-07-24 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1272855952


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,31 +101,39 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
* fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
* high frequency.
*/
-  def updateFetchState(
+  def maybeUpdateFetchState(
 followerFetchOffsetMetadata: LogOffsetMetadata,
 followerStartOffset: Long,
 followerFetchTimeMs: Long,
 leaderEndOffset: Long,
 brokerEpoch: Long
-  ): Unit = {
+  ): Boolean = {
+var updateSuccess = true
 replicaState.updateAndGet { currentReplicaState =>
-  val lastCaughtUpTime = if (followerFetchOffsetMetadata.messageOffset >= leaderEndOffset) {
-math.max(currentReplicaState.lastCaughtUpTimeMs, followerFetchTimeMs)
-  } else if (followerFetchOffsetMetadata.messageOffset >= currentReplicaState.lastFetchLeaderLogEndOffset) {
-math.max(currentReplicaState.lastCaughtUpTimeMs, currentReplicaState.lastFetchTimeMs)
+  // Fence the update if it provides a stale broker epoch.
+  if (brokerEpoch != -1 && brokerEpoch < currentReplicaState.brokerEpoch.getOrElse(-1L)) {
+updateSuccess = false

Review Comment:
   Does it require a KIP to add a new exception?






[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: KAFKA-15221; Fix the race between fetch requests from a rebooted follower.

2023-07-24 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1272852459


##
core/src/main/scala/kafka/cluster/Replica.scala:
##
@@ -98,31 +101,39 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
* fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
* high frequency.
*/
-  def updateFetchState(
+  def maybeUpdateFetchState(
 followerFetchOffsetMetadata: LogOffsetMetadata,
 followerStartOffset: Long,
 followerFetchTimeMs: Long,
 leaderEndOffset: Long,
 brokerEpoch: Long
-  ): Unit = {
+  ): Boolean = {
+var updateSuccess = true
 replicaState.updateAndGet { currentReplicaState =>
-  val lastCaughtUpTime = if (followerFetchOffsetMetadata.messageOffset >= leaderEndOffset) {
-math.max(currentReplicaState.lastCaughtUpTimeMs, followerFetchTimeMs)
-  } else if (followerFetchOffsetMetadata.messageOffset >= currentReplicaState.lastFetchLeaderLogEndOffset) {
-math.max(currentReplicaState.lastCaughtUpTimeMs, currentReplicaState.lastFetchTimeMs)
+  // Fence the update if it provides a stale broker epoch.
+  if (brokerEpoch != -1 && brokerEpoch < currentReplicaState.brokerEpoch.getOrElse(-1L)) {
+updateSuccess = false

Review Comment:
   Do we need a KIP for the extra exception?






[GitHub] [kafka] CalvinConfluent commented on a diff in pull request #14053: [KAFKA-15221] Fix the race between fetch requests from a rebooted follower.

2023-07-19 Thread via GitHub


CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1268783217


##
core/src/main/scala/kafka/cluster/Partition.scala:
##
@@ -1366,6 +1376,17 @@ class Partition(val topicPartition: TopicPartition,
   fetchParams.replicaId,
   fetchPartitionData
 )
+
+// Fence the fetch request with stale broker epoch from a rebooted follower.
+if (metadataCache.isInstanceOf[KRaftMetadataCache]) {
+  val brokerEpoch = fetchParams.replicaEpoch
+  val currentBrokerEpoch = replica.stateSnapshot.brokerEpoch.getOrElse(-1L)
+  if (brokerEpoch != -1 && brokerEpoch < currentBrokerEpoch) {
+throw new StaleBrokerEpochException(s"Received fetch request for $topicPartition with stale broker " +
+  s"epoch=$brokerEpoch. The expected broker epoch= $currentBrokerEpoch.")
+  }
+}

Review Comment:
   Makes sense. In that case we can just abort the fetch state update.
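
A rough sketch of aborting the update instead of throwing, assuming the replica state lives in an `AtomicReference` and the method reports whether the update was applied. `State` and this `Replica` are hypothetical simplifications; only the name `maybeUpdateFetchState` and the epoch comparison come from this review thread.

```scala
import java.util.concurrent.atomic.AtomicReference

object AbortUpdateSketch {
  // Hypothetical, reduced replica state.
  final case class State(logEndOffset: Long, brokerEpoch: Option[Long])

  final class Replica {
    private val state = new AtomicReference[State](State(0L, None))

    def snapshot: State = state.get()

    // Returns true when the update was applied, false when it was fenced.
    def maybeUpdateFetchState(fetchOffset: Long, brokerEpoch: Long): Boolean = {
      var applied = false
      state.updateAndGet { current =>
        val stale = brokerEpoch != -1L && brokerEpoch < current.brokerEpoch.getOrElse(-1L)
        // Assigned on every attempt, since the update function may be re-applied under contention.
        applied = !stale
        if (stale) current // abort: hand back the unchanged state
        else State(fetchOffset, if (brokerEpoch == -1L) current.brokerEpoch else Some(brokerEpoch))
      }
      applied
    }
  }
}
```

The caller can then skip any follow-up work (for example, re-evaluating ISR membership) when the method returns `false`.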


