Re: [PR] KAFKA-15221; Fix the race between fetch requests from a rebooted follower. [kafka]
CalvinConfluent commented on PR #14053: URL: https://github.com/apache/kafka/pull/14053#issuecomment-1758611548 @splett2 @hachikuji Ticket created https://issues.apache.org/jira/browse/KAFKA-15590 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
hachikuji merged PR #14053:
URL: https://github.com/apache/kafka/pull/14053
CalvinConfluent commented on PR #14053:
URL: https://github.com/apache/kafka/pull/14053#issuecomment-1756411287

Thanks @hachikuji, I verified that the failed unit test passes locally.
CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1352848836

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
Allowing updates with higher broker epochs may have one problem: could a malfunctioning or buggy broker fetch with a much higher broker epoch, so that the leader has to restart to get out of that state? Other than this, I don't see a problem with allowing a higher broker epoch.
hachikuji commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1351005643

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
I was actually debating it. Do we create a race on the leader for a restarted broker? A restarted broker will typically not be in the ISR, so perhaps a delay for propagation of the registration state would not have any adverse effects. What do you think?
CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350811556

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
Makes sense. A fetch request with a higher epoch should also be fenced.
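To make the comparison under discussion concrete, here is a minimal sketch of the two candidate fencing checks written as pure functions. `EpochFencing`, `isStale`, and `isMismatch` are hypothetical names that do not appear in the PR. `isStale` mirrors the condition shown in the diff; `isMismatch` is only one possible shape for a check that also fences higher epochs, and its treatment of a cached `-1` as "unknown" is an assumption. The thread does not show which form was adopted.

```scala
// Hypothetical helper for illustration only; not part of the PR.
object EpochFencing {
  // Condition as written in the diff: fence only when the leader's
  // metadata cache holds a strictly newer epoch than the fetch request.
  def isStale(requestEpoch: Long, cachedEpoch: Option[Long]): Boolean =
    requestEpoch != -1L && cachedEpoch.exists(_ > requestEpoch)

  // Assumed stricter variant: fence any mismatch, higher or lower.
  // A cached epoch of -1 is treated as "unknown" and never fences (assumption).
  def isMismatch(requestEpoch: Long, cachedEpoch: Option[Long]): Boolean =
    requestEpoch != -1L && cachedEpoch.exists(c => c != -1L && c != requestEpoch)
}
```

With a cached epoch of 5, a fetch carrying epoch 7 (for example, from a restarted follower whose new registration has not yet reached the leader's cache) passes `isStale` but is rejected by `isMismatch`. That difference is the trade-off the reviewers are weighing.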
CalvinConfluent commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350811307

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])

Review Comment:
Thanks!
hachikuji commented on code in PR #14053:
URL: https://github.com/apache/kafka/pull/14053#discussion_r1350736079

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])

Review Comment:
nit: in Scala, it's usually cleaner to use a `match` instead of `isInstanceOf`.

## core/src/main/scala/kafka/cluster/Replica.scala: ##
@@ -98,14 +105,22 @@ class Replica(val brokerId: Int, val topicPartition: TopicPartition) extends Log
    * fetch request is always smaller than the leader's LEO, which can happen if small produce requests are received at
    * high frequency.
    */
-  def updateFetchState(
+  def updateFetchStateOrThrow(
     followerFetchOffsetMetadata: LogOffsetMetadata,
     followerStartOffset: Long,
     followerFetchTimeMs: Long,
     leaderEndOffset: Long,
     brokerEpoch: Long
   ): Unit = {
     replicaState.updateAndGet { currentReplicaState =>
+      val cachedBrokerEpoch = if (metadataCache.isInstanceOf[KRaftMetadataCache])
+        metadataCache.asInstanceOf[KRaftMetadataCache].getAliveBrokerEpoch(brokerId) else Option(-1L)
+      // Fence the update if it provides a stale broker epoch.
+      if (brokerEpoch != -1 && cachedBrokerEpoch.exists(_ > brokerEpoch)) {

Review Comment:
Should we check for equality? I guess the basic question is whether we allow fetches from a higher epoch than what is in the cache.
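The `match` suggestion in the nit above could look roughly like the sketch below. The trait and classes here are simplified stand-ins defined only so the snippet compiles on its own; they are not Kafka's real `MetadataCache` hierarchy, and `CachedEpochLookup` is a hypothetical name. Only `getAliveBrokerEpoch(brokerId)` and the `Option(-1L)` fallback are taken from the diff.

```scala
// Simplified stand-ins for illustration; not Kafka's actual classes.
trait MetadataCache
final class KRaftMetadataCache(aliveEpochs: Map[Int, Long]) extends MetadataCache {
  def getAliveBrokerEpoch(brokerId: Int): Option[Long] = aliveEpochs.get(brokerId)
}
final class ZkMetadataCache extends MetadataCache

object CachedEpochLookup {
  // Equivalent to the isInstanceOf/asInstanceOf expression in the diff:
  // the type test and the cast collapse into a single typed pattern.
  def cachedBrokerEpoch(cache: MetadataCache, brokerId: Int): Option[Long] =
    cache match {
      case kraft: KRaftMetadataCache => kraft.getAliveBrokerEpoch(brokerId)
      case _                         => Option(-1L)
    }
}
```

The typed pattern binds `kraft` with the narrowed type, so no explicit cast is needed and the fallback branch stays visible as its own case.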