[GitHub] [kafka] jsancio commented on a change in pull request #10278: KAFKA-10526: leader fsync deferral on write
jsancio commented on a change in pull request #10278: URL: https://github.com/apache/kafka/pull/10278#discussion_r596219371 ## File path: raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java ## @@ -1876,7 +1876,7 @@ private long maybeAppendBatches( BatchAccumulator.CompletedBatch batch = iterator.next(); appendBatch(state, batch, currentTimeMs); } -flushLeaderLog(state, currentTimeMs); +//flushLeaderLog(state, currentTimeMs); Review comment: We are tracking the LEO in two places: 1. In `ReplicatedLog::endOffset`. This gets increase every time the log gets appended: https://github.com/apache/kafka/blob/28ee656081d5b7984c324c3ea3fc9c34614d17db/core/src/main/scala/kafka/log/Log.scala#L1302 2. The `LeaderState` also stores what is now the LEO. One suggestion is for `LeaderState` to instead store the "flush offsets". In `LeaderState` the follower's flush offset is the LEO but for the local replica the "flush offset" may not be the LEO. An example of the high-watermark increasing but the LEO not changing: 1. follower: LEO = 10 2. leader: LEO = 100, FlushOffset = 100, HW = 0 Follower successfully fetches for offset 10 => Leader: LEO = 100, FlushOffset = 100, HW = 10. Follower successfully fetches for offset 20 => Leader: LEO = 100, FlushOffset = 100, HW = 20. In this example if the leader already flushed to the LEO then there is no need to flush again when increasing the HW. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on a change in pull request #10278: KAFKA-10526: leader fsync deferral on write
jsancio commented on a change in pull request #10278: URL: https://github.com/apache/kafka/pull/10278#discussion_r593988386 ## File path: raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java ## @@ -1876,7 +1876,7 @@ private long maybeAppendBatches( BatchAccumulator.CompletedBatch batch = iterator.next(); appendBatch(state, batch, currentTimeMs); } -flushLeaderLog(state, currentTimeMs); +//flushLeaderLog(state, currentTimeMs); Review comment: I think it makes sense to move this flush to `KafkaRaftClient::onUpdateLeaderHighWatermark`. It is possible for the high-watermark to increase without the LEO increasing. To avoid unnecessary flush calls, the leader should remember in `LeaderState` the LEO when it flushed the log. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on a change in pull request #10278: KAFKA-10526: leader fsync deferral on write
jsancio commented on a change in pull request #10278: URL: https://github.com/apache/kafka/pull/10278#discussion_r593419637 ## File path: raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java ## @@ -1876,7 +1876,7 @@ private long maybeAppendBatches( BatchAccumulator.CompletedBatch batch = iterator.next(); appendBatch(state, batch, currentTimeMs); } -flushLeaderLog(state, currentTimeMs); +//flushLeaderLog(state, currentTimeMs); Review comment: The invariant that the leader most satisfy is that the `highWatermark <= flushOffset`. The current implementation satisfies this by flushing after every append and implicitly defining `flushOffset == logEndOffset`. At a high-level, I think the goals is to allow `highWatermark <= flushOffset <= logEndOffset`. On the follower, things are a little different. On the follower the `flushOffset == logEndOffset` before a `Fetch` request can be sent. This is because the leader assumes that the fetch offset in the `Fetch` request is the offset that the follower has successfully replicated. The advantage of appending without flushing as soon as possible replication latency. The leader cannot replicate record batches to the followers and observers until they have been appended to the log. I am not exactly sure how exactly we want to implement this since I haven't looked at the details but I think you are correct that on the leader side of things we want to increase the `flushOffset` in the `Fetch` request handling code as the leader attempts to increase the high-watermark. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org