[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1213269855 Build failed with test errors. Rerunning: ``` [Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.mirror.integration.IdentityReplicationIntegrationTest.testOffsetSyncsTopicsOnTarget()](https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12274/9/testReport/junit/org.apache.kafka.connect.mirror.integration/IdentityReplicationIntegrationTest/Build___JDK_8_and_Scala_2_12___testOffsetSyncsTopicsOnTarget__/) [Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testReplication()](https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12274/9/testReport/junit/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationBaseTest/Build___JDK_8_and_Scala_2_12___testReplication__/) [Build / JDK 8 and Scala 2.12 / kafka.server.KRaftClusterTest.testIncrementalAlterConfigs()](https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12274/9/testReport/junit/kafka.server/KRaftClusterTest/Build___JDK_8_and_Scala_2_12___testIncrementalAlterConfigs__/) [Build / JDK 11 and Scala 2.13 / org.apache.kafka.controller.QuorumControllerTest.testNoOpRecordWriteAfterTimeout()](https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-12274/9/testReport/junit/org.apache.kafka.controller/QuorumControllerTest/Build___JDK_11_and_Scala_2_13___testNoOpRecordWriteAfterTimeout__/) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1212113291 @dengziming take a look at the tests failures. All of the failures in `ClusterControlManagerTest` seems related to this change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1206920706 @dengziming I marked this issue as a blocker to 3.3.0. Let me know if you want to work on this issue. I don't mind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1189538831 > @jsancio I think the last question is that there may be more than one registration record for each broker after restarting it, so can we rely on the broker epoch? I think I need some time to check the logic before make a final decision. Yes. I think you can rely on broker epoch and broker id. Also the active controller is guaranteed to have read all of the records on the log before handling RPCs like heartbeat. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1187345352 > This is a good idea, however, the offset of each registration record is a soft state(not persist to metadata log) @dengziming The offset is persistent along with the record. This offset is provided to the inactive controllers in handleCommit. When processing handleCommit the inactive controllers can use the `Batch::lastOffset` of the batch that contains the registration record. When processing handleSnapshot the inactive controllers can use the `SnapshotReader::lastContainedLogOffset` of the snapshot if the snapshot contains registration records. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1183190708 > maybe we should allow the lastAppliedOffset to be behind by A records (or time) to unfence the broker. @dengziming Here are my thoughts ### Improve logic for unfencing brokers How about the controller unfence the broker when the broker's high-watermark has reached the broker registration record for that broker? The a controller first registers for a given incarnation, the controller write a broker registration record. The controller can remember this offset as returned by the raft client. The controller can unfence the broker when the broker's high-watermark is greater than this registration offset. ### Propagate the HWM to the replicas as quickly as possible. I think that the solution above it would allow us to de-prioritize this. Here are my observations anyways. Looking at the KafkaRaftClient implementation we would have to have an index for both the fetch offset and the last sent high-watermark for that replica. Another issue here is that we changed the KafkaRaftManager so that it doesn't set the replica id when it is an observer/broker. Since the HWM is not part of the Fetch request the leader would have to keep track of this in the LeaderState ``` val nodeId = if (config.processRoles.contains(ControllerRole)) { OptionalInt.of(config.nodeId) } else { OptionalInt.empty() } ``` We would need to find a better solution for https://issues.apache.org/jira/browse/KAFKA-13168 or improve the FETCH request so that it includes the HWM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1177949852 > I havn't flesh it, the basic idea is shown in this PR, only read up to Isolation.COMMITTED when reading from observer and invoke fetchPurgatory.maybeComplete(highWatermark.offset, currentTimeMs) on onUpdateLeaderHighWatermark. I see. This would not work for co-located Kafka servers, right? Co-located Kafka server are servers that are running both a controller and a broker. In that case the replica will read uncommitted data and the leader will not send a FETCH response when the HW changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1176530495 > In this PR I tried your suggestion and it does solve this problem, however, this will make the logic in RaftClient very complex and we need to save more states in LeaderState and it's also difficult to test @dengziming Do you have a diff for this solution? I am interested in this solution as it would work in both REMOTE and COLOCATED configuration for KRaft. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1176512896 > > here is the code change: https://github.com/dengziming/kafka/tree/KAFKA-13959-2 > > Hey @dengziming, I took at look at the commits in this tree. Is this the only commit [dengziming@79dc8ec](https://github.com/dengziming/kafka/commit/79dc8ec423cd74fba462e934f89bdec3dcd8528d)? Can you maybe share a diff/compare. For example, something like [dengziming/kafka@30216ea...KAFKA-13959-2](https://github.com/dengziming/kafka/compare/30216ea1c58761e62f51af40033f24e3ae1c5c2a...KAFKA-13959-2) Never mind. I understand now. The broker sends the active controller the local LEO instead of the last applied offset by the broker listener. I think this will unfence the broker at startup even if the broker hasn't applied the snapshot or any of the log records, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] jsancio commented on pull request #12274: KAFKA-13959: Controller should unfence Broker with busy metadata log
jsancio commented on PR #12274: URL: https://github.com/apache/kafka/pull/12274#issuecomment-1176510158 > here is the code change: https://github.com/dengziming/kafka/tree/KAFKA-13959-2 Hey @dengziming, I took at look at the commits in this tree. Is this the only commit https://github.com/dengziming/kafka/commit/79dc8ec423cd74fba462e934f89bdec3dcd8528d? Can you maybe share a diff/compare. For example, something like https://github.com/dengziming/kafka/compare/30216ea1c58761e62f51af40033f24e3ae1c5c2a...KAFKA-13959-2 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org