[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552107#comment-17552107 ]

Luke Chen commented on KAFKA-13959:
---

So I think the proposed solution should fix the issue:

One solution to this problem is to require the broker to only catch up to the last committed offset when it last sent the heartbeat. For example:
# Broker sends a heartbeat with current offset of {{Y}}. The last commit offset is {{X}}. The controller remembers this last commit offset, call it {{X'}}.
# Broker sends another heartbeat with current offset of {{Z}}. Unfence the broker if {{Z >= X}} or {{Z >= X'}}.

> Controller should unfence Broker with busy metadata log
> ---
>
> Key: KAFKA-13959
> URL: https://issues.apache.org/jira/browse/KAFKA-13959
> Project: Kafka
> Issue Type: Bug
> Components: kraft
> Affects Versions: 3.3.0
> Reporter: Jose Armando Garcia Sancio
> Priority: Blocker
>
> https://issues.apache.org/jira/browse/KAFKA-13955 showed that it is possible
> for the controller to not unfence a broker if the committed offset keeps
> increasing.
>
> One solution to this problem is to require the broker to only catch up to the
> last committed offset when it last sent the heartbeat. For example:
> # Broker sends a heartbeat with current offset of {{Y}}. The last commit
> offset is {{X}}. The controller remembers this last commit offset, call it
> {{X'}}.
> # Broker sends another heartbeat with current offset of {{Z}}. Unfence
> the broker if {{Z >= X}} or {{Z >= X'}}.
>
> This change should also set the default for MetadataMaxIdleIntervalMs back to
> 500.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
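The two-heartbeat rule proposed in the issue could be sketched roughly as follows. This is an illustration only, not the controller's actual code: the class `UnfenceTracker`, the method `shouldUnfence`, and the per-broker map are hypothetical names, and the sketch assumes the controller knows its current last committed offset whenever a heartbeat arrives.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the rule in the issue description; not Kafka's real controller code.
final class UnfenceTracker {
    // X': the last committed offset the controller saw at each broker's previous heartbeat.
    private final Map<Integer, Long> committedAtLastHeartbeat = new HashMap<>();

    /**
     * @param brokerId        id of the broker sending the heartbeat
     * @param brokerOffset    Z, the offset reported in this heartbeat
     * @param committedOffset X, the controller's last committed offset right now
     * @return true if the broker may be unfenced
     */
    boolean shouldUnfence(int brokerId, long brokerOffset, long committedOffset) {
        Long previous = committedAtLastHeartbeat.get(brokerId); // null on the first heartbeat
        boolean caughtUpNow = brokerOffset >= committedOffset;                     // Z >= X
        boolean caughtUpToPrevious = previous != null && brokerOffset >= previous; // Z >= X'
        // Remember the current committed offset for the next heartbeat.
        committedAtLastHeartbeat.put(brokerId, committedOffset);
        return caughtUpNow || caughtUpToPrevious;
    }
}
```

With a broker that always trails the committed offset by one (for example 27/28, then 27/28, then 28/29), the first two heartbeats are rejected, and the third is accepted because 28 has reached the remembered offset of 28.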
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552106#comment-17552106 ]

Luke Chen commented on KAFKA-13959:
---

[~dengziming] [~jagsancio], I did some investigation today, and here are my findings:
# The broker heartbeat to the active controller does not fetch any data or advance the offset. The broker just sends its current offset and some broker info to the controller. So even a small heartbeat interval won't help.
# So when does the broker offset increase? Only in the broker's metadataListener. The metadataListener listens to the raftClient, and the raftClient polls metadata from the active controller.
# When is the high watermark updated on the active controller? Appending a record to the active controller's log does not by itself update the high watermark. The high watermark advances only after voters fetch the record from the active controller. For example: the active controller's high watermark is 9 and a record is appended to its log at offset 10. The controller waits until voters send fetch requests, then commits the offset 10 record and updates the high watermark to 10. This ensures the record is replicated to a majority of the voters.
# That explains what we saw in the issue:
## The active controller writes a no-op message to the metadata topic and appends it to its log, but does not update the high watermark (still 9).
## The broker's raftClient fetches records from the active controller.
## The active controller returns the records up to offset 9, and then updates the high watermark to 10.
## The broker's metadataListener processes the records.
## The broker sends a heartbeat to the active controller with offset 9.
## Since offset 9 is less than the active controller's high watermark of 10, the broker stays fenced.
## The broker keeps trying, and in the meantime another no-op message is written, which takes us back to step 1.
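Point 3 above can be illustrated with a simplified high-watermark calculation. This is only a sketch under assumptions: `HighWatermarkSketch` and `highWatermark` are hypothetical names, offsets are treated as log end offsets, and KRaft's real leader state tracks considerably more than this. The idea is that the high watermark is the largest offset that a majority of voters (leader included) have reached, so a leader append alone does not move it unless the leader is the only voter.

```java
import java.util.Arrays;

// Simplified illustration of a majority-based high watermark; not KRaft's actual code.
final class HighWatermarkSketch {
    /**
     * @param voterEndOffsets log end offset of every voter, including the leader
     * @return the largest offset reached by a majority of the voters
     */
    static long highWatermark(long[] voterEndOffsets) {
        if (voterEndOffsets.length == 0) {
            throw new IllegalArgumentException("at least one voter is required");
        }
        long[] sorted = voterEndOffsets.clone();
        Arrays.sort(sorted); // ascending order
        int majority = sorted.length / 2 + 1;
        // The 'majority' largest entries are all >= this element.
        return sorted[sorted.length - majority];
    }
}
```

With three voters, a leader at end offset 11 and two followers at 10 give a high watermark of 10. Once one follower fetches up to 11, the high watermark moves to 11. With a single voter the high watermark follows the leader's log immediately, which would be consistent with the observation in this thread that the problem disappears when numberControllerNodes is 1.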
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551663#comment-17551663 ]

Jose Armando Garcia Sancio commented on KAFKA-13959:

[~dengziming], if you haven't already, looking at the KRaft side of the implementation may help, especially the LEOs reported by KRaft for both the controller and the broker(s), any pending FETCH request(s), and how often brokers send FETCH requests.
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551534#comment-17551534 ]

dengziming commented on KAFKA-13959:

I haven't found the root cause yet. I printed the brokerOffset and controllerOffset on each heartbeat and found that every time the brokerOffset bumps, the controllerOffset bumps as well.
```
time: 1654679131904 broker 0 brokerOffset:27 controllerOffset:28
time: 1654679132115 broker 0 brokerOffset:27 controllerOffset:28
time: 1654679132381 broker 0 brokerOffset:28 controllerOffset:29
time: 1654679132592 broker 0 brokerOffset:28 controllerOffset:29
time: 1654679132878 broker 0 brokerOffset:29 controllerOffset:30
time: 1654679133089 broker 0 brokerOffset:29 controllerOffset:30
time: 1654679133299 broker 0 brokerOffset:30 controllerOffset:31
time: 1654679133509 broker 0 brokerOffset:30 controllerOffset:31
```
I tried increasing the heartbeat interval but got the same result, and if I set numberControllerNodes to 1, the problem disappears. I think this may be related to how we compute the leader HW and follower HW.
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550943#comment-17550943 ]

Luke Chen commented on KAFKA-13959:
---

Sorry, I didn't see your last sentence. Thanks for the investigation! Looking forward to knowing the root cause! :)
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550941#comment-17550941 ]

Luke Chen commented on KAFKA-13959:
---

[~dengziming], if the heartbeat interval is 10 ms, how can the broker fail to catch up with no-op records that are written only every 500 ms?
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550912#comment-17550912 ]

dengziming commented on KAFKA-13959:

While BrokerLifecycleManager is starting up, it sends a heartbeat every 10 milliseconds rather than every 2000 milliseconds:

`scheduleNextCommunication(NANOSECONDS.convert(10, MILLISECONDS))`

That is already much shorter than 500 ms, so the cause of this bug is more complex. I need more time to investigate.
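To make the intervals in this comment concrete, here is a small sketch of the unit conversion and of how many startup heartbeats fit into one no-op interval. `TimeUnit.convert` is the standard `java.util.concurrent` call; the class and constant names are made up for illustration, and the 500 ms figure is the MetadataMaxIdleIntervalMs value mentioned in the issue description.

```java
import java.util.concurrent.TimeUnit;

// Illustrative constants only; the names are hypothetical and not taken from Kafka.
final class HeartbeatIntervals {
    // Startup heartbeat delay, converted the same way as in the quoted call.
    static final long STARTUP_HEARTBEAT_NS =
        TimeUnit.NANOSECONDS.convert(10, TimeUnit.MILLISECONDS);

    // Interval between no-op records, assuming MetadataMaxIdleIntervalMs = 500.
    static final long NO_OP_INTERVAL_NS =
        TimeUnit.NANOSECONDS.convert(500, TimeUnit.MILLISECONDS);

    // Number of startup heartbeats that fit into one no-op interval.
    static long heartbeatsPerNoOpInterval() {
        return NO_OP_INTERVAL_NS / STARTUP_HEARTBEAT_NS;
    }
}
```

Under these assumptions the broker sends about 50 heartbeats for every no-op record, which supports the point that heartbeat frequency alone does not explain why the broker stays fenced.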
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550173#comment-17550173 ]

Luke Chen commented on KAFKA-13959:
---

Thanks [~dengziming]!
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548215#comment-17548215 ]

dengziming commented on KAFKA-13959:

I will take a look at this if no one is assigned.
[jira] [Commented] (KAFKA-13959) Controller should unfence Broker with busy metadata log
[ https://issues.apache.org/jira/browse/KAFKA-13959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17546695#comment-17546695 ]

Jose Armando Garcia Sancio commented on KAFKA-13959:

[~dengziming] [~showuon] Are you interested in working on this issue?