[ 
https://issues.apache.org/jira/browse/KAFKA-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai reopened KAFKA-15230:
------------------------------------

Reopening to mark it as a duplicate.

> ApiVersions data between controllers is not reliable
> ----------------------------------------------------
>
>                 Key: KAFKA-15230
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15230
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: David Arthur
>            Assignee: Colin McCabe
>            Priority: Critical
>             Fix For: 3.7.0
>
>
> While testing ZK migrations, I noticed a case where the controller was not 
> starting the migration because ApiVersions data from other controllers was 
> missing. This was unexpected because the quorum was running and the 
> followers were replicating the metadata log as expected. A heap dump of the 
> leader confirmed that the ApiVersions map of NodeApiVersions was empty.
>  
> After further investigation and offline discussion with [~jsancio], we 
> realized that after the initial leader election, the connection from the Raft 
> leader to the followers will become idle and eventually time out and close. 
> This causes NetworkClient to purge the NodeApiVersions data for the closed 
> connections.
>  
> There are two main side effects of this behavior:
> 1) If migrations are not started within the idle timeout period (10 minutes 
> by default), they can no longer be started. After this timeout period, I 
> was unable to restart the controllers in such a way that the leader had 
> active connections with all followers.
> 2) Dynamically updating features, such as "metadata.version", is not 
> guaranteed to be safe.
>  
> There is a partial workaround for the migration issue. If we set 
> "connections.max.idle.ms" to -1, the Raft leader will never disconnect from 
> the followers. However, if a follower restarts, the leader will not 
> re-establish a connection.
>  
> The feature update issue has no safe workarounds.
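
The failure mode in the quoted report can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kafka's NetworkClient: the class IdlePurgeSketch, its methods, and the string standing in for NodeApiVersions are all hypothetical. It assumes, as the report describes, that closing a connection also drops the cached version data for that node, and that a negative idle timeout disables idle closing.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a leader's per-node version cache. Hypothetical; not Kafka code. */
final class IdlePurgeSketch {
    // nodeId -> timestamp of the last activity on that connection
    private final Map<Integer, Long> lastActiveMs = new HashMap<>();
    // nodeId -> cached version info (stand-in for NodeApiVersions)
    private final Map<Integer, String> apiVersions = new HashMap<>();
    // Idle timeout in ms; a negative value means "never close idle connections"
    private final long maxIdleMs;

    IdlePurgeSketch(long maxIdleMs) {
        this.maxIdleMs = maxIdleMs;
    }

    /** Open a connection and cache the versions the node advertised. */
    void connect(int nodeId, String versions, long nowMs) {
        lastActiveMs.put(nodeId, nowMs);
        apiVersions.put(nodeId, versions);
    }

    /** Close a connection (for example, the peer restarted) and purge its cache entry. */
    void disconnect(int nodeId) {
        lastActiveMs.remove(nodeId);
        apiVersions.remove(nodeId);
    }

    /** Close every connection idle for longer than maxIdleMs, purging cached versions. */
    void closeIdle(long nowMs) {
        if (maxIdleMs < 0) {
            return; // idle closing disabled
        }
        lastActiveMs.entrySet().removeIf(entry -> {
            boolean idle = nowMs - entry.getValue() > maxIdleMs;
            if (idle) {
                apiVersions.remove(entry.getKey());
            }
            return idle;
        });
    }

    /** True only if version data is cached for every listed node. */
    boolean hasVersionsFor(int... nodeIds) {
        for (int nodeId : nodeIds) {
            if (!apiVersions.containsKey(nodeId)) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, a leader built with a 10-minute timeout loses its cached entries once closeIdle runs after the timeout, so hasVersionsFor returns false even though the peers are healthy. A leader built with -1 keeps the entries through idle periods, but a disconnect call (the follower-restart case) still purges the entry and nothing in the model reconnects, which mirrors why the report calls the workaround partial.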



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
