David Arthur created KAFKA-15230:
------------------------------------

             Summary: ApiVersions data between controllers is not reliable
                 Key: KAFKA-15230
                 URL: https://issues.apache.org/jira/browse/KAFKA-15230
             Project: Kafka
          Issue Type: Bug
            Reporter: David Arthur

While testing ZK migrations, I noticed a case where the controller was not starting the migration because ApiVersions data from the other controllers was missing. This was unexpected because the quorum was running and the followers were replicating the metadata log as expected. A heap dump of the leader confirmed that the ApiVersions map of NodeApiVersions was in fact empty.

After further investigation and offline discussion with [~jsancio], we realized that after the initial leader election, the connection from the Raft leader to the followers becomes idle and eventually times out and closes. This causes NetworkClient to purge the NodeApiVersions data for the closed connections.

There are two main side effects of this behavior:

1) If migrations are not started within the idle timeout period (10 minutes, by default), then they cannot be started. After this timeout period, I was unable to restart the controllers in such a way that the leader had active connections with all followers.

2) Dynamically updating features, such as "metadata.version", is not guaranteed to be safe.

There is a partial workaround for the migration issue. If we set "connections.max.idle.ms" to -1, the Raft leader will never disconnect from the followers. However, if a follower restarts, the leader will not re-establish a connection.

The feature update issue has no safe workarounds.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
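
To make the failure mode described above easier to follow, here is a minimal toy model in Java. It is not Kafka code: the class `LeaderView` and all of its methods are hypothetical names invented for illustration, and the version data is reduced to a plain string. It only sketches the reported behavior, assuming that the leader keeps per-node version data solely while a connection is open and that a check such as "start the migration" needs data for every other quorum member. It does not model the follower-restart case, where the leader does not reconnect.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model (not Kafka code) of a leader that caches per-node ApiVersions data. */
class ApiVersionsPurgeSketch {

    /** Hypothetical stand-in for the Raft leader's view of its peers. */
    static final class LeaderView {
        private final Set<Integer> quorumNodeIds;
        private final int localNodeId;
        private final long maxIdleMs; // a negative value disables idle disconnects
        // nodeId -> timestamp of the last activity on an open connection
        private final Map<Integer, Long> lastActiveMs = new HashMap<>();
        // nodeId -> advertised version data, kept only while the connection is open
        private final Map<Integer, String> apiVersions = new HashMap<>();

        LeaderView(Set<Integer> quorumNodeIds, int localNodeId, long maxIdleMs) {
            this.quorumNodeIds = quorumNodeIds;
            this.localNodeId = localNodeId;
            this.maxIdleMs = maxIdleMs;
        }

        /** Records version data learned when a connection is established. */
        void onApiVersionsResponse(int nodeId, String versions, long nowMs) {
            lastActiveMs.put(nodeId, nowMs);
            apiVersions.put(nodeId, versions);
        }

        /** Closes connections idle for longer than maxIdleMs and purges their data. */
        void closeIdleConnections(long nowMs) {
            if (maxIdleMs < 0) {
                return; // idle timeout disabled, nothing is ever closed
            }
            lastActiveMs.entrySet().removeIf(entry -> {
                boolean idle = nowMs - entry.getValue() > maxIdleMs;
                if (idle) {
                    apiVersions.remove(entry.getKey()); // the purge described in the report
                }
                return idle;
            });
        }

        /** True only if version data is present for every other quorum member. */
        boolean hasApiVersionsForAllPeers() {
            for (int nodeId : quorumNodeIds) {
                if (nodeId != localNodeId && !apiVersions.containsKey(nodeId)) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        long tenMinutesMs = 10 * 60 * 1000L;

        LeaderView leader = new LeaderView(Set.of(1, 2, 3), 1, tenMinutesMs);
        leader.onApiVersionsResponse(2, "followerVersions", 0L);
        leader.onApiVersionsResponse(3, "followerVersions", 0L);
        System.out.println(leader.hasApiVersionsForAllPeers()); // true

        leader.closeIdleConnections(tenMinutesMs + 1);
        System.out.println(leader.hasApiVersionsForAllPeers()); // false

        // Same sequence with the idle timeout disabled (-1).
        LeaderView neverIdle = new LeaderView(Set.of(1, 2, 3), 1, -1L);
        neverIdle.onApiVersionsResponse(2, "followerVersions", 0L);
        neverIdle.onApiVersionsResponse(3, "followerVersions", 0L);
        neverIdle.closeIdleConnections(tenMinutesMs + 1);
        System.out.println(neverIdle.hasApiVersionsForAllPeers()); // true
    }
}
```

In this model the second check fails even though nothing is wrong with the quorum itself, which mirrors the observation that the followers kept replicating the metadata log while the leader's map was empty. The last case corresponds to the partial workaround of setting the idle timeout to -1.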