[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755997#comment-17755997 ]
Daniel Urban edited comment on KAFKA-15372 at 8/18/23 2:33 PM:
---------------------------------------------------------------

[~gharris1727] I have a bit more information: we can reproduce the issue deterministically with a 2-instance cluster by changing the config and then waiting a long time between the instance restarts. In our specific test we wait 3 minutes after stopping the first MM2 instance, then 3 minutes after starting the first MM2 instance, then 3 minutes after stopping the second MM2 instance, and so on. At the end of this process, the config change has not been applied.

I think this all depends on the rebalance: if the rebalance finishes before the MM2 instance bounces back, the leadership always moves to the other node.

Again, all of this was tested on a 3.4.1 build which has the MM2 internal REST enabled, but I'm pretty sure that trunk is affected as well.


was (Author: durban):
[~gharris1727] I have a bit more information: we managed to deterministically reproduce the issue if we use a 2 instance cluster, and make a change in the config, then wait a long time between the instance restarts. In our specific test we wait 3 minutes after stopping the first MM2 instance, then wait 3 minutes after starting the first MM2 instance, then wait 3 minutes after stopping the second MM2 instance, and so on. At the end of this process, the config change does not get applied.

I think this all depends on the rebalance - if the rebalance finishes before the MM2 instance bounces back, the leadership will always move to the other node.


> MM2 rolling restart can drop configuration changes silently
> -----------------------------------------------------------
>
>                 Key: KAFKA-15372
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15372
>             Project: Kafka
>          Issue Type: Improvement
>          Components: mirrormaker
>            Reporter: Daniel Urban
>            Priority: Major
>
> When MM2 is restarted, it tries to update the Connector configuration in all
> flows. This is a one-time attempt, and it fails if the Connect worker is not
> the leader of the group.
> In a distributed setup and with a rolling restart, it is possible that for a
> specific flow, the Connect worker of the just restarted MM2 instance is not
> the leader, meaning that Connector configurations can get dropped.
> For example, assuming 2 MM2 instances, and one flow A->B:
> # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the
> leader of the A->B Connect group.
> # MM2 instance 1 tries to update the Connector configurations, but fails
> (instance 2 has the leader, not instance 1)
> # MM2 instance 2 is restarted, leadership moves to the worker in MM2 instance 1
> # MM2 instance 2 tries to update the Connector configurations, but fails
> At this point, the configuration changes made before the restart are never
> applied. Many times, this can also happen silently, without any indication.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
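
The four-step sequence in the issue description can be illustrated with a toy simulation. This is a sketch only, not Kafka or MM2 code: `ToyGroup`, `stop`, `start`, and `rolling_restart` are hypothetical names, and the model assumes (as in the comment above) that the rebalance always finishes before the stopped instance comes back, and that a restarted instance makes exactly one attempt to write the connector config.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGroup:
    """Toy model of one Connect group (one MM2 flow). Hypothetical, not Kafka code."""
    members: list
    leader: str
    applied_config: str = "old"
    log: list = field(default_factory=list)

    def stop(self, instance):
        # Assumption: the rebalance completes while the instance is down,
        # so leadership moves to a surviving member.
        self.members.remove(instance)
        if self.leader == instance and self.members:
            self.leader = self.members[0]

    def start(self, instance, new_config):
        # The restarted instance rejoins and makes a single attempt to
        # update the connector config; only the current leader succeeds.
        self.members.append(instance)
        if self.leader == instance:
            self.applied_config = new_config
            self.log.append(f"{instance}: config applied")
        else:
            self.log.append(f"{instance}: not leader, update dropped")


def rolling_restart(group, order, new_config):
    """Restart each instance in turn and return the config that ended up applied."""
    for instance in order:
        group.stop(instance)
        group.start(instance, new_config)
    return group.applied_config


if __name__ == "__main__":
    group = ToyGroup(members=["mm2-1", "mm2-2"], leader="mm2-1")
    print(rolling_restart(group, ["mm2-1", "mm2-2"], "new"))
    print(group.log)
```

Under these assumptions neither instance is ever the leader at the moment it makes its one attempt, so the group finishes the rolling restart still running the old config, whichever instance held the leadership at the start.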