Barnabas Maidics created KAFKA-12798:
----------------------------------------
Summary: Fixing MM2 rebalance timeout issue when source cluster is
not available
Key: KAFKA-12798
URL: https://issues.apache.org/jira/browse/KAFKA-12798
Project: Kafka
Issue Type: Bug
Components: mirrormaker, replication
Reporter: Barnabas Maidics
If the network configuration of a source cluster taking part in a replication
flow is changed (for example, its port number changes because TLS is enabled or
disabled), MirrorMaker2 won't update its internal configuration, even after a
reconfiguration followed by a restart.
What happens in MirrorMaker2 after a cluster "identity" (i.e. connectivity
config) changes:
# MM2 driver (MirrorMaker class) starts up with the new config.
# DistributedHerder joins a dedicated consumer group that decides which driver
instance has control over the assignments and the configuration topic.
# The driver caches the consumer group assignment, which indicates that it is
the leader of the group.
# The driver reads the configuration topic (which does not yet contain the new
config) and starts the MirrorMaker connectors.
# Since the old config is invalid, the connectors cannot connect to the
cluster anymore - MirrorSourceConnector tries to query the cluster through the
admin client, but the queries time out after 2 minutes (there are 2 tasks
affecting the source cluster, each with a 1 minute timeout).
## In the meantime, the background heartbeat thread checks the state of the
herder's consumer group membership, with a default rebalance timeout of 1 minute.
Because the herder thread is blocked by the connector query timeouts, it cannot
call poll on the consumer in time, so the heartbeat thread invalidates the
consumer membership and triggers the creation of a new consumer.
# The herder thread finishes the connector startup, and after realizing that
the configuration has changed, tries to update the config topic.
## The config topic can only be updated by the leader herder.
## The driver checks the group assignment to see if it is the leader.
## The local cache still holds the old assignment, in which the leader is the
previous consumer with its old member ID.
## The current consumer ID of the driver does not match the cached leader ID.
# The driver therefore refuses to update the config topic (see the sketch after this list).
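To make the stale leader check concrete, here is a minimal sketch of the failure mode; the class and field names are hypothetical and do not correspond to the actual DistributedHerder code:
{code:java}
// Hypothetical illustration of the stale-leader problem described above.
// Names (CachedAssignment, leaderId, currentMemberId) are placeholders,
// not the real Connect APIs.
public class LeaderCheckSketch {

    static class CachedAssignment {
        final String leaderId;  // member id of the leader at the time of the last rebalance
        CachedAssignment(String leaderId) { this.leaderId = leaderId; }
    }

    private CachedAssignment cachedAssignment;  // populated when the herder joined the group
    private String currentMemberId;             // changes when the heartbeat thread re-creates the consumer

    boolean isLeader() {
        // The cached assignment still points to the OLD member id, while the
        // herder now sits in the group with a NEW member id, so this returns
        // false and the config-topic update is refused.
        return cachedAssignment != null && cachedAssignment.leaderId.equals(currentMemberId);
    }
}
{code}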
[~durban], thanks for digging deeper into this issue.
*The proposed fix for this:*
The rebalance issue can be fixed by decreasing the time MM2 waits, at startup,
for tasks that affect the source cluster. With a shorter timeout (reduced from
1 minute to 15 seconds by default), the tasks affecting the source cluster
won't block for too long when the stored Kafka config is stale, so the herder
can still update the config topic in time. The timeout is now configurable and
defaults to 15 seconds.
The number of threads in the scheduler also had to be increased so that other
tasks are not blocked.
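As a rough sketch of the idea (not the actual patch), the blocking admin-client queries issued at connector startup can be bounded by a configurable timeout; the {{source.admin.timeout.ms}} property name below is a hypothetical placeholder, not necessarily the config key introduced by the fix:
{code:java}
// Sketch only: bounding a blocking admin-client query with a configurable timeout,
// so an unreachable source cluster cannot block the herder thread for minutes.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedSourceQuery {

    public static Set<String> listSourceTopics(Properties mm2Props) throws Exception {
        // Hypothetical config key; defaults to 15 seconds as described in the fix.
        long timeoutMs = Long.parseLong(
                mm2Props.getProperty("source.admin.timeout.ms", "15000"));

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                mm2Props.getProperty("source.cluster.bootstrap.servers"));

        try (Admin admin = Admin.create(adminProps)) {
            try {
                // Wait at most timeoutMs instead of the admin client's default,
                // so the herder thread is released quickly if the cluster is unreachable.
                return admin.listTopics().names().get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Give up quickly; the herder can then go on and update the config topic.
                return Set.of();
            }
        }
    }
}
{code}
With a 15 second bound, an unreachable source cluster releases the herder thread well before the 1 minute rebalance timeout, so the consumer membership is not invalidated.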
*Testing done:*
# Configured replication between source->target (see the example config after this section)
# Checked that the replication was working
# Changed the source Kafka cluster's broker port
# Restarted Kafka and MirrorMaker2, produced new messages into the replicated topic
# After the restart MM2 kept trying to use the old Kafka config and could not
replicate even after a long time. After applying the fix, the issue was solved
and replication worked.
The same scenario was also tested with SSL being enabled on the source Kafka
cluster instead of changing the port.
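For reference, step 1 of the test can be approximated with an mm2.properties along these lines (cluster aliases and host names are placeholders):
{code}
# Illustrative mm2.properties for the test scenario
clusters = source, target
source.bootstrap.servers = source-kafka:9092
target.bootstrap.servers = target-kafka:9092

# replicate source -> target
source->target.enabled = true
source->target.topics = .*
{code}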