[jira] [Comment Edited] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently

Greg Harris (Jira) Fri, 18 Aug 2023 09:39:04 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756048#comment-17756048
 ]


Greg Harris edited comment on KAFKA-15372 at 8/18/23 4:38 PM:
--------------------------------------------------------------

[~durban] thank you for the sanity check, I see the flaw now.

The NotLeaderException is created/thrown in the Herder, and the caller 
(typically the ConnectorsResource) is responsible for handling the error and 
forwarding to the leader. See HerderRequestHandler#completeOrForwardRequest.
I thought the MirrorMaker class used this handler for applying the connector 
configuration on startup, but it doesn't. It never triggers the forwarding 
logic that normal REST requests have.
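
To make the difference concrete, here is a minimal sketch of the two call paths. It is not the actual Connect code: the Worker and NotLeader types and the completeOrForward helper below are hypothetical, simplified stand-ins, and a plain Map stands in for the shared config topic.

```java
import java.util.HashMap;
import java.util.Map;

class ForwardingSketch {
    /** Hypothetical stand-in for the herder's not-leader error; it names the current leader. */
    static final class NotLeader extends RuntimeException {
        final Worker leader;

        NotLeader(Worker leader) {
            super("not the leader, current leader is " + leader.name);
            this.leader = leader;
        }
    }

    /** Hypothetical, heavily simplified worker: only the leader may write connector configs. */
    static final class Worker {
        final String name;
        final Map<String, String> store; // stands in for the shared config topic
        Worker leader;                   // current group leader, possibly this worker

        Worker(String name, Map<String, String> store) {
            this.name = name;
            this.store = store;
        }

        void putConnectorConfig(String connector, String config) {
            if (leader != this) {
                throw new NotLeader(leader);
            }
            store.put(connector, config);
        }
    }

    /** What a REST-style caller does: try locally and, on NotLeader, forward to the leader. */
    static void completeOrForward(Worker local, String connector, String config) {
        try {
            local.putConnectorConfig(connector, config);
        } catch (NotLeader e) {
            e.leader.putConnectorConfig(connector, config);
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        Worker w1 = new Worker("mm2-1", store);
        Worker w2 = new Worker("mm2-2", store);
        w1.leader = w2;
        w2.leader = w2;

        try {
            // Direct call on a non-leader, with nobody handling the error by forwarding.
            w1.putConnectorConfig("MirrorSourceConnector", "v2");
        } catch (NotLeader e) {
            System.out.println("direct call failed: " + e.getMessage());
        }
        System.out.println("after direct call: " + store);

        // Same request through the forwarding path.
        completeOrForward(w1, "MirrorSourceConnector", "v2");
        System.out.println("after forwarding: " + store);
    }
}
```

In this sketch the direct call leaves the store untouched, while the forwarding path writes the configuration through the leader.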

More importantly, KIP-710 only added _internal REST resources_, which are 
for writing task configurations and fencing zombies. It didn't include the 
connector configuration endpoint, which normally belongs to the public REST API.
This was an oversight in the KIP-710 design/implementation, and should be 
fixed. Currently, KIP-710 only benefits the internal herder requests for 
writing task configurations and fencing zombies; it is not capable of 
forwarding connector configurations.
Since KIP-710 already added the necessary configuration options to 
enable/disable the internal REST API, and this arguably should have been covered 
by the original implementation, I think we can address this in a bug fix.

Alternatively, we could have MirrorMaker periodically apply its local 
configurations, retrying until it receives leadership, and then stopping. This 
wouldn't require changes to the REST API, and would cover some additional error 
scenarios, but would possibly cause configurations to flap between versions as 
leadership changes.
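
A rough sketch of how the retry approach could behave, including one way the flapping might show up. Everything here is hypothetical: the Node type, the tick helper, and the assumption that a node which has not yet applied its configuration may still be running with an older local config file.

```java
import java.util.List;

class RetryUntilLeaderSketch {
    /** Hypothetical MM2 node that keeps retrying until it has applied its local config once. */
    static final class Node {
        final String name;
        final String localConfig;
        boolean applied = false; // once true, the node stops retrying

        Node(String name, String localConfig) {
            this.name = name;
            this.localConfig = localConfig;
        }
    }

    /**
     * One retry round: every node that has not applied yet makes an attempt,
     * and only the current leader succeeds. Returns the stored config afterwards.
     */
    static String tick(List<Node> nodes, Node leader, String stored) {
        for (Node n : nodes) {
            if (!n.applied && n == leader) {
                n.applied = true;
                stored = n.localConfig;
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        Node n1 = new Node("mm2-1", "v2"); // already restarted with the new config
        Node n2 = new Node("mm2-2", "v1"); // assumption: still running the old config, never led yet
        List<Node> nodes = List.of(n1, n2);

        String stored = "v1";
        stored = tick(nodes, n1, stored); // mm2-1 leads and applies its local config
        System.out.println("round 1: " + stored);
        stored = tick(nodes, n2, stored); // leadership moves, mm2-2 applies its older config
        System.out.println("round 2: " + stored);
    }
}
```

Under these assumptions the stored configuration moves forward and then back again as leadership changes, which is the flapping concern.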

To work around this issue, users will need to fully stop and then start MM2, so 
that the first node to start can apply configurations successfully.


> MM2 rolling restart can drop configuration changes silently
> -----------------------------------------------------------
>
>                 Key: KAFKA-15372
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15372
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>            Reporter: Daniel Urban
>            Priority: Major
>             Fix For: 3.6.0
>
>
> When MM2 is restarted, it tries to update the Connector configuration in all 
> flows. This is a one-time attempt, and it fails if the Connect worker is not 
> the leader of the group.
> In a distributed setup and with a rolling restart, it is possible that for a 
> specific flow, the Connect worker of the just restarted MM2 instance is not 
> the leader, meaning that Connector configurations can get dropped.
> For example, assuming 2 MM2 instances, and one flow A->B:
>  # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the 
> leader of A->B Connect group.
>  # MM2 instance 1 tries to update the Connector configurations, but fails 
> (instance 2 has the leader, not instance 1)
>  # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1
>  # MM2 instance 2 tries to update the Connector configurations, but fails
> At this point, the configuration changes made before the restart are never 
> applied. This can often happen silently, without any indication.
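
The sequence in the description can be replayed with a small model. This is only a sketch under simplifying assumptions: the Group type is hypothetical, the longest-running member is treated as the leader (real Connect leader election is more involved), and each started instance makes exactly one attempt to write its local configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RollingRestartSketch {
    /** Hypothetical model of the Connect group for one flow (A->B). */
    static final class Group {
        final List<String> alive = new ArrayList<>(); // join order; first entry is the leader
        String storedConfig = "old";

        Group(String... initialMembers) {
            alive.addAll(Arrays.asList(initialMembers));
        }

        String leader() {
            return alive.isEmpty() ? null : alive.get(0);
        }

        void stop(String instance) {
            alive.remove(instance);
        }

        /** Start an instance and let it make its single attempt to write its local config. */
        boolean start(String instance, String localConfig) {
            alive.add(instance);
            if (!instance.equals(leader())) {
                return false; // not the leader, the one-time attempt fails
            }
            storedConfig = localConfig;
            return true;
        }
    }

    /** Restart the instances one at a time, as in the numbered steps. */
    static String rollingRestart() {
        Group g = new Group("mm2-1", "mm2-2");
        g.stop("mm2-1");
        g.start("mm2-1", "new"); // mm2-2 leads, attempt fails
        g.stop("mm2-2");
        g.start("mm2-2", "new"); // mm2-1 leads, attempt fails
        return g.storedConfig;
    }

    /** Stop everything first, then start again. */
    static String fullRestart() {
        Group g = new Group("mm2-1", "mm2-2");
        g.stop("mm2-1");
        g.stop("mm2-2");
        g.start("mm2-1", "new"); // only member, so it leads and the attempt succeeds
        g.start("mm2-2", "new");
        return g.storedConfig;
    }

    public static void main(String[] args) {
        System.out.println("rolling restart: " + rollingRestart());
        System.out.println("full restart:    " + fullRestart());
    }
}
```

In this model the rolling restart never lets a freshly started instance lead, so the new configuration is never written, whereas a full stop and start lets the first instance apply it.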



--
This message was sent by Atlassian Jira
(v8.20.10#820010)