[
https://issues.apache.org/jira/browse/KUDU-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185230#comment-15185230
]
Todd Lipcon commented on KUDU-1338:
-----------------------------------
Looking at the code, I think the above might be the issue:
TryRemoveFollowerTask does:
{code}
WARN_NOT_OK(ChangeConfig(req, Bind(&DoNothingStatusCB), &error_code),
            state_->LogPrefixThreadSafe() + "Unable to remove follower " + uuid);
{code}
(i.e. it binds DoNothingStatusCB as 'client_cb'). ChangeConfig in turn does:
{code}
RETURN_NOT_OK(ReplicateConfigChangeUnlocked(committed_config, new_config,
                                            Bind(&RaftConsensus::MarkDirtyOnSuccess,
                                                 Unretained(this),
                                                 string("Config change replication complete"),
                                                 client_cb)));
{code}
i.e. 'client_cb' is now a wrapper (MarkDirtyOnSuccess) which still doesn't handle failure.
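To make the chain so far concrete, here is a minimal standalone sketch. It is not Kudu code: `Status`, `StatusCallback`, `DoNothing`, `WrapMarkDirtyOnSuccess` and `RunChain` are hypothetical stand-ins, `std::function` replaces Kudu's `Bind`, and it assumes (from the name only) that MarkDirtyOnSuccess acts solely on an OK status before forwarding to the inner callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for kudu::Status; not the real class.
struct Status {
  bool ok_flag;
  std::string msg;
  bool ok() const { return ok_flag; }
};

using StatusCallback = std::function<void(const Status&)>;

// Plays the role of DoNothingStatusCB: ignores whatever status it receives.
inline void DoNothing(const Status&) {}

// Hypothetical analogue of MarkDirtyOnSuccess: records an event only when the
// status is OK, then forwards the status unchanged to the inner callback.
StatusCallback WrapMarkDirtyOnSuccess(std::vector<std::string>* events,
                                      StatusCallback client_cb) {
  assert(events != nullptr);
  return [events, client_cb](const Status& s) {
    if (s.ok()) events->push_back("marked dirty");
    client_cb(s);  // On failure this is the only thing that runs.
  };
}

// Builds the wrapper around the no-op callback, runs it with `s`, and returns
// the events that were recorded along the way.
std::vector<std::string> RunChain(const Status& s) {
  std::vector<std::string> events;
  StatusCallback cb = WrapMarkDirtyOnSuccess(&events, &DoNothing);
  cb(s);
  return events;
}
```

With a failed status, `RunChain` records nothing: the wrapper skips its success branch and the inner callback discards the error, so under these assumptions no link in the chain reacts to the failure.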
ReplicateConfigChangeUnlocked does:
{code}
round->SetConsensusReplicatedCallback(Bind(&RaftConsensus::NonTxRoundReplicationFinished,
                                           Unretained(this),
                                           Unretained(round.get()),
                                           client_cb));
{code}
NonTxRoundReplicationFinished does:
{code}
if (!status.ok()) {
  // TODO: Do something with the status on failure?
  LOG(INFO) << state_->LogPrefixThreadSafe() << op_type_str << " replication failed: "
            << status.ToString();
  client_cb.Run(status);
  return;
}
{code}
where that TODO looks awfully relevant. If a config change gets aborted, we
probably need to go back to using the old config, right?
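As a rough illustration of what "going back to the old config" could mean, here is a standalone sketch. It is not Kudu's implementation or a proposed patch: `ConfigState` and all of its members are hypothetical, configs are modeled as plain strings, and it assumes that reverting simply means discarding the pending config so the committed one becomes active again.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for kudu::Status; not the real class.
struct Status {
  bool ok_flag;
  std::string msg;
  bool ok() const { return ok_flag; }
};

// Hypothetical model of a replica's committed and pending Raft configs.
class ConfigState {
 public:
  explicit ConfigState(std::string committed)
      : committed_(std::move(committed)), has_pending_(false) {}

  // Rejects a second change while one is in flight, mirroring the
  // "Only one is allowed at a time" error in the attached log.
  bool SetPending(std::string new_config) {
    if (has_pending_) return false;
    pending_ = std::move(new_config);
    has_pending_ = true;
    return true;
  }

  // Invoked when replication of the config change completes or is aborted.
  void ReplicationFinished(const Status& s) {
    assert(has_pending_ && "no config change in flight");
    if (s.ok()) committed_ = pending_;  // Success: promote the pending config.
    pending_.clear();                   // Either way, nothing stays pending,
    has_pending_ = false;               // so a failure reverts to committed_.
  }

  bool has_pending() const { return has_pending_; }
  const std::string& active() const {
    return has_pending_ ? pending_ : committed_;
  }

 private:
  std::string committed_;
  std::string pending_;
  bool has_pending_;
};
```

In this model an aborted change clears the pending slot, so a later `SetPending` call is accepted instead of failing with the "currently pending" error indefinitely.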
> Tablet stuck in RaftConfig change currently pending
> ---------------------------------------------------
>
> Key: KUDU-1338
> URL: https://issues.apache.org/jira/browse/KUDU-1338
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.7.0
> Reporter: Jean-Daniel Cryans
> Priority: Critical
> Attachments: KUDU_TSERVER.node-2.internal.gz,
> KUDU_TSERVER.node-3.internal.gz, KUDU_TSERVER.node-5.internal.gz, logs.tgz
>
>
> We've been adapting the consensus logs for a while and I think we can finally
> get to the bottom of this issue. I'm attaching the logs from the 3 nodes that
> participated in the same config for tablet eaa1877a2b3540cf8202aff844c6ca79.
> ITBLL is driving the load and eventually fails at 2016-02-15 14:53:12,005
> trying to write to node-2 AKA a1081edd2ca24f6b9dcdd7e5000f95ec. The peer that
> gets stuck is node-5 AKA cdec7fdacbac4ad1b095275b3bdbbe5c, starting from this
> line:
> {noformat}
> I0215 14:28:41.585695 2020 raft_consensus_state.cc:459] T
> eaa1877a2b3540cf8202aff844c6ca79 P cdec7fdacbac4ad1b095275b3bdbbe5c [term 69
> FOLLOWER]: Illegal state: RaftConfig change currently pending. Only one is
> allowed at a time.
> {noformat}
> The chaos monkey running on this setup is dropping packets one node at a time.
> I'll attach the logs in a moment.