[
https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613056#comment-15613056
]
Todd Lipcon commented on KUDU-1735:
---
Here are the interesting logs.
When the tablet starts up, we can see that a few CONFIG_CHANGEs are pending:
{code}
I1026 22:09:10.486959 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:
ReplayState: Previous OpId: 1.2288, Committed OpId: 1.2287, Pending Replicates:
3, Pending Commits: 1
I1026 22:09:10.487023 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping
REPLICATES:
I1026 22:09:10.487025 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type:
REPLICATE replicate { id { term: 1 index: 2286 } timestamp: 6052023659982815232
op_type: WRITE_OP write_request { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1"
schema { columns { name: "
I1026 22:09:10.492393 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type:
REPLICATE replicate { id { term: 1 index: 2287 } timestamp: 6052023660001042432
op_type: CHANGE_CONFIG_OP change_config_record { tablet_id:
"e58b90e0125e4fbfb967b85066f2bac1" old_config
I1026 22:09:10.492408 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type:
REPLICATE replicate { id { term: 1 index: 2288 } timestamp: 6052023660156993536
op_type: CHANGE_CONFIG_OP change_config_record { tablet_id:
"e58b90e0125e4fbfb967b85066f2bac1" old_config
I1026 22:09:10.492410 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping
COMMITS:
I1026 22:09:10.492413 9692 tablet_bootstrap.cc:1019] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type:
COMMIT commit { op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 2287
} }
{code}
According to the log, we don't have 2288 committed yet. However, the metadata
is actually present on disk, so the committed config has index 2288:
{code}
I1026 22:09:10.657603 9692 raft_consensus.cc:286] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1
FOLLOWER]: Replica starting. Triggering 3 pending transactions. Active config:
opid_index: 2288 OBSOLETE_local: false peers { permanent_uuid:
"3c305734ab9d4e0ebfbd0def74841a5d" member_type: V
{code}
As a result, we ignore the pending config change:
{code}
I1026 22:09:10.658336 9692 raft_consensus.cc:630] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1
FOLLOWER]: Ignoring setting pending config change with OpId 1.2288 because the
committed config has OpId index 2288. The config change we are ignoring is: Old
config: { opid_index: 2287 OBSOL
{code}
But then, when the DeleteTablet request arrives, we hit the crash:
{code}
I1026 22:09:11.081184 14990 tablet_service.cc:660] Processing DeleteTablet for
tablet e58b90e0125e4fbfb967b85066f2bac1 with delete_type TABLET_DATA_TOMBSTONED
(TS c9633273962a4521a32d5e177a118a84 not found in new config with opid_index
7024) from {real_user=kudu, eff_user=} at 10.17.224.34:58742
I1026 22:09:11.081382 14990 raft_consensus_state.cc:370] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Trying to
abort 1 pending transactions.
I1026 22:09:11.081481 14990 raft_consensus_state.cc:375] T
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Aborting
transaction as it isn't in flight: id { term: 1 index: 2288 } timestamp:
6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id:
"e58b90e0125e4fbfb967b85066f2bac
F1026 22:09:11.081563 14990 raft_consensus.cc:2055] Check failed:
state_->IsConfigChangePendingUnlocked() T e58b90e0125e4fbfb967b85066f2bac1 P
c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Aborting CHANGE_CONFIG_OP
but there was no pending config set. Op: id { term: 1 index: 2288 } timestamp:
6052023660156993536
{code}
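The sequence in these logs can be reproduced with a small toy model. This is only a sketch: `ReplicaModel` and its methods are hypothetical names, not Kudu's classes, and the "ignore" condition (op index not greater than the committed config index) is an assumption inferred from the log message above. It shows that once the pending config is ignored while the operation itself stays tracked as pending, a later abort finds no pending config and trips the check.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplicaModel:
    """Hypothetical toy model of one replica; not Kudu's real classes."""
    committed_config_index: int                 # opid_index of the config on disk
    pending_config_index: Optional[int] = None  # set only if a change is pending
    pending_ops: List[int] = field(default_factory=list)

    def replay_config_change(self, op_index: int) -> bool:
        """Replay a CHANGE_CONFIG_OP that has no COMMIT message in the log.

        Returns True if a pending config was set, False if it was ignored.
        """
        self.pending_ops.append(op_index)  # the op itself is still tracked
        # Assumed condition: the on-disk config already covers this index.
        if op_index <= self.committed_config_index:
            return False
        self.pending_config_index = op_index
        return True

    def abort_pending(self) -> None:
        """Abort all pending ops, as a DeleteTablet request would."""
        while self.pending_ops:
            op_index = self.pending_ops.pop()
            # Stands in for CHECK(state_->IsConfigChangePendingUnlocked()).
            if self.pending_config_index is None:
                raise AssertionError(
                    f"aborting CHANGE_CONFIG_OP {op_index} "
                    "but there was no pending config set")
            self.pending_config_index = None


# Committed config on disk already has index 2288; the log replays op 2288.
replica = ReplicaModel(committed_config_index=2288)
ignored = not replica.replay_config_change(2288)
try:
    replica.abort_pending()
    crashed = False
except AssertionError:
    crashed = True
print(ignored, crashed)  # prints: True True
```

In the same model, a replica whose on-disk config is still at index 2287 sets the pending config for 2288 and aborts cleanly, which is the case the check was written for.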
> CHECK failure when aborting an ignored config change operation
> --
>
> Key: KUDU-1735
> URL: https://issues.apache.org/jira/browse/KUDU-1735
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.0.1
> Reporter: Todd Lipcon
> Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config
> to disk), but crashes before it can write the associated COMMIT message to
> the log
> - the server is down for long enough that it is removed from the
> configuration again
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending