[ https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613056#comment-15613056 ]
Todd Lipcon commented on KUDU-1735: ----------------------------------- Here's the interesting logs: When the tablet starts up, we can see that a few CONFIG_CHANGEs are pending {code} I1026 22:09:10.486959 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: ReplayState: Previous OpId: 1.2288, Committed OpId: 1.2287, Pending Replicates: 3, Pending Commits: 1 I1026 22:09:10.487023 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping REPLICATES: I1026 22:09:10.487025 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: type: REPLICATE replicate { id { term: 1 index: 2286 } timestamp: 6052023659982815232 op_type: WRITE_OP write_request { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" schema { columns { name: " I1026 22:09:10.492393 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: type: REPLICATE replicate { id { term: 1 index: 2287 } timestamp: 6052023660001042432 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" old_config I1026 22:09:10.492408 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: type: REPLICATE replicate { id { term: 1 index: 2288 } timestamp: 6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" old_config I1026 22:09:10.492410 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping COMMITS: I1026 22:09:10.492413 9692 tablet_bootstrap.cc:1019] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: type: COMMIT commit { op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 2287 } } {code} according to the log, we don't have 2288 committed yet. However, the metadata is actually present on disk so the committed config has index 2288: {code} I1026 22:09:10.657603 9692 raft_consensus.cc:286] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Replica starting. Triggering 3 pending transactions. Active config: opid_index: 2288 OBSOLETE_local: false peers { permanent_uuid: "3c305734ab9d4e0ebfbd0def74841a5d" member_type: V {code} so we ignore the pending config change: {code} I1026 22:09:10.658336 9692 raft_consensus.cc:630] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Ignoring setting pending config change with OpId 1.2288 because the committed config has OpId index 2288. The config change we are ignoring is: Old config: { opid_index: 2287 OBSOL {code} but then when the DeleteTablet comes, we get the crash: {code} I1026 22:09:11.081184 14990 tablet_service.cc:660] Processing DeleteTablet for tablet e58b90e0125e4fbfb967b85066f2bac1 with delete_type TABLET_DATA_TOMBSTONED (TS c9633273962a4521a32d5e177a118a84 not found in new config with opid_index 7024) from {real_user=kudu, eff_user=} at 10.17.224.34:58742 I1026 22:09:11.081382 14990 raft_consensus_state.cc:370] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Trying to abort 1 pending transactions. I1026 22:09:11.081481 14990 raft_consensus_state.cc:375] T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Aborting transaction as it isn't in flight: id { term: 1 index: 2288 } timestamp: 6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: "e58b90e0125e4fbfb967b85066f2bac F1026 22:09:11.081563 14990 raft_consensus.cc:2055] Check failed: state_->IsConfigChangePendingUnlocked() T e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Aborting CHANGE_CONFIG_OP but there was no pending config set. Op: id { term: 1 index: 2288 } timestamp: 6052023660156993536 {code} > CHECK failure when aborting an ignored config change operation > -------------------------------------------------------------- > > Key: KUDU-1735 > URL: https://issues.apache.org/jira/browse/KUDU-1735 > Project: Kudu > Issue Type: Bug > Components: consensus > Affects Versions: 1.0.1 > Reporter: Todd Lipcon > Priority: Critical > > The following sequence causes a CHECK failure: > - a tablet server receives a CONFIG_CHANGE operation > - the tablet server commits the operation (writing the new consensus config > to disk), but crashes before it can write the associated COMMIT message to > the log > - the server is down for long enough that it is removed from the > configuration again while it's down > - when it comes back up, it sees the CONFIG_CHANGE again as a pending > replicate. When it's added to PendingRounds, it is ignored as we can see that > this configuration is already committed. > - the tserver gets the request from the master to DeleteTablet because it's > no longer part of the configuration > -- when trying to abort the operation, it fires a CHECK "Aborting > CHANGE_CONFIG_OP but there was no pending config set." -- This message was sent by Atlassian JIRA (v6.3.4#6332)