[jira] [Commented] (KUDU-1735) CHECK failure when aborting an ignored config change operation

2016-10-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613069#comment-15613069
 ] 

Todd Lipcon commented on KUDU-1735:
---

Brainstorming a couple fixes here:
- perhaps we should wait until all COMMIT messages are flushed before we flush 
the new cmeta to disk when a config change is committed?
- the 'Abort a failed config change' code path seems to be somewhat incorrect 
as it also doesn't verify that the pending config change that's being cleared 
matches the one that failed
- maybe the CHECK is invalid if we see that we're aborting a config change that 
is older than the current committed one?

Would be good to figure out why our various consensus stress tests don't 
trigger this behavior and add one that does (in addition to a more specifically 
targeted test case).

[~mpercy] do you have time to take a look at this as the original author of 
this part of the code?

> CHECK failure when aborting an ignored config change operation
> --
>
> Key: KUDU-1735
> URL: https://issues.apache.org/jira/browse/KUDU-1735
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.0.1
>Reporter: Todd Lipcon
>Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config 
> to disk), but crashes before it can write the associated COMMIT message to 
> the log
> - the server is down for long enough that it is removed from the 
> configuration again while it's down
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending 
> replicate. When it's added to PendingRounds, it is ignored as we can see that 
> this configuration is already committed.
> - the tserver gets the request from the master to DeleteTablet because it's 
> no longer part of the configuration
> -- when trying to abort the operation, it fires a CHECK "Aborting 
> CHANGE_CONFIG_OP but there was no pending config set."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KUDU-1735) CHECK failure when aborting an ignored config change operation

2016-10-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613056#comment-15613056
 ] 

Todd Lipcon commented on KUDU-1735:
---

Here's the interesting logs:

When the tablet starts up, we can see that a few CONFIG_CHANGEs are pending
{code}
I1026 22:09:10.486959  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: 
ReplayState: Previous OpId: 1.2288, Committed OpId: 1.2287, Pending Replicates: 
3, Pending Commits: 1
I1026 22:09:10.487023  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping 
REPLICATES: 
I1026 22:09:10.487025  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type: 
REPLICATE replicate { id { term: 1 index: 2286 } timestamp: 6052023659982815232 
op_type: WRITE_OP write_request { tablet_id: "e58b90e0125e4fbfb967b85066f2bac1" 
schema { columns { name: "
I1026 22:09:10.492393  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type: 
REPLICATE replicate { id { term: 1 index: 2287 } timestamp: 6052023660001042432 
op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: 
"e58b90e0125e4fbfb967b85066f2bac1" old_config 
I1026 22:09:10.492408  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type: 
REPLICATE replicate { id { term: 1 index: 2288 } timestamp: 6052023660156993536 
op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: 
"e58b90e0125e4fbfb967b85066f2bac1" old_config 
I1026 22:09:10.492410  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84: Dumping 
COMMITS: 
I1026 22:09:10.492413  9692 tablet_bootstrap.cc:1019] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84:type: 
COMMIT commit { op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 2287 
} }
{code}

according to the log, we don't have 2288 committed yet. However, the metadata 
is actually present on disk so the committed config has index 2288:
{code}
I1026 22:09:10.657603  9692 raft_consensus.cc:286] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 
FOLLOWER]: Replica starting. Triggering 3 pending transactions. Active config: 
opid_index: 2288 OBSOLETE_local: false peers { permanent_uuid: 
"3c305734ab9d4e0ebfbd0def74841a5d" member_type: V
{code}

so we ignore the pending config change:
{code}
I1026 22:09:10.658336  9692 raft_consensus.cc:630] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84 [term 1 
FOLLOWER]: Ignoring setting pending config change with OpId 1.2288 because the 
committed config has OpId index 2288. The config change we are ignoring is: Old 
config: { opid_index: 2287 OBSOL
{code}

but then when the DeleteTablet comes, we get the crash:

{code}
I1026 22:09:11.081184 14990 tablet_service.cc:660] Processing DeleteTablet for 
tablet e58b90e0125e4fbfb967b85066f2bac1 with delete_type TABLET_DATA_TOMBSTONED 
(TS c9633273962a4521a32d5e177a118a84 not found in new config with opid_index 
7024) from {real_user=kudu, eff_user=} at 10.17.224.34:58742
I1026 22:09:11.081382 14990 raft_consensus_state.cc:370] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Trying to 
abort 1 pending transactions.
I1026 22:09:11.081481 14990 raft_consensus_state.cc:375] T 
e58b90e0125e4fbfb967b85066f2bac1 P c9633273962a4521a32d5e177a118a84Aborting 
transaction as it isn't in flight: id { term: 1 index: 2288 } timestamp: 
6052023660156993536 op_type: CHANGE_CONFIG_OP change_config_record { tablet_id: 
"e58b90e0125e4fbfb967b85066f2bac
F1026 22:09:11.081563 14990 raft_consensus.cc:2055] Check failed: 
state_->IsConfigChangePendingUnlocked() T e58b90e0125e4fbfb967b85066f2bac1 P 
c9633273962a4521a32d5e177a118a84 [term 1 FOLLOWER]: Aborting CHANGE_CONFIG_OP 
but there was no pending config set. Op: id { term: 1 index: 2288 } timestamp: 
6052023660156993536 
{code}

> CHECK failure when aborting an ignored config change operation
> --
>
> Key: KUDU-1735
> URL: https://issues.apache.org/jira/browse/KUDU-1735
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.0.1
>Reporter: Todd Lipcon
>Priority: Critical
>
> The following sequence causes a CHECK failure:
> - a tablet server receives a CONFIG_CHANGE operation
> - the tablet server commits the operation (writing the new consensus config 
> to disk), but crashes before it can write the associated COMMIT message to 
> the log
> - the server is down for long enough that it is removed from the 
> configuration again while it's down
> - when it comes back up, it sees the CONFIG_CHANGE again as a pending