[ https://issues.apache.org/jira/browse/KUDU-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170783#comment-15170783 ]
Todd Lipcon commented on KUDU-1354:
-----------------------------------
It seems like we currently clean up transactions in this order:
{noformat}
- TransactionDriver::ApplyTask:
-- transaction_->PreCommit() (calls WriteTransaction::PreCommit)
--- calls release_row_locks()
-- Finalize()
--- calls transaction_->Finish()
---- calls state()->Commit()
----- calls mvcc_tx_->Commit()
{noformat}
Given that we release the row lock before we commit the MVCC transaction, it's
quite possible that, before we get a chance to MVCC-commit the UPDATE, a DELETE
comes in for the same row and commits itself, and a compaction captures that
snapshot.
[~dralves] do you recall why we clean up in this order? It seems like we should
be holding the row locks until after marking the MVCC transaction committed.
Maybe it was a case of premature optimization for concurrency?
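The race described above can be reduced to a deterministic toy model. This is a hypothetical illustration, not Kudu code: `CleanupStep`, `SnapshotAfterDelete`, and `IsOutOfOrder` are invented names. The model assumes the worst-case interleaving, in which the UPDATE's thread (timestamp 2) stalls as soon as it gives up the row lock, while the DELETE on the same row (timestamp 4) acquires the lock, runs to completion, and a flush snapshot is taken immediately afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy model, not Kudu code: the two cleanup steps whose relative
// order is in question in TransactionDriver::ApplyTask.
enum class CleanupStep { kReleaseRowLocks, kMvccCommit };

// Timestamps of the two writes to the same row, as renumbered in the issue.
constexpr uint64_t kUpdateTs = 2;
constexpr uint64_t kDeleteTs = 4;

// Returns the timestamps a flush snapshot would consider committed if it were
// taken right after the DELETE commits. Worst case assumed: the UPDATE's
// thread runs its cleanup steps only until it releases the row lock, then
// stalls; the DELETE then takes the lock and commits fully.
std::set<uint64_t> SnapshotAfterDelete(
    const std::vector<CleanupStep>& update_cleanup_order) {
  std::set<uint64_t> committed;
  for (CleanupStep step : update_cleanup_order) {
    if (step == CleanupStep::kMvccCommit) {
      committed.insert(kUpdateTs);
    } else {
      break;  // Row lock released: the DELETE may now acquire it.
    }
  }
  // The DELETE acquires the row lock, applies, and MVCC-commits.
  committed.insert(kDeleteTs);
  return committed;
}

// A snapshot is out of order if it sees the later write but not the earlier.
bool IsOutOfOrder(const std::set<uint64_t>& committed) {
  return committed.count(kDeleteTs) > 0 && committed.count(kUpdateTs) == 0;
}
```

With the current order, `IsOutOfOrder(SnapshotAfterDelete({CleanupStep::kReleaseRowLocks, CleanupStep::kMvccCommit}))` is true; swapping the two steps makes it false. The price of the swap is that row locks stay held slightly longer, across the MVCC commit.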
> MVCC Snapshots chosen during flush can contain out-of-order transactions
> ------------------------------------------------------------------------
>
> Key: KUDU-1354
> URL: https://issues.apache.org/jira/browse/KUDU-1354
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.7.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> I spent a while trying to debug a failure of alter_table-randomized-test and
> found the following interesting logs:
> - We have two operations in the WAL which arrived in quick succession (about
> 4 ms apart) just before an alter table. I've renumbered the txids for
> readability here:
> {noformat}
> 1.13@2 REPLICATE WRITE_OP
> op 0: MUTATE (int32 key=1643562) SET c6=1107303203
> 1.14@4 REPLICATE WRITE_OP
> op 0: MUTATE (int32 key=1643562) DELETE
> {noformat}
> - The flush that was caused by the alter table has the following snapshots:
> {noformat}
> ... Phase 1 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4))]
> ...
> ... Phase 2 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4, 2))]
> {noformat}
> Note that the first snapshot considers the 'DELETE' committed but not the
> 'UPDATE'. We then fill in the 'UPDATE' in the second snapshot. The end result
> here is that we end up flushing REDO deltas as follows:
> - REDO file 1 (flushed in phase 1): includes only the DELETE
> - REDO file 2 (flushed after ReupdateMissedDeltas): includes only the UPDATE
> When we later proceed to compact this rowset, we get "Check failed:
> !is_deleted Got UPDATE for deleted row."
> Scenarios like this seem to reproduce a few tenths of a percent of the time
> in this stress test.
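The snapshot notation and the resulting REDO split can be sketched in a toy model as well. This is a hypothetical illustration, not Kudu's `MvccSnapshot` or compaction code: `ToySnapshot`, `Mutation`, `SplitRedos`, and `ReplayRedoFiles` are invented names. `ToySnapshot` reads the log line "committed={T|T < 2 or (T in (4))]" as "every timestamp below 2, plus 4". The replay assumes, as the reported failure suggests, that compaction applies REDO files in the order they were flushed and enforces the invariant that no UPDATE follows a DELETE of the same row.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of the snapshot notation in the log: every timestamp
// below the watermark is committed, plus the explicitly listed timestamps.
struct ToySnapshot {
  uint64_t all_committed_before;
  std::set<uint64_t> committed;

  bool IsCommitted(uint64_t ts) const {
    return ts < all_committed_before || committed.count(ts) > 0;
  }
};

enum class MutationType { kUpdate, kDelete };

struct Mutation {
  uint64_t ts;
  MutationType type;
};

// Routes each mutation of a row to the REDO file flushed in phase 1 (if the
// phase 1 snapshot sees it as committed) or to the file written after
// re-updating missed deltas (if only the phase 2 snapshot sees it).
void SplitRedos(const std::vector<Mutation>& mutations,
                const ToySnapshot& phase1, const ToySnapshot& phase2,
                std::vector<Mutation>* file1, std::vector<Mutation>* file2) {
  for (const Mutation& m : mutations) {
    if (phase1.IsCommitted(m.ts)) {
      file1->push_back(m);
    } else if (phase2.IsCommitted(m.ts)) {
      file2->push_back(m);
    }
  }
}

// Replays REDO files in the order given and checks the invariant behind
// "Check failed: !is_deleted": no UPDATE may follow a DELETE of the row.
std::string ReplayRedoFiles(const std::vector<std::vector<Mutation>>& files) {
  bool is_deleted = false;
  for (const auto& file : files) {
    for (const Mutation& m : file) {
      if (m.type == MutationType::kUpdate && is_deleted) {
        return "Got UPDATE for deleted row";
      }
      if (m.type == MutationType::kDelete) {
        is_deleted = true;
      }
    }
  }
  return "OK";
}
```

With the two snapshots from the log (`ToySnapshot{2, {4}}` and `ToySnapshot{2, {4, 2}}`), the DELETE at timestamp 4 lands in the first file and the UPDATE at timestamp 2 in the second, so replaying the files in flush order reports `Got UPDATE for deleted row`; replaying the same two mutations in timestamp order returns `OK`.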