[
https://issues.apache.org/jira/browse/KUDU-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171168#comment-15171168
]
David Alves commented on KUDU-1354:
-----------------------------------
After discussing this on Slack, this appears to be a bug in the way we
release locks and then MVCC commit transactions that have intersecting read
sets.
Some preliminary thoughts on possible alternatives:
1 - Forego releasing locks before the mvcc commit:
Todd suggests we could simply release locks after we commit, making sure that
the two transactions never overlap. This is the simple option and likely the
one we should implement first.
It has the disadvantage of making all transactions queued to acquire locks wait
the additional time between the instant at which we would currently release
locks (before MVCC commit) and the new instant (after MVCC commit). But this is
likely not so bad, as this interval is usually not proportional to the txn size
and the prepare thread would block anyway.
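To illustrate the ordering option 1 relies on, here is a minimal Python sketch. It is not Kudu code: ToyMvcc, run_txn and demo are hypothetical names, and a single threading.Lock stands in for the row lock. The only point is that if the lock is held across both apply and MVCC commit, the commit order cannot diverge from the apply order.

```python
import threading


class ToyMvcc:
    """Hypothetical stand-in for an MVCC manager; it only records commit order."""

    def __init__(self):
        self._mu = threading.Lock()
        self.commit_order = []

    def commit(self, txid):
        with self._mu:
            self.commit_order.append(txid)


def run_txn(txid, row_lock, mvcc, apply_order):
    # Option 1: the row lock is held across apply AND the MVCC commit,
    # so no other txn on the same row can slip in between the two steps.
    with row_lock:
        apply_order.append(txid)  # "apply" the mutation
        mvcc.commit(txid)         # commit before the lock is released


def demo(num_txns=8):
    mvcc = ToyMvcc()
    row_lock = threading.Lock()
    apply_order = []
    threads = [
        threading.Thread(target=run_txn, args=(txid, row_lock, mvcc, apply_order))
        for txid in range(num_txns)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return apply_order, mvcc.commit_order
```

Whatever order the threads win the lock in, demo() returns two identical lists. Moving mvcc.commit(txid) outside the with block would reintroduce the window in which a later txn can commit first.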
2 - Keep track of the dependencies and make sure the transactions commit in
order:
This would make much more sense if the lock manager were able to have per-lock
wait queues, since we'd already be tracking the dependencies, in a way.
We could do something like:
Tx1 acquires locks with LOCK_EXCLUSIVE and marks itself as the lock owner.
Tx1 goes through prepare, apply, etc., then changes the locks to LOCK_SHARED.
Tx1 MVCC commits and removes itself from all lock queues.
Tx2 when acquiring the locks:
- if it observes a lock with no owner, acquires it as LOCK_EXCLUSIVE and marks
itself as the owner.
- if it observes a lock with LOCK_EXCLUSIVE, adds itself to the wait queue and
adds the owner to its dependency set.
- if it observes a lock with LOCK_SHARED, changes it to LOCK_EXCLUSIVE, marks
itself as the owner and adds the previous owner to its dependency set.
Tx2 can release locks as soon as it applies.
Tx2 won't MVCC commit until all txns in its dependency set have committed.
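A rough single-threaded Python sketch of the protocol above, just to check that the steps hang together. Everything here is hypothetical (Txn, RowLock, acquire, downgrade, try_commit are illustrative names, not lock manager APIs), it tracks one lock only, and it assumes that the first waiter takes over the lock as soon as the owner downgrades to shared, which the outline above does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    EXCLUSIVE = auto()
    SHARED = auto()


@dataclass
class Txn:
    name: str
    deps: set = field(default_factory=set)  # names of txns that must commit first


@dataclass
class RowLock:
    owner: Optional[Txn] = None
    mode: Optional[Mode] = None
    waiters: list = field(default_factory=list)


def acquire(lock, txn):
    """Return True if txn now owns the lock, False if it was queued."""
    if lock.owner is None:
        lock.owner, lock.mode = txn, Mode.EXCLUSIVE
        return True
    txn.deps.add(lock.owner.name)  # remember whom we must commit after
    if lock.mode is Mode.EXCLUSIVE:
        lock.waiters.append(txn)
        return False
    # SHARED: the previous owner has applied but not yet committed.
    lock.owner, lock.mode = txn, Mode.EXCLUSIVE
    return True


def downgrade(lock, txn):
    """Called once txn has applied. Assumed policy: hand over to the first waiter."""
    assert lock.owner is txn
    lock.mode = Mode.SHARED
    if lock.waiters:
        lock.owner, lock.mode = lock.waiters.pop(0), Mode.EXCLUSIVE


def try_commit(txn, lock, committed):
    """MVCC commit only once every txn in the dependency set has committed."""
    if not txn.deps <= set(committed):
        return False
    committed.append(txn.name)
    if lock.owner is txn:  # remove ourselves from the lock
        lock.owner, lock.mode = None, None
    return True


def demo():
    lock, committed = RowLock(), []
    tx1, tx2 = Txn("tx1"), Txn("tx2")
    acquire(lock, tx1)            # tx1 owns the lock exclusively
    acquire(lock, tx2)            # tx2 is queued and now depends on tx1
    downgrade(lock, tx1)          # tx1 applied; tx2 takes over the lock
    downgrade(lock, tx2)          # tx2 applied
    early = try_commit(tx2, lock, committed)  # refused: tx1 not committed yet
    try_commit(tx1, lock, committed)
    try_commit(tx2, lock, committed)
    return early, committed
```

In demo(), Tx2's first commit attempt is refused because Tx1 is still in its dependency set, and the final commit order is Tx1 then Tx2 even though Tx2 finished applying before Tx1 committed.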
> MVCC Snapshots chosen during flush can contain out-of-order transactions
> ------------------------------------------------------------------------
>
> Key: KUDU-1354
> URL: https://issues.apache.org/jira/browse/KUDU-1354
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.7.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> I spent a while trying to debug a failure of alter_table-randomized-test and
> found the following interesting logs:
> - We have two operations in the WAL which arrived in short succession (about
> 4ms apart) just before an alter table. I've renumbered the txids for
> readability here:
> {noformat}
> 1.13@2 REPLICATE WRITE_OP
> op 0: MUTATE (int32 key=1643562) SET c6=1107303203
> 1.14@4 REPLICATE WRITE_OP
> op 0: MUTATE (int32 key=1643562) DELETE
> {noformat}
> - and the flush that was caused by the alter table has the following snapshots:
> {noformat}
> ... Phase 1 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4))]
> ...
> ... Phase 2 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4, 2))]
> {noformat}
> Note that the first snapshot considers the 'DELETE' committed but not the
> 'UPDATE'. We then fill in the 'UPDATE' in the second snapshot. The end result
> here is that we end up flushing REDO deltas as follows:
> REDO file 1 (flushed in phase 1): includes only the DELETE
> REDO file 2 (flushed after ReupdateMissedDeltas): includes only the UPDATE
> When we later proceed to compact this rowset, we get "Check failed:
> !is_deleted Got UPDATE for deleted row."
> Scenarios like this seem to reproduce a few tenths of a percent of the time
> in this stress test.
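The split of deltas described in the quoted report can be reproduced with a toy model of the two snapshots. This is only an illustrative Python sketch: ToySnapshot is a hypothetical class, not the real MvccSnapshot, and it assumes a snapshot treats a txid as committed if it is below the watermark or listed explicitly, as the logged "T < 2 or (T in (...))" suggests.

```python
class ToySnapshot:
    """Toy model of 'committed = {T | T < watermark or T in extra}'."""

    def __init__(self, watermark, extra):
        self.watermark = watermark
        self.extra = frozenset(extra)

    def is_committed(self, txid):
        return txid < self.watermark or txid in self.extra


# Renumbered txids from the quoted WAL excerpt: the SET is 2, the DELETE is 4.
ops = [(2, "UPDATE"), (4, "DELETE")]

phase1 = ToySnapshot(2, {4})
phase2 = ToySnapshot(2, {4, 2})

# Phase 1 flushes whatever its snapshot considers committed.
redo_file_1 = [op for txid, op in ops if phase1.is_committed(txid)]

# The later pass picks up what phase 2 sees but phase 1 missed.
redo_file_2 = [
    op
    for txid, op in ops
    if phase2.is_committed(txid) and not phase1.is_committed(txid)
]
```

Here redo_file_1 holds only the DELETE and redo_file_2 holds only the UPDATE, which is the ordering that later trips the compaction check.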