[
https://issues.apache.org/jira/browse/DERBY-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Knut Anders Hatlen updated DERBY-2991:
--------------------------------------
Attachment: d2991-2b.stat
d2991-2b.diff
Here's an updated patch (d2991-2b.diff) which addresses the two issues I
was aware of in the 2a patch:
1) Call Page.setRepositionNeeded() in BTreePostCommit.purgeCommittedDeletes()
when a row has been purged.
2) Handle the cases where reposition() can return false (that is, the second
argument to reposition() is false and the row at the current position has been
purged). This led to the following changes:
* BTreeScan.positionAtDoneScanFromClose()
* BTreeScan.reopenScan()
Removed the calls to reposition(). The only reason I could see for these
methods to call reposition() was that some implementations of
BTreeLockingPolicy.unlockScanRecordAfterRead() had asserts that checked that
the page of the current position was latched. Removing the calls (and the
asserts) made the code simpler and removed the need for special handling if
reposition() was unsuccessful.
* B2IRowLockingRR.unlockScanRecordAfterRead()
* B2IRowLocking2.unlockScanRecordAfterRead()
Don't assert that the current leaf is latched, as there is no need for that
latch in order to unlock the record. (See above.)
* BTreeScan.delete()
* BTreeScan.doesCurrentPositionQualify()
* BTreeScan.fetch()
* BTreeScan.isCurrentPositionDeleted()
Make sure that we don't try to release the latch on the current leaf unless
we have actually latched it, since the leaf won't be latched if reposition()
returns false. I don't think any other special handling of purged rows is
needed in these methods. delete() and fetch() throw an exception
(AM_RECORD_NOT_FOUND) if the row has been purged, which sounds reasonable to
me. doesCurrentPositionQualify() and isCurrentPositionDeleted() use the
return value from reposition() to decide what they should return themselves,
which also sounds fine to me (except that I would expect
isCurrentPositionDeleted() to return true if the row was purged, whereas it
currently returns false -- I will file a separate bug for that).
* BTreeMaxScan.fetchMaxRowFromBeginning()
* BTreeForwardScan.fetchRows()
If the row at the current position of the scan has been purged while we were
waiting for a lock (so that reposition(pos, false) returns false), we call
reposition() again with the second argument set to true to reposition on the
row immediately to the left of where the purged row was supposed to be. This
effectively takes one step back in the scan, so we then jump to the top of
the loop body to move one step forward past the purged row.
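For illustration, the step-back-then-forward retry can be modelled with an
ordered set standing in for the index. All names below are invented for the
sketch and are not Derby's actual code:

```java
import java.util.*;

// Simplified model of the retry in BTreeForwardScan.fetchRows(); the
// class and method names are illustrative, not Derby's actual code.
public class ScanModel {
    private final TreeSet<Integer> rows;   // stands in for the indexed rows
    private int currentKey;                // saved key of the scan position

    public ScanModel(Collection<Integer> keys, int start) {
        rows = new TreeSet<>(keys);
        currentKey = start;
    }

    // Models reposition(pos, false): succeeds only if the saved row exists.
    private boolean repositionExact() {
        return rows.contains(currentKey);
    }

    // Models reposition(pos, true): land on the row immediately to the
    // left of where the purged row was supposed to be.
    private void repositionLeft() {
        Integer left = rows.lower(currentKey);
        currentKey = (left == null) ? Integer.MIN_VALUE : left;
    }

    // A row is purged (e.g. by post-commit work) while the scan is unlatched.
    public void purge(int key) {
        rows.remove(key);
    }

    // Return the next row of the scan, or null at the end of the index.
    public Integer fetchNext() {
        if (!repositionExact()) {
            // Current row purged while we waited for a lock: step back to
            // its left neighbour, then fall through to the normal
            // "move one step forward" logic, which skips past the gap.
            repositionLeft();
        }
        Integer next = rows.higher(currentKey);
        if (next != null) {
            currentKey = next;
        }
        return next;
    }

    public static void main(String[] args) {
        ScanModel s = new ScanModel(Arrays.asList(10, 20, 30, 40), 10);
        System.out.println(s.fetchNext());  // 20
        s.purge(20);                        // current row purged mid-scan
        System.out.println(s.fetchNext());  // 30: scan continues without
        System.out.println(s.fetchNext());  // 40  skipping or repeating rows
    }
}
```

After the purge of the current row, the scan steps back to 10 and then forward
to 30, so no remaining row is skipped or returned twice.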
I tested that reposition(pos,false) followed by reposition(pos,true) worked by
setting a breakpoint in the debugger and manually changing values in the page
object and in the position to make the scan code believe that the row had been
purged. As far as I could tell, it worked just as if the scan had found a
deleted row. (There are currently no tests that exercise code paths where
reposition() returns false, and I don't see any easy way to write a test for it
since it would be highly dependent on timing between user threads and service
threads.)
This patch fixes all the issues I'm aware of in the previous patch. Derbyall
and suites.All ran cleanly. Reviews, comments and questions would be
appreciated. Thanks.
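For reviewers, the circular wait described in the issue report below reduces
to a two-lock, two-thread model. The sketch uses plain java.util.concurrent
locks and invented names, not Derby's lock manager, and detects the cycle
with a timed tryLock instead of blocking forever:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of the circular wait: the scan takes the index root lock
// and then wants a row lock; the inserter holds the row lock and then
// wants the index root lock for a page split. Names are hypothetical.
public class LockOrderDemo {
    static final ReentrantLock indexRootLock = new ReentrantLock();
    static final ReentrantLock rowLock = new ReentrantLock();

    // Returns true if the circular wait forms.
    static boolean deadlockForms() throws InterruptedException {
        rowLock.lock();                    // inserter: X lock on its row
        Thread scan = new Thread(() -> {
            indexRootLock.lock();          // select: S lock on index root
            try {
                rowLock.lock();            // ...then waits for the row
                rowLock.unlock();
            } finally {
                indexRootLock.unlock();
            }
        });
        scan.start();
        Thread.sleep(300);                 // let the scan take the root lock
        // Inserter now needs the index root for a split; a timed tryLock
        // stands in for the lock manager's deadlock detection.
        boolean gotRoot = indexRootLock.tryLock(300, TimeUnit.MILLISECONDS);
        if (gotRoot) indexRootLock.unlock();
        rowLock.unlock();                  // break the cycle so threads exit
        scan.join();
        return !gotRoot;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(deadlockForms() ? "deadlock" : "no deadlock");
    }
}
```

With blocking lock() calls instead of the timed tryLock, both threads would
hang forever, which is exactly the INDEX LOCK / ROW LOCK ordering cycle the
issue describes.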
> Index split deadlock
> --------------------
>
> Key: DERBY-2991
> URL: https://issues.apache.org/jira/browse/DERBY-2991
> Project: Derby
> Issue Type: Bug
> Components: Store
> Affects Versions: 10.2.2.0, 10.3.1.4
> Environment: Windows XP, Java 6
> Reporter: Bogdan Calmac
> Assignee: Knut Anders Hatlen
> Attachments: d2991-2a.diff, d2991-2a.stat, d2991-2b.diff,
> d2991-2b.stat, d2991-preview-1a.diff, d2991-preview-1a.stat,
> d2991-preview-1b.diff, d2991-preview-1b.stat, d2991-preview-1c.diff,
> d2991-preview-1c.stat, d2991-preview-1d.diff, d2991-preview-1d.stat,
> d2991-preview-1e.diff, derby.log, InsertSelectDeadlock.java, perftest.diff,
> Repro2991.java, stacktraces_during_deadlock.txt, test-1.diff, test-2.diff,
> test-3.diff
>
>
> After doing some research on the mailing list, it appears that the index
> split deadlock is a known behaviour, so I will start by describing the
> theoretical problem first and then follow with the details of my test case.
> If you have concurrent select and insert transactions on the same table, the
> observed locking behaviour is as follows:
> - the select transaction acquires an S lock on the root block of the index
> and then waits for an S lock on some uncommitted row of the insert transaction
> - the insert transaction acquires X locks on the inserted records and if it
> needs to do an index split creates a sub-transaction that tries to acquire an
> X lock on the root block of the index
> In summary: INDEX LOCK followed by ROW LOCK + ROW LOCK followed by INDEX LOCK
> = deadlock
> In the case of my project this is an important issue (lack of concurrency
> after being forced to use table level locking) and I would like to contribute
> to the project and fix this issue (if possible). I was wondering if someone
> that knows the code can give me a few pointers on the implications of this
> issue:
> - Is this a limitation of the top-down algorithm used?
> - Would fixing it require using a bottom-up algorithm for better
> concurrency (which is certainly non-trivial)?
> - Trying to break the circular locking above, I would first question why
> the select transaction needs to acquire (and hold) a lock on the root
> block of the index. Would it be possible to ensure the consistency of the
> select without locking the index?
> -----
> The attached test (InsertSelectDeadlock.java) tries to simulate a typical
> data collection application; it consists of:
> - an insert thread that inserts records in batch
> - a select thread that 'processes' the records inserted by the other thread:
> 'select * from table where id > ?'
> The derby log provides detail about the deadlock trace and
> stacktraces_during_deadlock.txt shows that the insert thread is doing an index
> split.
> The test was run on 10.2.2.0 and 10.3.1.4 with identical behaviour.
> Thanks,
> Bogdan Calmac.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.