[ https://issues.apache.org/jira/browse/DERBY-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Knut Anders Hatlen updated DERBY-5632:
--------------------------------------

    Attachment: experimental-v1.diff

I think there are two reasons why RAMAccessManager synchronizes on the 
conglomerate cache instance whenever it accesses it:

1) Because it manually faults missing items into the cache, and it needs to 
ensure that no other thread faults in the same item between its calls to 
findCached() and create().

2) Because conglomCacheUpdateEntry() implements a create-or-replace operation, 
which the CacheManager interface does not provide, and it needs to ensure that 
no other thread adds an item with the same key between findCached() and 
create(). (Both races come from the same check-then-create pattern, sketched 
below.)
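
For illustration, here is a rough sketch of the check-then-create pattern that 
the synchronization protects. It is a simplification, not the actual 
RAMAccessManager code; readConglomerateFromStore() and getConglomerate() stand 
in for whatever the real implementation does:

    // Simplified sketch (not the actual Derby code) of why the monitor is
    // held across findCached() and create(): the lookup and the insertion
    // form a single check-then-act step.
    private Conglomerate conglomCacheFind(TransactionManager xact, long conglomid)
            throws StandardException {
        Long key = conglomid;

        synchronized (conglom_cache) {
            CacheableConglomerate entry =
                (CacheableConglomerate) conglom_cache.findCached(key);

            if (entry == null) {
                // Fault the conglomerate in manually. Without the monitor,
                // another thread could call create() for the same key right
                // here, and the second create() would fail.
                Conglomerate c = readConglomerateFromStore(xact, conglomid);
                entry = (CacheableConglomerate) conglom_cache.create(key, c);
            }

            Conglomerate conglom = entry.getConglomerate();
            conglom_cache.release(entry);
            return conglom;
        }
    }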

As mentioned in an earlier comment, I think (1) should be solved by 
implementing CacheableConglomerate.setIdentity(), so that the cache manager 
takes care of faulting in the conglomerate.
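
To make that concrete, here is a hedged sketch of what 
CacheableConglomerate.setIdentity() could look like. It follows the Cacheable 
contract (return this if the identity was set, null if no such conglomerate 
exists). How the transaction is obtained is exactly the open question 
discussed further down, so getTransactionFromContextStack() and 
readConglomerate() are placeholders, not real methods:

    // Hedged sketch, not the actual patch: let the cache manager fault in
    // the conglomerate by implementing Cacheable.setIdentity() on
    // CacheableConglomerate.
    public Cacheable setIdentity(Object key) throws StandardException {
        // Placeholder: which transaction to use here is the open question
        // discussed below.
        TransactionManager xact = getTransactionFromContextStack();

        long conglomid = ((Long) key).longValue();

        // Placeholder for reading the conglomerate from the container.
        Conglomerate newConglom = readConglomerate(xact, conglomid);

        if (newConglom == null) {
            // Per the Cacheable contract, returning null tells the cache
            // manager that no object with this identity exists.
            return null;
        }

        this.conglomid = (Long) key;
        this.conglom = newConglom;
        return this;
    }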

(2) might be solved by adding a create-or-replace operation to the 
CacheManager interface. However, I'm not sure it is needed. The 
conglomCacheUpdateEntry() method is called from only one place, 
RAMTransaction.addColumnToConglomerate(). That method fetches a Conglomerate 
instance from the cache, modifies it, and reinserts it into the cache. The 
instance that is reinserted is the exact same instance that was fetched, so 
the call to conglomCacheUpdateEntry() doesn't really update the conglomerate 
cache; it just replaces an existing entry with itself.
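
An illustrative outline of that call pattern (paraphrased, with arguments 
elided and names approximate, not copied from RAMTransaction):

    // Paraphrased outline of RAMTransaction.addColumnToConglomerate().
    Conglomerate conglom = findExistingConglomerate(conglomid); // the instance held by the cache
    conglom.addColumn(/* ... */);                               // mutates that same instance in place
    // Re-inserts the very same instance the cache already holds, so this
    // "update" just replaces the entry with itself:
    accessmanager.conglomCacheUpdateEntry(conglomid, conglom);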

It looks to me as if conglomCacheUpdateEntry() can be removed, and that will 
take care of (2).

I created an experimental patch, attached as experimental-v1.diff. It removes 
conglomCacheUpdateEntry() as suggested. It also makes CacheableConglomerate 
implement setIdentity() so that conglomCacheFind() doesn't need to fault in 
conglomerates manually.
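
With those two changes, the caller no longer needs the manual fault-in or the 
monitor; conceptually, conglomCacheFind() reduces to something like the 
following (again a sketch, not the literal patch):

    // Sketch of conglomCacheFind() after the patch: CacheManager.find()
    // faults the conglomerate in through CacheableConglomerate.setIdentity(),
    // so no external synchronization on the cache is needed here.
    private Conglomerate conglomCacheFind(long conglomid) throws StandardException {
        CacheableConglomerate entry =
            (CacheableConglomerate) conglom_cache.find(conglomid);

        if (entry == null) {
            return null; // no such conglomerate
        }

        Conglomerate conglom = entry.getConglomerate();
        conglom_cache.release(entry);
        return conglom;
    }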

The patch is not ready for commit, as it doesn't pass all regression tests. But 
it could be used for testing, if someone has a test environment where the 
deadlock can be reliably reproduced.

There was only one failure in the regression tests. store/xaOffline1.sql had a 
diff in one of the transaction table listings, where a transaction showed up in 
the ACTIVE state whereas IDLE was expected.

This probably happens because the transaction used in the 
CacheableConglomerate.setIdentity() method is not necessarily the same as the 
one previously used by RAMAccessManager.conglomCacheFind().

The current implementation of setIdentity() in the patch just fetches the first 
transaction it finds on the context stack. That seems to do the trick in most 
cases, but it doesn't know whether conglomCacheFind() was called with a 
top-level transaction or a nested transaction, as setIdentity() cannot access 
conglomCacheFind()'s parameters. Maybe it can be solved by pushing some other 
context type (with a reference to the correct tx) onto the context stack before 
accessing the conglomerate cache, and letting setIdentity() check that instead?
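
For what it's worth, one way such a context could look (all class, id and 
method names here are invented for illustration, and the exact context API 
calls may differ):

    // Invented sketch of the idea above: push a short-lived context that
    // carries the transaction before touching the conglomerate cache, and
    // let CacheableConglomerate.setIdentity() look it up by a well-known id.
    final class ConglomCacheContext extends ContextImpl {
        static final String CONTEXT_ID = "ConglomCacheContext"; // invented id

        private final TransactionManager xact;

        ConglomCacheContext(ContextManager cm, TransactionManager xact) {
            super(cm, CONTEXT_ID); // the ContextImpl constructor pushes the context
            this.xact = xact;
        }

        TransactionManager getTransaction() {
            return xact;
        }

        public void cleanupOnError(Throwable error) {
            popMe();
        }
    }

    // The caller (e.g. conglomCacheFind()) would wrap the cache access:
    //
    //     ConglomCacheContext ctx = new ConglomCacheContext(cm, xact);
    //     try {
    //         entry = conglom_cache.find(key);
    //     } finally {
    //         ctx.popMe();
    //     }
    //
    // and setIdentity() would fetch the transaction from that context
    // instead of taking the first transaction it finds on the stack.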
                
> Logical deadlock happened when freezing/unfreezing the database
> ---------------------------------------------------------------
>
>                 Key: DERBY-5632
>                 URL: https://issues.apache.org/jira/browse/DERBY-5632
>             Project: Derby
>          Issue Type: Bug
>          Components: Documentation, Services
>    Affects Versions: 10.8.2.2
>         Environment: Oracle M3000/Solaris 10
>            Reporter: Brett Bergquist
>              Labels: derby_triage10_10
>         Attachments: experimental-v1.diff, stack.txt
>
>
> Tried to make a quick database backup by freezing the database, performing a 
> ZFS snapshot, and then unfreezing the database. The database was frozen, but 
> then a connection to the database could not be established to unfreeze the 
> database.
> Looking at the stack trace of the network server, I see 3 threads that are 
> trying to process a connection request. Each of these is waiting on:
>         at org.apache.derby.impl.store.access.RAMAccessManager.conglomCacheFind(Unknown Source)
>         - waiting to lock <0xfffffffd3a7fcc68> (a org.apache.derby.impl.services.cache.ConcurrentCache)
> That object is owned by:
>         - locked <0xfffffffd3a7fcc68> (a org.apache.derby.impl.services.cache.ConcurrentCache)
>         at org.apache.derby.impl.store.access.RAMTransaction.findExistingConglomerate(Unknown Source)
>         at org.apache.derby.impl.store.access.RAMTransaction.openGroupFetchScan(Unknown Source)
>         at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.updateIndexStatsMinion(Unknown Source)
>         at org.apache.derby.impl.services.daemon.IndexStatisticsDaemonImpl.runExplicitly(Unknown Source)
>         at org.apache.derby.impl.sql.execute.AlterTableConstantAction.updateStatistics(Unknown Source)
> which itself is waiting for the object:
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0xfffffffd3ac1d608> (a org.apache.derby.impl.store.raw.log.LogToFile)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.derby.impl.store.raw.log.LogToFile.flush(Unknown Source)
>         - locked <0xfffffffd3ac1d608> (a org.apache.derby.impl.store.raw.log.LogToFile)
>         at org.apache.derby.impl.store.raw.log.LogToFile.flush(Unknown Source)
>         at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.flush(Unknown Source)
> So basically what I think is happening is this: the database is frozen; the 
> statistics are being updated on another thread, which holds the lock on 
> org.apache.derby.impl.services.cache.ConcurrentCache and then waits for the 
> LogToFile lock; and the connecting threads, through which the database would 
> be unfrozen, are waiting to lock 
> org.apache.derby.impl.services.cache.ConcurrentCache in order to connect. It 
> is not a deadlock as far as the JVM is concerned, but it will never leave 
> this state either.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
