https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527

Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 11:06 AM:
-----------------------------------------------------------------------

I can now reliably reproduce such deadlocks using my 
https://github.com/bdelacretaz/osgi-stresser stress test tool - it requires a 
few manual steps but generates deadlocks within a few seconds in my tests.

I'm using the Sling Launchpad for this, as it contains a number of bundles 
that can be uninstalled/started/stopped (like crazy) to expose the problem. It 
looks like frequent package refreshes help expose deadlocks much more quickly.

Here's my failure scenario (using a 1.6.0_37 JVM on Mac OS X):
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure 
it's using the Felix trunk's framework and SCR modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set the OSGi log level to 
DEBUG; use this together with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install it 
at start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet on port 1234

At this point the tool's stress test tasks can be started using the commands 
described at https://github.com/bdelacretaz/osgi-stresser - or simply use * r 
to start all tasks, at which point the tool should display something like:

OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and 
restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names 
(patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to 
update=org.apache.sling.junit.core
OSGI stresser> 

The tasks then do crazy things to the OSGi framework, but (IMO) within the 
bounds of the spec, so they should not cause any deadlocks - yet they do.

The sling/logs/error.log file shows what the tasks are doing, and a good way 
to detect the global/bundle lock deadlock is to try to refresh 
/system/console, which will block if the locks cannot be acquired.
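Besides refreshing /system/console and watching it hang, a suspected deadlock can also be inspected with a programmatic thread dump. A minimal sketch (my own diagnostic snippet, not part of Felix or the stresser tool): note that Felix's bundle/global locks are plain wait/notify locks, so the JVM's built-in deadlock detector will not flag them - instead, look for threads stuck in BLOCKED or WAITING state inside the framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dump the name and state of every live thread, e.g. from a small
// diagnostic bundle, to spot threads wedged on framework locks.
public class LockDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // 'false, false' skips monitor/synchronizer details for brevity.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadName() + " " + info.getThreadState());
        }
    }
}
```

Running jstack against the Sling process gives the same information with full stack traces.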
                
> Prevent Deadlock Situation in Felix.acquireGlobalLock
> -----------------------------------------------------
>
>                 Key: FELIX-3067
>                 URL: https://issues.apache.org/jira/browse/FELIX-3067
>             Project: Felix
>          Issue Type: Improvement
>          Components: Framework
>    Affects Versions: framework-3.0.7, framework-3.0.8, framework-3.0.9, 
> framework-3.2.0, framework-3.2.1, fileinstall-3.1.10
>            Reporter: Felix Meschberger
>         Attachments: FELIX-3067.patch, FELIX-3067-sling.patch
>
>
> Every now and then we encounter deadlock situations which involve the 
> Felix.acquireGlobalLock method. In our use case we have the following aspects 
> which contribute to this:
> (a) The Apache Felix Declarative Services implementation stops components 
> (and thus causes service unregistration) while the bundle lock is being held, 
> because this happens in a SynchronousBundleListener while handling the 
> STOPPING bundle event. We have to do this while the bundle is not yet 
> actually stopped in order to properly stop the bundle's components.
> (b) Implementing a special class loader which involves dynamically resolving 
> packages, which in turn uses the global lock
> (c) Eclipse Gemini Blueprint implementation which operates asynchronously
> (d) synchronization in application classes
> Oftentimes, I would assume that we can self-heal such complex deadlock 
> situations if we let acquireGlobalLock time out. Looking at the callers of 
> acquireGlobalLock, there seems to already be provision to handle this case, 
> since acquireGlobalLock returns true only if the global lock has actually 
> been acquired.
> This issue is kind of a companion to FELIX-3000 where deadlocks involve 
> sending service registration events while holding the bundle lock.
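The timeout-based self-healing proposed above could look roughly like this. This is a minimal sketch, not Felix's actual implementation: the TimedGlobalLock class and its methods are hypothetical, standing in for the acquireGlobalLock contract described in the issue (returning true only if the lock was actually acquired).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a global lock that times out instead of
// blocking forever, so a caller stuck in a cross-lock cycle gets
// 'false' back and can release its own locks and retry.
public class TimedGlobalLock {
    private final ReentrantLock lock = new ReentrantLock();

    // Returns true only if the lock was actually acquired within the
    // timeout, mirroring the contract described for acquireGlobalLock.
    public boolean acquireGlobalLock(long timeoutMillis) {
        try {
            return lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public void releaseGlobalLock() {
        lock.unlock();
    }

    public static void main(String[] args) throws Exception {
        TimedGlobalLock global = new TimedGlobalLock();

        // Simulate another thread holding the global lock for a while.
        Thread holder = new Thread(() -> {
            global.acquireGlobalLock(0);
            try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            global.releaseGlobalLock();
        });
        holder.start();
        Thread.sleep(50); // let the holder grab the lock first

        // Instead of blocking forever, the caller gets 'false' and can back off.
        boolean acquired = global.acquireGlobalLock(100);
        System.out.println("acquired=" + acquired);
        holder.join();
    }
}
```

The key point is that callers already checking acquireGlobalLock's boolean return value would handle the timeout case without further changes.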
