[ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ]
Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 11:04 AM:
-----------------------------------------------------------------------

I can now reliably reproduce such deadlocks using my https://github.com/bdelacretaz/osgi-stresser stress test tool. It requires a few manual steps but generates deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as it contains a number of bundles that can be uninstalled/started/stopped (like crazy) to expose the problem. Lots of package refreshes seem to expose deadlocks much more quickly.

Here's my failure scenario:

# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set the OSGi log level to DEBUG; use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install it from /system/console at start level 1 (so that it doesn't stop itself)
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at https://github.com/bdelacretaz/osgi-stresser - or simply use
{code}
* r
{code}
to start all tasks, at which point the tool should display something like
{code}
OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser>
{code}
The tasks then do crazy things to the OSGi framework, but (IMO) everything they do is allowed by the spec, so it should not cause any deadlocks.

The sling/logs/error.log shows what the tasks are doing, and a good way to detect the global/bundle lock deadlock is to try to refresh /system/console, which will block if the locks cannot be acquired.

> Prevent Deadlock Situation in Felix.acquireGlobalLock
> -----------------------------------------------------
>
>                 Key: FELIX-3067
>                 URL: https://issues.apache.org/jira/browse/FELIX-3067
>             Project: Felix
>          Issue Type: Improvement
>          Components: Framework
>    Affects Versions: framework-3.0.7, framework-3.0.8, framework-3.0.9,
>                      framework-3.2.0, framework-3.2.1, fileinstall-3.1.10
>            Reporter: Felix Meschberger
>         Attachments: FELIX-3067.patch, FELIX-3067-sling.patch
>
>
> Every now and then we encounter deadlock situations which involve the
> Felix.acquireGlobalLock method. In our use case we have the following aspects
> which contribute to this:
> (a) The Apache Felix Declarative Services implementation stops components
> (and thus causes service unregistration) while the bundle lock is being held
> because this happens in a SynchronousBundleListener while handling the
> STOPPING bundle event. We have to do this to ensure the bundle is not really
> stopped yet to properly stop the bundle's components.
> (b) Implementing a special class loader which involves dynamically resolving
> packages, which in turn uses the global lock
> (c) Eclipse Gemini Blueprint implementation which operates asynchronously
> (d) synchronization in application classes
> Oftentimes, I would assume that we can self-heal such complex deadlock
> situations, if we let acquireGlobalLock time out.
> Looking at the callers of acquireGlobalLock, there already seems to be
> provision to handle this case, since acquireGlobalLock returns true only
> if the global lock has actually been acquired.
> This issue is kind of a companion to FELIX-3000, where deadlocks involve
> sending service registration events while holding the bundle lock.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
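
For illustration only, the "let acquireGlobalLock time out" idea from the issue description could look roughly like the sketch below. This is not Felix code: the class TimedGlobalLockSketch, its method signatures, and the timeout values are all hypothetical, and a plain java.util.concurrent ReentrantLock stands in for the framework's global lock. The only property taken from the issue is that the acquire method returns true only when the lock is actually held, so a caller can back off instead of blocking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a framework-wide lock; NOT the Felix implementation. */
class TimedGlobalLockSketch {
    private final ReentrantLock globalLock = new ReentrantLock();

    /**
     * Tries to take the lock, giving up after timeoutMillis.
     * Returns true only if the lock is actually held on return.
     */
    public boolean acquireGlobalLock(long timeoutMillis) {
        try {
            return globalLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public void releaseGlobalLock() {
        globalLock.unlock();
    }

    /** Caller pattern: do the work only if the lock was obtained in time. */
    public boolean runWithGlobalLock(long timeoutMillis, Runnable work) {
        if (!acquireGlobalLock(timeoutMillis)) {
            return false; // timed out: the caller can back off or retry later
        }
        try {
            work.run();
            return true;
        } finally {
            releaseGlobalLock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedGlobalLockSketch lock = new TimedGlobalLockSketch();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        // Another thread keeps the lock, simulating an owner that is stuck.
        Thread holder = new Thread(() -> lock.runWithGlobalLock(1000, () -> {
            held.countDown();
            try {
                done.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        holder.start();
        held.await();

        // While the lock is held elsewhere, this call gives up instead of hanging.
        boolean ran = lock.runWithGlobalLock(200, () -> System.out.println("working"));
        System.out.println("ran while contended: " + ran);

        // Once the owner lets go, the same call goes through.
        done.countDown();
        holder.join();
        ran = lock.runWithGlobalLock(200, () -> System.out.println("working"));
        System.out.println("ran after release: " + ran);
    }
}
```

Whether a timeout is an acceptable recovery strategy depends on what each caller can safely do when the lock is not obtained; the sketch only shows the mechanics of giving up.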