[
https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527
]
Bertrand Delacretaz edited comment on FELIX-3067 at 11/27/12 11:06 AM:
-----------------------------------------------------------------------
I can now reliably reproduce such deadlocks using my
https://github.com/bdelacretaz/osgi-stresser stress test tool - requires a few
manual steps but generates deadlocks after just a few seconds in my tests.
I'm using the Sling Launchpad for this, as it contains a number of bundles
that can be uninstalled/started/stopped (like crazy) to expose the problem. It
looks like lots of package refreshes help expose deadlocks much more quickly.
Here's my failure scenario (using a 1.6.0_37 JVM on macosx):
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure
it's using the Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG,
use with my FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install at
start level 1 (so that it doesn't stop itself) from /system/console
# Connect to the tool's command line using telnet 1234
At this point the tool's stress test tasks can be started using the commands
described at https://github.com/bdelacretaz/osgi-stresser - or simply use * r
to start all tasks, at which point the tool should display something like:
OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and
restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names
(patterns)=[commons, org.apache.felix, slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to
update=org.apache.sling.junit.core
OSGI stresser>
The tasks then do crazy things to the OSGi framework, but (IMO) all of it is
allowed by the spec, so it should not cause any deadlocks - but it does.
The sling/logs/error.log shows what the tasks are doing, and a good way to
detect the global/bundle locks deadlock is to try to refresh /system/console,
which will block if the locks cannot be acquired.
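The /system/console check can also be automated. The following is only a minimal sketch, not part of osgi-stresser or Sling: the class name ConsoleProbe, the 10 second timeout and the http://localhost:8080 address are assumptions for illustration (adjust the host and port to your setup). It treats any HTTP response, including a 401, as "not blocked", and a timeout as a hint that the locks may be stuck.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

/** Hypothetical helper for illustration; not part of osgi-stresser or Sling. */
public class ConsoleProbe {

    /**
     * Returns true if the URL produced an HTTP response (any status code)
     * within timeoutMillis, false if connecting or reading timed out.
     */
    public static boolean respondsWithin(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            // Blocks until the status line arrives or a timeout fires.
            conn.getResponseCode();
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default address; pass another URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/system/console";
        if (respondsWithin(url, 10000)) {
            System.out.println("console responded");
        } else {
            System.out.println("console timed out, possible deadlock");
        }
    }
}
```

A timeout alone does not prove a deadlock (the instance may just be busy), so a thread dump is still needed to confirm which locks are involved.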
> Prevent Deadlock Situation in Felix.acquireGlobalLock
> -----------------------------------------------------
>
> Key: FELIX-3067
> URL: https://issues.apache.org/jira/browse/FELIX-3067
> Project: Felix
> Issue Type: Improvement
> Components: Framework
> Affects Versions: framework-3.0.7, framework-3.0.8, framework-3.0.9,
> framework-3.2.0, framework-3.2.1, fileinstall-3.1.10
> Reporter: Felix Meschberger
> Attachments: FELIX-3067.patch, FELIX-3067-sling.patch
>
>
> Every now and then we encounter deadlock situations which involve the
> Felix.acquireGlobalLock method. In our use case we have the following aspects
> which contribute to this:
> (a) The Apache Felix Declarative Services implementation stops components
> (and thus causes service unregistration) while the bundle lock is being held
> because this happens in a SynchronousBundleListener while handling the
> STOPPING bundle event. We have to do this to ensure the bundle is not really
> stopped yet to properly stop the bundle's components.
> (b) Implementing a special class loader which involves dynamically resolving
> packages which in turn uses the global lock
> (c) Eclipse Gemini Blueprint implementation which operates asynchronously
> (d) synchronization in application classes
> Oftentimes, I would assume that we can self-heal such complex deadlock
> situations if we let acquireGlobalLock time out. Looking at the callers of
> acquireGlobalLock, there already seems to be provision to handle this case,
> since acquireGlobalLock returns true only if the global lock has actually
> been acquired.
> This issue is kind of a companion to FELIX-3000 where deadlocks involve
> sending service registration events while holding the bundle lock.
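The timeout idea from the issue description can be illustrated with a small, self-contained sketch. This is not the Felix framework code: the class TimedGlobalLock and its methods are hypothetical and only show the contract discussed above, namely that acquiring returns true only if the lock was actually obtained and gives up after a timeout instead of waiting forever.

```java
/** Hypothetical illustration only; this is not the Felix framework code. */
public class TimedGlobalLock {

    private final Object monitor = new Object();
    private Thread owner; // guarded by monitor; null when the lock is free

    /**
     * Tries to acquire the lock, waiting at most timeoutMillis.
     * Returns true only if the lock has actually been acquired.
     * The lock is deliberately non-reentrant to keep the sketch short.
     */
    public boolean acquire(long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1000000L;
        synchronized (monitor) {
            while (owner != null) {
                long remainingMillis = (deadline - System.nanoTime()) / 1000000L;
                if (remainingMillis <= 0) {
                    return false; // timed out: caller must handle the failure
                }
                try {
                    monitor.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            owner = Thread.currentThread();
            return true;
        }
    }

    /** Releases the lock and wakes up waiting threads. */
    public void release() {
        synchronized (monitor) {
            if (owner != Thread.currentThread()) {
                throw new IllegalStateException("current thread does not hold the lock");
            }
            owner = null;
            monitor.notifyAll();
        }
    }

    public static void main(String[] args) {
        TimedGlobalLock lock = new TimedGlobalLock();
        System.out.println("first acquire: " + lock.acquire(100));
        // Second attempt cannot succeed while the lock is held, so it times out.
        System.out.println("second acquire: " + lock.acquire(100));
        lock.release();
    }
}
```

With such a contract, a caller that gets false back can abort or retry its operation, which is what would let a deadlocked cycle break instead of hanging indefinitely.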