[
https://issues.apache.org/jira/browse/SLING-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172469#comment-14172469
]
Robert Munteanu commented on SLING-4061:
There is an issue in the {{DiscoveryServiceImpl}} {{activate()}} method, but
I'm not sure it's the root cause. The class leaks a reference to itself before
the {{activate()}} method completes:
{code:java}
// make sure the first heartbeat is issued as soon as possible - which
// is right after this service starts. since the two (discoveryservice
// and heartbeatHandler need to know each other, the discoveryservice
// is passed on to the heartbeatHandler in this initialize call).
heartbeatHandler.initialize(this,
        clusterViewService.getIsolatedClusterViewId());
final TopologyEventListener[] registeredServices;
synchronized (lock) {
    registeredServices = this.eventListeners;
    doUpdateProperties();
    TopologyViewImpl newView = (TopologyViewImpl) getTopology();
    TopologyEvent event = new TopologyEvent(Type.TOPOLOGY_INIT, null,
            newView);
    for (final TopologyEventListener da : registeredServices) {
        sendTopologyEvent(da, event);
    }
    activated = true;
    oldView = newView;
}
{code}
The deadlock itself is a lock ordering issue:
- in thread pool-5-thread-1 the HeartbeatHandler wants to issue an update and
holds the DiscoveryServiceImpl.lock lock, but can't acquire the
SegmentNodeStoreService lock
- in thread CM Event Dispatcher... the SegmentNodeStoreService holds its own
lock and the call stack ends up trying to invoke
DiscoveryServiceImpl.bindTopologyEventListener, which needs the
DiscoveryServiceImpl.lock
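The inverted lock order described above can be reproduced in isolation. The following is only a minimal sketch, not the Sling or Oak code: {{discoveryLock}} and {{nodeStoreLock}} are hypothetical stand-ins for the two locks, and {{ReentrantLock.tryLock}} with a timeout is swapped in for {{synchronized}} so that the demo reports the failure instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderDemo {
    // Hypothetical stand-ins for DiscoveryServiceImpl.lock and the SegmentNodeStoreService lock.
    static final ReentrantLock discoveryLock = new ReentrantLock();
    static final ReentrantLock nodeStoreLock = new ReentrantLock();

    /** Holds first, waits until the peer thread holds its own lock, then tries second with a timeout. */
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHolding, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHolding.countDown();
            bothHolding.await();          // both threads now hold their first lock
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();            // keep holding first until the peer has tried too
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Runs two threads with opposite lock orders and returns how many acquisitions timed out. */
    static int countTimeouts() {
        CountDownLatch bothHolding = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger timeouts = new AtomicInteger();
        Thread heartbeat = new Thread(() -> {
            if (!holdThenTry(discoveryLock, nodeStoreLock, bothHolding, bothTried)) {
                timeouts.incrementAndGet();
            }
        }, "heartbeat");
        Thread dispatcher = new Thread(() -> {
            if (!holdThenTry(nodeStoreLock, discoveryLock, bothHolding, bothTried)) {
                timeouts.incrementAndGet();
            }
        }, "dispatcher");
        heartbeat.start();
        dispatcher.start();
        try {
            heartbeat.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        System.out.println("timed-out acquisitions: " + countTimeouts());
    }
}
```

With plain {{synchronized}} blocks in place of the timed {{tryLock}}, the same two threads would block forever, which matches the pattern in the attached thread dump.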
I wonder whether we need more fine-grained locking in the DiscoveryServiceImpl
- a single lock object seems too coarse-grained, especially since a lot seems
to happen during calls like updateProperties(), including invocation of foreign
code (notifying event listeners), which is a bit worrisome: invoking foreign
code with locks held is prone to deadlocks.
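One common way to avoid calling foreign code with a lock held is to copy the listener list under the lock and notify outside of it. The sketch below illustrates that idea only; {{Listener}} and {{ListenerRegistry}} are hypothetical names, not the Sling {{TopologyEventListener}} API, and the real {{DiscoveryServiceImpl}} has more state to keep consistent than this shows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener type, standing in for an event listener interface.
interface Listener {
    void handle(String event);
}

class ListenerRegistry {
    private final Object lock = new Object();
    private final List<Listener> listeners = new ArrayList<>();

    void bind(Listener listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    int size() {
        synchronized (lock) {
            return listeners.size();
        }
    }

    void fire(String event) {
        final List<Listener> snapshot;
        synchronized (lock) {
            // Only the copy happens under the lock.
            snapshot = new ArrayList<>(listeners);
        }
        // Foreign code runs with no lock held.
        for (Listener listener : snapshot) {
            listener.handle(event);
        }
    }
}

class OpenCallDemo {
    public static void main(String[] args) {
        ListenerRegistry registry = new ListenerRegistry();
        // This listener waits for another thread that calls bind(). If fire()
        // still held the lock during the callback, that thread could never
        // enter bind() and the join() below would block forever.
        registry.bind(event -> {
            Thread binder = new Thread(() -> registry.bind(e -> { }));
            binder.start();
            try {
                binder.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        registry.fire("TOPOLOGY_INIT");
        System.out.println(registry.size()); // prints 2
    }
}
```

The trade-off is that a listener unbound after the snapshot was taken may still receive one more event, so listeners have to tolerate that.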
Another alternative is to make use of concurrent collections, e.g. for event
listeners, but I'm not sure we won't get bitten by the fact that they are
weakly consistent.
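To make the consistency concern concrete, here is a small sketch using {{java.util.concurrent.CopyOnWriteArrayList}}, whose iterators work on a snapshot of the list. {{Consumer<String>}} is a hypothetical stand-in for the listener type, and whether this collection would be the right choice for {{DiscoveryServiceImpl}} is an open question.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class CowListenersDemo {
    // Hypothetical listener list; Consumer<String> stands in for an event listener.
    static final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    /** Delivers the event to the listeners present when iteration started; returns how many were called. */
    static int fire(String event) {
        int delivered = 0;
        for (Consumer<String> listener : listeners) { // iterates over a snapshot, no lock held
            listener.accept(event);
            delivered++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // The first listener binds a second one while the event is being delivered.
        listeners.add(event -> listeners.add(e -> { }));
        int delivered = fire("TOPOLOGY_INIT");
        // prints "1 delivered, 2 registered"
        System.out.println(delivered + " delivered, " + listeners.size() + " registered");
    }
}
```

The listener bound during delivery is registered but misses the event in flight, so something else would have to make sure a late listener still gets its initial event.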
Deadlock involving discovery services at startup with Oak
-
Key: SLING-4061
URL: https://issues.apache.org/jira/browse/SLING-4061
Project: Sling
Issue Type: Bug
Components: Extensions
Reporter: Bertrand Delacretaz
Attachments: discovery-deadlock.txt
I just got a deadlock at startup when starting the launchpad integration
tests instance on sling trunk revision 1632058 (so starting with Oak):
{code}
export DBG="-Xmx1G -XX:MaxPermSize=256m -agentlib:jdwp=transport=dt_socket,address=30303,server=y,suspend=n"
export MAVEN_OPTS="-Xmx1G -XX:MaxPermSize=256m $DBG -Dsling.run.modes=oak"
cd launchpad/testing
mvn launchpad:run
{code}
I'll attach the stack trace. The discovery HeartbeatHandler and
DiscoveryServiceImpl classes are involved.
The deadlock happens often on my box (macosx 10.9.5, java version
1.7.0_45), with the same deadlock pattern AFAICS.