[
https://issues.apache.org/jira/browse/SLING-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172469#comment-14172469
]
Robert Munteanu commented on SLING-4061:
----------------------------------------
There is an issue in the {{DiscoveryServiceImpl}} activate method, but I'm
not sure it's the root cause. The class leaks a reference to itself before the
{{activate()}} method completes:
{code:java}
// make sure the first heartbeat is issued as soon as possible - which
// is right after this service starts. since the two (discoveryservice
// and heartbeatHandler need to know each other, the discoveryservice
// is passed on to the heartbeatHandler in this initialize call).
heartbeatHandler.initialize(this,
        clusterViewService.getIsolatedClusterViewId());
final TopologyEventListener[] registeredServices;
synchronized (lock) {
    registeredServices = this.eventListeners;
    doUpdateProperties();
    TopologyViewImpl newView = (TopologyViewImpl) getTopology();
    TopologyEvent event = new TopologyEvent(Type.TOPOLOGY_INIT, null,
            newView);
    for (final TopologyEventListener da : registeredServices) {
        sendTopologyEvent(da, event);
    }
    activated = true;
    oldView = newView;
}
{code}
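To make the "leaked reference" point concrete, here is a minimal sketch of the pattern, not the actual Sling code. The {{Handler}} and {{Service}} classes are hypothetical stand-ins for the heartbeat handler and the discovery service; the only thing they share with the snippet above is that {{this}} is handed out before {{activated}} is set.

```java
class EscapingThisSketch {
    /** Hypothetical collaborator, standing in for the heartbeat handler. */
    static class Handler {
        boolean sawActivated = true;

        void initialize(Service service) {
            // Calls back into the service while activate() is still running.
            sawActivated = service.isActivated();
        }
    }

    /** Hypothetical service, standing in for the discovery service. */
    static class Service {
        private final Handler handler;
        private volatile boolean activated;

        Service(Handler handler) {
            this.handler = handler;
        }

        void activate() {
            handler.initialize(this); // 'this' escapes before activation completes
            activated = true;
        }

        boolean isActivated() {
            return activated;
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        Service service = new Service(handler);
        service.activate();
        System.out.println("handler saw activated = " + handler.sawActivated);
        System.out.println("activated after activate() = " + service.isActivated());
    }
}
```

In this sketch the collaborator observes the service in its not-yet-activated state, which is the kind of window the comment is concerned about.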
The deadlock itself is a lock ordering issue:
- in thread "pool-5-thread-1" the HeartbeatHandler wants to issue an update; it
holds the DiscoveryServiceImpl.lock lock but can't acquire the
SegmentNodeStoreService lock
- in thread "CM Event Dispatcher..." the SegmentNodeStoreService holds its own
lock and the call stack ends up trying to invoke
DiscoveryServiceImpl.bindTopologyEventListener, which needs the
DiscoveryServiceImpl.lock
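The two bullets above describe a classic opposite-order acquisition. A minimal sketch of that pattern follows; {{discoveryLock}} and {{nodeStoreLock}} are hypothetical stand-ins for the two locks, and the code uses {{tryLock}} with a timeout (an assumption made purely so the example terminates instead of hanging the way the real threads do).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderingSketch {
    // Hypothetical stand-ins for DiscoveryServiceImpl.lock and the SegmentNodeStoreService lock.
    static final ReentrantLock discoveryLock = new ReentrantLock();
    static final ReentrantLock nodeStoreLock = new ReentrantLock();

    /** Holds 'first', waits until the other thread holds its own first lock, then tries 'second'. */
    static boolean holdFirstThenTrySecond(ReentrantLock first, ReentrantLock second,
            CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With plain lock() this call would block forever.
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            // Keep holding 'first' until the other thread's attempt has finished too.
            bothTried.countDown();
            bothTried.await();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns whether each thread managed to take its second lock. */
    static boolean[] run() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];
        Thread heartbeat = new Thread(() -> acquired[0] =
                holdFirstThenTrySecond(discoveryLock, nodeStoreLock, bothHoldFirst, bothTried),
                "pool-5-thread-1");
        Thread dispatcher = new Thread(() -> acquired[1] =
                holdFirstThenTrySecond(nodeStoreLock, discoveryLock, bothHoldFirst, bothTried),
                "CM Event Dispatcher");
        heartbeat.start();
        dispatcher.start();
        try {
            heartbeat.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired;
    }

    public static void main(String[] args) {
        boolean[] acquired = run();
        System.out.println("heartbeat got node store lock: " + acquired[0]);
        System.out.println("dispatcher got discovery lock: " + acquired[1]);
    }
}
```

Neither thread gets its second lock in this sketch, because each one is waiting on a lock the other still holds. Acquiring the locks in the same order on both paths, or not holding one while calling into the other component, removes the cycle.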
I wonder whether we need more fine-grained locking in the DiscoveryServiceImpl.
A single lock object seems too coarse-grained, especially since a lot seems to
happen during calls like updateProperties(), including invocation of foreign
code (notifying event listeners), which is a bit worrisome: invoking foreign
code with locks held is prone to deadlocks.
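One common way to avoid calling foreign code with a lock held is to copy the listener list under the lock and notify outside it. The following is only a sketch of that idea under assumed names ({{OpenCallSketch}}, {{Listener}}, {{bind}}, {{fire}} are all hypothetical), not a proposed patch for DiscoveryServiceImpl.

```java
import java.util.ArrayList;
import java.util.List;

class OpenCallSketch {
    /** Hypothetical listener type, standing in for a topology event listener. */
    interface Listener {
        void handle(String event);
    }

    private final Object lock = new Object();
    private final List<Listener> listeners = new ArrayList<>();

    void bind(Listener listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    void fire(String event) {
        final List<Listener> snapshot;
        synchronized (lock) {
            // Only the copy happens under the lock.
            snapshot = new ArrayList<>(listeners);
        }
        // Foreign code runs without the lock, so it can call bind() or take other locks.
        for (Listener listener : snapshot) {
            listener.handle(event);
        }
    }

    /** Exposed only so the example can check whether the lock is held during a callback. */
    boolean lockHeldByCurrentThread() {
        return Thread.holdsLock(lock);
    }

    public static void main(String[] args) {
        OpenCallSketch service = new OpenCallSketch();
        service.bind(event -> System.out.println(
                event + " delivered, lock held = " + service.lockHeldByCurrentThread()));
        service.fire("TOPOLOGY_INIT");
    }
}
```

The trade-off is that a listener may still receive an event shortly after it was unbound, and state changes such as setting a flag are no longer atomic with the notification, so this needs care around event ordering.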
Another alternative is to make use of concurrent collections for e.g.
event listeners, but I'm not sure we don't get bitten by the fact that they are
weakly consistent.
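As an illustration of that concern, here is a small sketch using {{java.util.concurrent.CopyOnWriteArrayList}}. The choice of collection is an assumption, since the comment does not name one. Its iterators work on a snapshot, so a listener that binds while an event is being dispatched does not see that event. The late binding is done on the same thread here only to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class SnapshotIterationSketch {
    /** Returns { listeners notified during dispatch, listeners registered afterwards }. */
    static int[] demo() {
        List<String> listeners = new CopyOnWriteArrayList<>();
        listeners.add("listener-1");

        List<String> notified = new ArrayList<>();
        for (String name : listeners) {
            notified.add(name);
            // A listener binds while dispatch is in progress; the running
            // iteration works on the earlier snapshot and does not see it.
            listeners.add("late-listener");
        }
        return new int[] { notified.size(), listeners.size() };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("notified: " + result[0] + ", registered: " + result[1]);
    }
}
```

In this sketch one listener is notified while two end up registered, so the late listener would need some other way to receive its initial event.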
> Deadlock involving discovery services at startup with Oak
> ---------------------------------------------------------
>
> Key: SLING-4061
> URL: https://issues.apache.org/jira/browse/SLING-4061
> Project: Sling
> Issue Type: Bug
> Components: Extensions
> Reporter: Bertrand Delacretaz
> Attachments: discovery-deadlock.txt
>
>
> I just got a deadlock at startup when starting the launchpad integration
> tests instance on sling trunk revision 1632058 (so starting with Oak):
> {code}
> export DBG="-Xmx1G -XX:MaxPermSize=256m
> -agentlib:jdwp=transport=dt_socket,address=30303,server=y,suspend=n"
> export MAVEN_OPTS="-Xmx1G -XX:MaxPermSize=256m $DBG -Dsling.run.modes=oak"
> cd launchpad/testing
> mvn launchpad:run
> {code}
> I'll attach the stack trace. The discovery HeartbeatHandler, and
> DiscoveryServiceImpl classes are involved.
> The deadlock happens often on my box (macosx 10.9.5, java version
> "1.7.0_45"), with the same deadlock pattern AFAICS.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)