[jira] [Commented] (SLING-4061) Deadlock involving discovery services at startup with Oak

Robert Munteanu (JIRA) Wed, 15 Oct 2014 08:24:32 -0700

    [ 
https://issues.apache.org/jira/browse/SLING-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172469#comment-14172469
 ]


Robert Munteanu commented on SLING-4061:
----------------------------------------

There is an issue in the {{DiscoveryServiceImpl}} {{activate()}} method, but 
I'm not sure it's the root cause. The class leaks a reference to itself before 
the {{activate()}} method completes:

{code:java}
        // make sure the first heartbeat is issued as soon as possible - which
        // is right after this service starts. since the two (discoveryservice
        // and heartbeatHandler need to know each other, the discoveryservice
        // is passed on to the heartbeatHandler in this initialize call).
        heartbeatHandler.initialize(this,
                clusterViewService.getIsolatedClusterViewId());

        final TopologyEventListener[] registeredServices;
        synchronized (lock) {
            registeredServices = this.eventListeners;
            doUpdateProperties();

            TopologyViewImpl newView = (TopologyViewImpl) getTopology();
            TopologyEvent event = new TopologyEvent(Type.TOPOLOGY_INIT, null,
                    newView);
            for (final TopologyEventListener da : registeredServices) {
                sendTopologyEvent(da, event);
            }
            activated = true;
            oldView = newView;
        }
{code}

The deadlock itself is a lock ordering issue:

- in thread "pool-5-thread-1" the HeartbeatHandler wants to issue an update; 
the thread holds the DiscoveryServiceImpl.lock but can't acquire the 
SegmentNodeStoreService lock
- in thread "CM Event Dispatcher..." the SegmentNodeStoreService holds its own 
lock and the call stack ends up trying to invoke 
DiscoveryServiceImpl.bindTopologyEventListener, which needs the 
DiscoveryServiceImpl.lock
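
To make the ordering problem concrete, here is a minimal sketch, not the Sling 
code: two threads take two locks in opposite order. The class name and both 
lock names are hypothetical stand-ins, and I use a timed tryLock() instead of 
synchronized so that the demo reports the blocked threads instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderingSketch {

    /**
     * Runs two threads that take the same two locks in opposite order and
     * returns how many of them failed to obtain their second lock.
     */
    static int countBlockedThreads() {
        // Hypothetical stand-ins for DiscoveryServiceImpl.lock and the
        // SegmentNodeStoreService lock.
        final ReentrantLock discoveryLock = new ReentrantLock();
        final ReentrantLock nodeStoreLock = new ReentrantLock();
        final CyclicBarrier barrier = new CyclicBarrier(2);
        final AtomicInteger blocked = new AtomicInteger();

        Thread heartbeat = new Thread(
                task(discoveryLock, nodeStoreLock, barrier, blocked), "heartbeat");
        Thread dispatcher = new Thread(
                task(nodeStoreLock, discoveryLock, barrier, blocked), "dispatcher");
        heartbeat.start();
        dispatcher.start();
        try {
            heartbeat.join();
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked.get();
    }

    private static Runnable task(final ReentrantLock first, final ReentrantLock second,
            final CyclicBarrier barrier, final AtomicInteger blocked) {
        return new Runnable() {
            public void run() {
                first.lock();
                try {
                    barrier.await(); // both threads now hold their first lock
                    // With synchronized or a plain lock() this would block forever.
                    if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                        second.unlock();
                    } else {
                        blocked.incrementAndGet();
                    }
                    barrier.await(); // keep the first lock until both attempts are done
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                } finally {
                    first.unlock();
                }
            }
        };
    }
}
```

Neither thread can obtain its second lock while the other still holds it, 
which is the same shape as the two stack traces above.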

I wonder whether we need more fine-grained locking in the DiscoveryServiceImpl 
- a single lock object seems too coarse-grained, especially since a lot seems 
to happen during calls like updateProperties(), including invocation of 
foreign code (notifying event listeners), which is a bit worrisome - invoking 
foreign code with locks held is prone to deadlocks.
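
One way to avoid calling foreign code under the lock is to copy the listeners 
while holding it and notify them after releasing it. The sketch below only 
illustrates that idea; the class, the Listener interface and all method names 
are hypothetical and do not match the actual discovery code.

```java
import java.util.ArrayList;
import java.util.List;

class ListenerNotifierSketch {

    /** Hypothetical stand-in for TopologyEventListener. */
    interface Listener {
        void handle(String event);
    }

    private final Object lock = new Object();
    private final List<Listener> listeners = new ArrayList<Listener>();

    void bind(Listener listener) {
        synchronized (lock) {
            listeners.add(listener);
        }
    }

    void fire(String event) {
        final List<Listener> snapshot;
        synchronized (lock) {
            // Only internal state is touched while the lock is held.
            snapshot = new ArrayList<Listener>(listeners);
        }
        // Foreign code runs after the lock has been released.
        for (Listener listener : snapshot) {
            listener.handle(event);
        }
    }

    /** True if the calling thread holds the internal lock; used to check the sketch. */
    boolean lockHeldByCaller() {
        return Thread.holdsLock(lock);
    }
}
```

The trade-off is that an event is no longer delivered atomically with the 
state change that caused it, so fields like activated and oldView would need 
a second look before doing this in activate().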

Another alternative is to make use of concurrent collections, e.g. for the 
event listeners, but I'm not sure we won't get bitten by the fact that they 
are weakly consistent.
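
To show what that consistency caveat means in practice, here is a small 
sketch that assumes the listeners are kept in a CopyOnWriteArrayList, whose 
iterators work on a snapshot of the list. The class and method names are 
made up for the example; a listener bound during a notification round is only 
seen in the next round.

```java
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

class CowListenersSketch {

    /** Returns how often the late listener ran in round one and in round two. */
    static int[] lateListenerCallsPerRound() {
        final CopyOnWriteArrayList<Runnable> listeners =
                new CopyOnWriteArrayList<Runnable>();
        final AtomicInteger lateCalls = new AtomicInteger();
        final Runnable late = new Runnable() {
            public void run() {
                lateCalls.incrementAndGet();
            }
        };
        // This listener binds another listener while it is being notified.
        listeners.add(new Runnable() {
            public void run() {
                listeners.addIfAbsent(late);
            }
        });

        int[] calls = new int[2];
        for (int round = 0; round < 2; round++) {
            int before = lateCalls.get();
            // Iterates over a snapshot of the list; no lock is held here.
            for (Runnable listener : listeners) {
                listener.run();
            }
            calls[round] = lateCalls.get() - before;
        }
        return calls;
    }
}
```

In the discovery case this would mean a listener bound while TOPOLOGY_INIT is 
being sent could miss that event unless it is handled separately on bind.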


> Deadlock involving discovery services at startup with Oak
> ---------------------------------------------------------
>
>                 Key: SLING-4061
>                 URL: https://issues.apache.org/jira/browse/SLING-4061
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Bertrand Delacretaz
>         Attachments: discovery-deadlock.txt
>
>
> I just got a deadlock at startup when starting the launchpad integration 
> tests instance on sling trunk revision 1632058 (so starting with Oak):
> {code}
> export DBG="-Xmx1G  -XX:MaxPermSize=256m 
> -agentlib:jdwp=transport=dt_socket,address=30303,server=y,suspend=n"
> export MAVEN_OPTS="-Xmx1G  -XX:MaxPermSize=256m $DBG -Dsling.run.modes=oak"
> cd launchpad/testing
> mvn launchpad:run
> {code}
> I'll attach the stack trace. The discovery HeartbeatHandler, and 
> DiscoveryServiceImpl classes are involved.
> The deadlock happens often on my box (macosx 10.9.5, java version 
> "1.7.0_45"), with the same deadlock pattern AFAICS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
