Stefan Egli created SLING-5280:
----------------------------------

             Summary: HeartbeatHandler can also be blocked while holding 
AnnouncementRegistry lock
                 Key: SLING-5280
                 URL: https://issues.apache.org/jira/browse/SLING-5280
             Project: Sling
          Issue Type: Bug
          Components: Extensions
    Affects Versions: Discovery Impl 1.2.0
            Reporter: Stefan Egli
            Assignee: Stefan Egli
            Priority: Critical
             Fix For: Discovery Impl 1.2.2


SLING-5195 introduces a separate thread that periodically checks if the local 
HeartbeatHandler is properly functional and has written a heartbeat 'recently 
enough'. If not, it marks the local topology as TOPOLOGY_CHANGING and listeners 
are thus stepping back correctly (as the local instance is likely to be treated 
as dead by other instances in the local cluster soon).

Now this works fine as long as the new, separate thread can operate 
uninterruptively. But besides the necessity that it is properly scheduled 
regularly (which it might not, thus we should probably rather use a dedicated 
thread than rely on the scheduler's pool to be available) it also requires that 
the check can go ahead without running into a locked monitor.

https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1960/testReport/org.apache.sling.discovery.impl.cluster/RepositoryDelaysTest/testSlowSessionSaves/
 however reports a situation where the blocked commit was not in 
{{HeartbeatHandler.issueClusterLocalHeartbeat}} - but was in 
{{AnnouncementRegistry.checkExpiredAnnouncements}} - and that one does a 
{{synchronized(this)}} - which in turn blocks precisely this 2nd 
HeartbeatHandler's monitoring thread - yielding it useless..

The following log excerpt (grepping thread-3/-4/main) proofs the above 
described scenario:
{code}
06.11.2015 23:35:22.864 *INFO * [main] RepositoryDelaysTest: <main> both 
instances marked as delaying 1min - but with new background checks we should go 
changing within 3sec
06.11.2015 23:35:22.870 *DEBUG* [pool-1-thread-4] HeartbeatHandler: 
issueConnectorPings: not issuing remote heartbeat yet, startup not yet finished
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
delaying [secondInstance] 'pre.commit' for 60000ms...
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
delaying [secondInstance] 'pre.commit' now for 60000ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ViewHelper: 
getEstablishedView: no established view found: 
/var/discovery/clusterC/establishedView
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ClusterViewServiceImpl: 
getClusterView: no view established at the moment. isolated mode
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] BaseDiscoveryService: 
getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] 
org.apache.sling.discovery.base.commons.UndefinedClusterViewException: no 
established view at the moment
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
asyncDelay.run: done delaying. new view (still/again) not current, delaying 
again
06.11.2015 23:35:23.613 *TRACE* [pool-1-thread-3] MinEventDelayHandler: 
runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] ClusterViewServiceImpl: 
getClusterView: the local instance (1b6e7530-0d41-46e7-8471-6ea7f5d3c2bc) is 
currently not included in the existing established view! This is normal at 
startup. At other times is pseudo-network-partitioning is an indicator for 
repository/network-delays or clocks-out-of-sync (SLING-3432). (increasing the 
heartbeatTimeout can help as a workaround too) The local instance will stay in 
TOPOLOGY_CHANGING or pre _INIT mode until a new vote was successful.
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] BaseDiscoveryService: 
getTopology: undefined cluster view: ISOLATED_FROM_TOPOLOGY] 
org.apache.sling.discovery.base.commons.UndefinedClusterViewException: 
established view does not include local instance - isolated
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
asyncDelay.run: done delaying. new view (still/again) not current, delaying 
again
06.11.2015 23:35:23.705 *TRACE* [pool-1-thread-3] MinEventDelayHandler: 
runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.840 *DEBUG* [pool-1-thread-3] HeartbeatHandler: 
checkForTopologyChange/.run: going to check for topology change...
06.11.2015 23:35:30.865 *DEBUG* [main] ClusterViewServiceImpl: getClusterView: 
current establishedView is marked as invalid: 
d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:35:30.865 *INFO * [main] BaseDiscoveryService: getTopology: 
undefined cluster view: NO_ESTABLISHED_VIEW] 
org.apache.sling.discovery.base.commons.UndefinedClusterViewException: current 
established view was marked as invalid
06.11.2015 23:36:22.786 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
ArtificialDelay: delay: delaying [firstInstance]'pre.commit' for 59876ms done.
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:36:22.871 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
delaying [secondInstance]'pre.commit' for 60000ms done.
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: start. slingId: 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: 
start. localClusterView: a ClusterView[2 instances: 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 
instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] VotingHelper: listVotings: 
votings found: 0
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: no ongoing votings at the 
moment. done.
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: handling clusterInstance: 
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: instance has announcements: 
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: handling clusterInstance: 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: matched localInstance, filling with cache: 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
doCheckView: established view does not match. (details: old: 
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5, old: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb)
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
doCheckViewWith: no matching established view, marking topology as changing
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
invalidateCurrentEstablishedView: invalidating 
slingId=8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
lastEstablishedViewId=d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] ClusterViewServiceImpl: 
invalidateEstablishedViewId: marking established view as invalid: 
d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *DEBUG* [pool-1-thread-4] DiscoveryServiceImpl: 
handleTopologyChanging: invoking viewStateManager.handlechanging
06.11.2015 23:36:22.872 *TRACE* [pool-1-thread-4] ViewStateManagerImpl: 
handleChanging: start
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: result: 0
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listInstances: 
announcements added. end. instances: []
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: 
listInstances: start. localClusterView: a ClusterView[2 instances: 
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: 
listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 
instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.874 *INFO * [pool-1-thread-4] ViewStateManagerImpl: 
enqueueForAll: sending topologyEvent TopologyEvent [type=TOPOLOGY_CHANGING, 
oldView=DefaultTopologyView[current=false, num=2, 
instances=8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5[local=false,leader=false],8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb[local=true,leader=true]],
 newView=null], to all (0) listeners
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to