Stefan Egli created SLING-5280:
----------------------------------
Summary: HeartbeatHandler can also be blocked while holding
AnnouncementRegistry lock
Key: SLING-5280
URL: https://issues.apache.org/jira/browse/SLING-5280
Project: Sling
Issue Type: Bug
Components: Extensions
Affects Versions: Discovery Impl 1.2.0
Reporter: Stefan Egli
Assignee: Stefan Egli
Priority: Critical
Fix For: Discovery Impl 1.2.2
SLING-5195 introduces a separate thread that periodically checks if the local
HeartbeatHandler is properly functional and has written a heartbeat 'recently
enough'. If not, it marks the local topology as TOPOLOGY_CHANGING and listeners
are thus stepping back correctly (as the local instance is likely to be treated
as dead by other instances in the local cluster soon).
Now this works fine as long as the new, separate thread can operate
uninterruptively. But besides the necessity that it is properly scheduled
regularly (which it might not, thus we should probably rather use a dedicated
thread than rely on the scheduler's pool to be available) it also requires that
the check can go ahead without running into a locked monitor.
https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1960/testReport/org.apache.sling.discovery.impl.cluster/RepositoryDelaysTest/testSlowSessionSaves/
however reports a situation where the blocked commit was not in
{{HeartbeatHandler.issueClusterLocalHeartbeat}} - but was in
{{AnnouncementRegistry.checkExpiredAnnouncements}} - and that one does a
{{synchronized(this)}} - which in turn blocks precisely this 2nd
HeartbeatHandler's monitoring thread - yielding it useless..
The following log excerpt (grepping thread-3/-4/main) proofs the above
described scenario:
{code}
06.11.2015 23:35:22.864 *INFO * [main] RepositoryDelaysTest: <main> both
instances marked as delaying 1min - but with new background checks we should go
changing within 3sec
06.11.2015 23:35:22.870 *DEBUG* [pool-1-thread-4] HeartbeatHandler:
issueConnectorPings: not issuing remote heartbeat yet, startup not yet finished
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay:
delaying [secondInstance] 'pre.commit' for 60000ms...
06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay:
delaying [secondInstance] 'pre.commit' now for 60000ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]]
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]]
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ViewHelper:
getEstablishedView: no established view found:
/var/discovery/clusterC/establishedView
06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ClusterViewServiceImpl:
getClusterView: no view established at the moment. isolated mode
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] BaseDiscoveryService:
getTopology: undefined cluster view: NO_ESTABLISHED_VIEW]
org.apache.sling.discovery.base.commons.UndefinedClusterViewException: no
established view at the moment
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler:
asyncDelay.run: done delaying. new view (still/again) not current, delaying
again
06.11.2015 23:35:23.613 *TRACE* [pool-1-thread-3] MinEventDelayHandler:
runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler:
triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] ClusterViewServiceImpl:
getClusterView: the local instance (1b6e7530-0d41-46e7-8471-6ea7f5d3c2bc) is
currently not included in the existing established view! This is normal at
startup. At other times is pseudo-network-partitioning is an indicator for
repository/network-delays or clocks-out-of-sync (SLING-3432). (increasing the
heartbeatTimeout can help as a workaround too) The local instance will stay in
TOPOLOGY_CHANGING or pre _INIT mode until a new vote was successful.
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] BaseDiscoveryService:
getTopology: undefined cluster view: ISOLATED_FROM_TOPOLOGY]
org.apache.sling.discovery.base.commons.UndefinedClusterViewException:
established view does not include local instance - isolated
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler:
asyncDelay.run: done delaying. new view (still/again) not current, delaying
again
06.11.2015 23:35:23.705 *TRACE* [pool-1-thread-3] MinEventDelayHandler:
runAfter: trying with scheduler.fireJob
06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler:
triggerAsyncDelaying: asynch delaying of 1 triggered: true
06.11.2015 23:35:23.840 *DEBUG* [pool-1-thread-3] HeartbeatHandler:
checkForTopologyChange/.run: going to check for topology change...
06.11.2015 23:35:30.865 *DEBUG* [main] ClusterViewServiceImpl: getClusterView:
current establishedView is marked as invalid:
d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:35:30.865 *INFO * [main] BaseDiscoveryService: getTopology:
undefined cluster view: NO_ESTABLISHED_VIEW]
org.apache.sling.discovery.base.commons.UndefinedClusterViewException: current
established view was marked as invalid
06.11.2015 23:36:22.786 *INFO * [Test-ViewCheckerRunner [firstInstance]]
ArtificialDelay: delay: delaying [firstInstance]'pre.commit' for 59876ms done.
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]]
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]]
ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 59876ms...
06.11.2015 23:36:22.871 *INFO * [pool-1-thread-4] ArtificialDelay: delay:
delaying [secondInstance]'pre.commit' for 60000ms done.
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4]
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: start. slingId:
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: listInstances:
start. localClusterView: a ClusterView[2 instances:
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2
instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb,
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] VotingHelper: listVotings:
votings found: 0
06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4]
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: no ongoing votings at the
moment. done.
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: handling clusterInstance:
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: instance has announcements:
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: handling clusterInstance:
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: matched localInstance, filling with cache:
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler:
doCheckView: established view does not match. (details: old:
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5, old: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb)
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler:
doCheckViewWith: no matching established view, marking topology as changing
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler:
invalidateCurrentEstablishedView: invalidating
slingId=8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb,
lastEstablishedViewId=d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] ClusterViewServiceImpl:
invalidateEstablishedViewId: marking established view as invalid:
d5f1101c-f693-44a4-9c79-f7277230591f
06.11.2015 23:36:22.872 *DEBUG* [pool-1-thread-4] DiscoveryServiceImpl:
handleTopologyChanging: invoking viewStateManager.handlechanging
06.11.2015 23:36:22.872 *TRACE* [pool-1-thread-4] ViewStateManagerImpl:
handleChanging: start
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: result: 0
06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: listInstances:
announcements added. end. instances: []
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl:
listInstances: start. localClusterView: a ClusterView[2 instances:
8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl:
listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2
instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb,
8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
06.11.2015 23:36:22.874 *INFO * [pool-1-thread-4] ViewStateManagerImpl:
enqueueForAll: sending topologyEvent TopologyEvent [type=TOPOLOGY_CHANGING,
oldView=DefaultTopologyView[current=false, num=2,
instances=8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5[local=false,leader=false],8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb[local=true,leader=true]],
newView=null], to all (0) listeners
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)