[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504480#comment-17504480 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

* merged [discovery.oak PR#4|https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/4]

> Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
> ---------------------------------------------------------------------------------------------
>
> Key: SLING-10489
> URL: https://issues.apache.org/jira/browse/SLING-10489
> Project: Sling
> Issue Type: Improvement
> Components: Discovery
> Affects Versions: Discovery Commons 1.0.24, Discovery Base 2.0.10, Discovery Oak 1.2.34
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Priority: Major
> Fix For: Discovery Base 2.0.12, Discovery Commons 1.0.26, Discovery Oak 1.2.36
>
> Time Spent: 8h 10m
> Remaining Estimate: 0h
>
> Discovery.oak requires that both Oak and Sling are operating normally in
> order to declare victory and announce a new topology.
> The startup phase is especially tricky in this regard, since there are
> multiple elements that need to get updated (some are in the Oak layer, some
> in Sling):
> * lease & clusterNodeId : this is maintained by Oak
> * idMap : this is maintained by IdMapService (Sling)
> * leaderElectionId : this is maintained by OakViewChecker (Sling)
> * syncToken : this is maintained by SyncTokenService (Sling)
> Situations have been seen where Oak starts up fine, but higher-level (e.g.
> Sling) bundles were not activated within a reasonable amount of time. This
> led to discovery staying in TOPOLOGY_CHANGING state for longer than expected.
> There should be a mechanism that ignores (suppresses) newly joining instances
> if they start up only partially. However, after a certain timeout this
> mechanism should give up.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
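The suppression mechanism requested in the issue description can be illustrated with a minimal sketch. This is not the discovery.oak implementation: the names `InstanceState`, `PartialStartupSuppressor` and `visible_instances` are hypothetical, and the four boolean flags are only assumed stand-ins for the four elements listed in the description (lease & clusterNodeId, idMap, leaderElectionId, syncToken). The idea shown is that a newly joining instance is left out of the view until all four elements are present, or until the timeout expires.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class InstanceState:
    """Hypothetical snapshot of what an instance has written so far."""
    has_lease: bool = False               # lease & clusterNodeId (Oak)
    has_id_map_entry: bool = False        # idMap (IdMapService)
    has_leader_election_id: bool = False  # leaderElectionId (OakViewChecker)
    has_sync_token: bool = False          # syncToken (SyncTokenService)

    def fully_started(self) -> bool:
        return (self.has_lease and self.has_id_map_entry
                and self.has_leader_election_id and self.has_sync_token)


class PartialStartupSuppressor:
    """Hides partially started joiners until they finish or a timeout expires."""

    def __init__(self, timeout_seconds: Optional[float],
                 clock: Callable[[], float] = time.monotonic):
        # timeout_seconds=None is an assumption meaning "never give up"
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.first_seen: Dict[str, float] = {}

    def visible_instances(self, current_members: Set[str],
                          candidates: Dict[str, InstanceState]) -> Set[str]:
        now = self.clock()
        visible: Set[str] = set()
        for instance_id, state in candidates.items():
            # Only newly joining instances are candidates for suppression.
            if instance_id in current_members or state.fully_started():
                self.first_seen.pop(instance_id, None)
                visible.add(instance_id)
                continue
            # Partially started joiner: remember when it was first seen.
            started = self.first_seen.setdefault(instance_id, now)
            if (self.timeout_seconds is not None
                    and now - started >= self.timeout_seconds):
                visible.add(instance_id)  # give up suppressing
        return visible
```

The clock is injectable so that the timeout behaviour can be exercised without actually waiting.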
[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503598#comment-17503598 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

* started release vote for discovery.commons 1.0.26 and discovery.base 2.0.12
[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403834#comment-17403834 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

* created [discovery.commons PR#4|https://github.com/apache/sling-org-apache-sling-discovery-commons/pull/4] - which improves partial-startup-suppression stability and introduces the LogSilencer
* updated [discovery.oak PR#4|https://github.com/apache/sling-org-apache-sling-discovery-oak/pull/4] - which now doesn't build because it wants the LogSilencer from above PR .. (so these 2 bundles need to be built together and travis can't know that)
[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403828#comment-17403828 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

* merged a minor [discovery.base PR#6|https://github.com/apache/sling-org-apache-sling-discovery-base/pull/6]
[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376522#comment-17376522 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

* merged [discovery.commons PR|https://github.com/apache/sling-org-apache-sling-discovery-commons/pull/3]
[jira] [Commented] (SLING-10489) Ignore partially started, newly joining instances to avoid disturbing discovery (for a while)
[ https://issues.apache.org/jira/browse/SLING-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366195#comment-17366195 ]

Stefan Egli commented on SLING-10489:
-------------------------------------

PRs updated with the following:

* 'joiner delay' introduced: when an instance joins a discovery cluster with existing members, it waits before sending the first TOPOLOGY_INIT (by default 30 sec). This helps reduce a race condition with partial startup suppression. Consider instances A and B up. Then B dies and, simultaneously, C starts up partially. In this case A will suppress C but at the same time notice that B has left, and thus make a topology change with B leaving, i.e. it will write the new sync token as per this new state. For C this means that as soon as it finishes its startup successfully, it would notice the sync token of A already written and would immediately send a TOPOLOGY_INIT with (C and A). However, A might not be so fast: A still thinks the topology is just (A). Once it notices that C joined (a check that happens once every second), it will include C and declare a topology with (A and C). But there is a small time window in which different cluster instances could have different views of the topology. To avoid this, the joiner C adds an artificial delay (and stays without any topology in the meantime) to give A enough time to read C's sync token.
* the default timeout of the partial startup suppression was changed to infinity: there's no reason to stop suppressing a partial startup. If that instance doesn't write, e.g., its syncToken, then it doesn't belong to the topology, no matter how long that takes. Having a timeout would be a compromise to eventually acknowledge that another instance is joining, but realistically that instance should be considered not part of the topology until it finishes the startup.
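The joiner delay described in the first bullet can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PRs: the class `JoinerDelay` and the method `may_send_topology_init` are hypothetical names, and the rule "no delay when there are no other members" is a reading of "joins a discovery cluster with existing members". The 30-second default is taken from the comment.

```python
import time
from typing import Callable, Optional, Set


class JoinerDelay:
    """Holds back the first TOPOLOGY_INIT of an instance joining an existing cluster."""

    def __init__(self, delay_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock
        self.joined_at: Optional[float] = None  # when the delay started
        self.init_allowed = False

    def may_send_topology_init(self, local_id: str, members: Set[str]) -> bool:
        if self.init_allowed:
            return True
        if not (members - {local_id}):
            # Assumption: alone in the cluster, so there is nobody whose
            # view could diverge and no reason to wait.
            self.init_allowed = True
            return True
        now = self.clock()
        if self.joined_at is None:
            self.joined_at = now
        # Stay without a topology until the existing members have had
        # time to read this instance's sync token.
        if now - self.joined_at >= self.delay_seconds:
            self.init_allowed = True
        return self.init_allowed
```

In the A/B/C scenario above, C would call `may_send_topology_init("C", {"A", "C"})` on each check and get `False` until the delay has elapsed, which gives A time to include C in its own view first.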