On Sat, 2019-06-29 at 03:01 +0000, Harvey Shepherd wrote:
> Thank you so much Ken, your explanation of the crm_simulate output is
> really helpful. Regarding your suggestion of setting a
> migration-threshold of 1 for the king resource, I did in fact have
> that in place as a workaround. But ideally I don't want the failed
> instance to be delayed from restarting by having to time out its
> failure - I'd just like the resource to failover and restart a new
> slave immediately, That's because on my system there is a hit to
> performance if the slave instance is not running.
>
> My suspicion is that Pacemaker is trying to do the right thing, but
> is failing either because the operation is timing out, or because it
> is getting confused in some way due to the colocation and ordering
> constraints placing dependencies between the servant resources and
> the master king resource. Either of those possibilities might explain
> why I see logs like these during the eight or so attempts that
> Pacemaker makes to perform a failover after the king master resource
> fails.
>
> Jun 29 02:33:03 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=0, Pending=1, Fired=3, Skipped=0, Incomplete=61, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:03 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=2, Pending=1, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:03 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=3, Pending=1, Fired=1, Skipped=0, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:03 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=4, Pending=1, Fired=1, Skipped=0, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=5, Pending=1, Fired=3, Skipped=0, Incomplete=29, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=7, Pending=1, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=8, Pending=1, Fired=1, Skipped=0, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
You can see the "Complete" counter going up in each message above.
These are actions in the transition completing (successfully, otherwise
there would be messages about failures).

> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (abort_transition_graph) notice: Transition 10 aborted by deletion of nvpair[@id='status-1-master-king_resource']: Transient attribute change | cib=0.4.208 source=abort_unless_down:345 path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-master-king_resource'] complete=false

Transitions are aborted anytime new information comes in, in this case
one of the master attributes changed. It's not a problem in any way,
pacemaker will simply recalculate what still needs to be done, taking
into account the actions that have already completed.

> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=8, Pending=1, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=9, Pending=0, Fired=3, Skipped=12, Incomplete=108, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=12, Pending=0, Fired=1, Skipped=14, Incomplete=107, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=13, Pending=1, Fired=1, Skipped=0, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) debug: Transition 10 (Complete=14, Pending=0, Fired=2, Skipped=14, Incomplete=104, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): In-progress
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (run_graph) notice: Transition 10 (Complete=16, Pending=0, Fired=0, Skipped=15, Incomplete=104, Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Stopped
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (te_graph_trigger) debug: Transition 10 is now complete
> Jun 29 02:33:04 ctr_qemu pacemaker-controld [1224] (notify_crmd) debug: Transition 10 status: restart - Transient attribute change

Even though the transition aborted, you still see actions being
completed -- these are actions that were already initiated when the
transition was aborted, so we were waiting on their results before
proceeding.

> As you can see, it eventually gives up in the transition attempt and
> starts a new one. Eventually the failed king resource master has had
> time to come back online and it then just promotes it again and
> forgets about trying to failover. I'm not sure if the cluster
> transition actions listed by crm_simulate are in the order in which
> Pacemaker tries to carry out the operations, but if so the order is

The "transition summary" is just a resource-by-resource list, not the
order things will be done. The "executing cluster transition" section
is the order things are being done.

> wrong. It should be stopping all servant resources on the failed king
> master, then failing over the king resource, then migrating the
> servant resources to the new master node.
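(As an aside, you can replay exactly what the scheduler planned at that
moment by feeding the pe-input file named in those logs back into
crm_simulate. A rough sketch, assuming the file is still on disk and
that these option spellings match your Pacemaker 2.0.x crm_simulate:

    # Re-run the scheduler on the saved input, show the scores it used,
    # and simulate execution so the ordered "Executing cluster transition"
    # list and the resulting status are printed.
    crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-10.bz2 \
                 --show-scores --simulate

    # The same calculation against the live CIB instead of a saved file:
    crm_simulate --live-check --show-scores --simulate

The "Executing cluster transition" section of that output is the actual
action order, as noted above.)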
Instead it seems to be > trying to migrate all the servant resources over first, with the king > master failover scheduled near the bottom, which won't work due to > the colocation constraint with the king master. > > Current cluster status: > Online: [ primary secondary ] > > stk_shared_ip (ocf::heartbeat:IPaddr2): Started secondary > Clone Set: ms_king_resource [king_resource] (promotable) > king_resource (ocf::aviat:king-resource-ocf): FAILED > primary > Slaves: [ secondary ] > Clone Set: ms_servant1 [servant1] > Started: [ primary secondary ] > Clone Set: ms_servant2 [servant2] (promotable) > Masters: [ primary ] > Slaves: [ secondary ] > Clone Set: ms_servant3 [servant3] (promotable) > Masters: [ primary ] > Slaves: [ secondary ] > servant4 (lsb:servant4): Started primary > servant5 (lsb:servant5): Started primary > servant6 (lsb:servant6): Started primary > servant7 (lsb:servant7): Started primary > servant8 (lsb:servant8): Started primary > Resource Group: servant9_active_disabled > servant9_resource1 (lsb:servant9_resource1): Started > primary > servant9_resource2 (lsb:servant9_resource2): Started primary > servant10 (lsb:servant10): Started primary > servant11 (lsb:servant11): Started primary > servant12 (lsb:servant12): Started primary > servant13 (lsb:servant13): Started primary > > Transition Summary: > * Recover king_resource:0 ( Slave primary ) > * Promote king_resource:1 ( Slave -> Master secondary ) > * Demote servant2:0 ( Master -> Slave primary ) > * Promote servant2:1 ( Slave -> Master secondary ) > * Demote servant3:0 ( Master -> Slave primary ) > * Promote servant3:1 ( Slave -> Master secondary ) > * Move servant4 ( primary -> secondary ) > * Move servant5 ( primary -> secondary ) > * Move servant6 ( primary -> secondary ) > * Move servant7 ( primary -> secondary ) > * Move servant8 ( primary -> secondary ) > * Move servant9_resource1 ( primary -> > secondary ) > * Move servant9_resource2 ( primary -> secondary ) > * Move servant10 ( primary -> secondary ) > * Move servant11 ( primary -> secondary ) > * Move servant12 ( primary -> secondary > ) > * Move servant13 ( primary -> secondary ) > > Executing cluster transition: > * Pseudo action: ms_king_resource_pre_notify_stop_0 > * Pseudo action: ms_servant2_pre_notify_demote_0 > * Resource action: servant3 cancel=10000 on primary > * Resource action: servant3 cancel=11000 on secondary > * Pseudo action: ms_servant3_pre_notify_demote_0 > * Resource action: servant4 stop on primary > * Resource action: servant5 stop on primary > * Resource action: servant6 stop on primary > * Resource action: servant7 stop on primary > * Resource action: servant8 stop on primary > * Pseudo action: servant9_active_disabled_stop_0 > * Resource action: servant9_resource2 stop on primary > * Resource action: servant10 stop on primary > * Resource action: servant11 stop on primary > * Resource action: servant12 stop on primary > * Resource action: servant13 stop on primary > * Resource action: king_resource notify on primary > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-pre_notify_stop_0 > * Pseudo action: ms_king_resource_stop_0 > * Resource action: servant2 notify on primary > * Resource action: servant2 notify on secondary > * Pseudo action: ms_servant2_confirmed-pre_notify_demote_0 > * Pseudo action: ms_servant2_demote_0 > * Resource action: servant3 notify on primary > * Resource action: servant3 notify on secondary > * Pseudo action: ms_servant3_confirmed-pre_notify_demote_0 > * Pseudo 
action: ms_servant3_demote_0 > * Resource action: servant4 start on secondary > * Resource action: servant5 start on secondary > * Resource action: servant6 start on secondary > * Resource action: servant7 start on secondary > * Resource action: servant8 start on secondary > * Resource action: servant9_resource1 stop on primary > * Resource action: servant10 start on secondary > * Resource action: servant11 start on secondary > * Resource action: servant12 start on secondary > * Resource action: servant13 start on secondary > * Resource action: king_resource stop on primary > * Pseudo action: ms_king_resource_stopped_0 > * Resource action: servant2 demote on primary > * Pseudo action: ms_servant2_demoted_0 > * Resource action: servant3 demote on primary > * Pseudo action: ms_servant3_demoted_0 > * Resource action: servant4 monitor=10000 on secondary > * Resource action: servant5 monitor=10000 on secondary > * Resource action: servant6 monitor=10000 on secondary > * Resource action: servant7 monitor=10000 on secondary > * Resource action: servant8 monitor=10000 on secondary > * Pseudo action: servant9_active_disabled_stopped_0 > * Pseudo action: servant9_active_disabled_start_0 > * Resource action: servant9_resource1 start on secondary > * Resource action: servant9_resource2 start on secondary > * Resource action: servant10 monitor=10000 on secondary > * Resource action: servant11 monitor=10000 on secondary > * Resource action: servant12 monitor=10000 on secondary > * Resource action: servant13 monitor=10000 on secondary > * Pseudo action: ms_king_resource_post_notify_stopped_0 > * Pseudo action: ms_servant2_post_notify_demoted_0 > * Pseudo action: ms_servant3_post_notify_demoted_0 > * Pseudo action: servant9_active_disabled_running_0 > * Resource action: servant9_resource1 monitor=10000 on > secondary > * Resource action: servant9_resource2 monitor=10000 on secondary > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-post_notify_stopped_0 > * Pseudo action: ms_king_resource_pre_notify_start_0 > * Resource action: servant2 notify on primary > * Resource action: servant2 notify on secondary > * Pseudo action: ms_servant2_confirmed-post_notify_demoted_0 > * Pseudo action: ms_servant2_pre_notify_promote_0 > * Resource action: servant3 notify on primary > * Resource action: servant3 notify on secondary > * Pseudo action: ms_servant3_confirmed-post_notify_demoted_0 > * Pseudo action: ms_servant3_pre_notify_promote_0 > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-pre_notify_start_0 > * Pseudo action: ms_king_resource_start_0 > * Resource action: servant2 notify on primary > * Resource action: servant2 notify on secondary > * Pseudo action: ms_servant2_confirmed-pre_notify_promote_0 > * Pseudo action: ms_servant2_promote_0 > * Resource action: servant3 notify on primary > * Resource action: servant3 notify on secondary > * Pseudo action: ms_servant3_confirmed-pre_notify_promote_0 > * Pseudo action: ms_servant3_promote_0 > * Resource action: king_resource start on primary > * Pseudo action: ms_king_resource_running_0 > * Resource action: servant2 promote on secondary > * Pseudo action: ms_servant2_promoted_0 > * Resource action: servant3 promote on secondary > * Pseudo action: ms_servant3_promoted_0 > * Pseudo action: ms_king_resource_post_notify_running_0 > * Pseudo action: ms_servant2_post_notify_promoted_0 > * Pseudo action: ms_servant3_post_notify_promoted_0 > * Resource action: king_resource 
notify on primary > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-post_notify_running_0 > * Resource action: servant2 notify on primary > * Resource action: servant2 notify on secondary > * Pseudo action: ms_servant2_confirmed-post_notify_promoted_0 > * Resource action: servant3 notify on primary > * Resource action: servant3 notify on secondary > * Pseudo action: ms_servant3_confirmed-post_notify_promoted_0 > * Pseudo action: ms_king_resource_pre_notify_promote_0 > * Resource action: servant2 monitor=11000 on primary > * Resource action: servant2 monitor=10000 on secondary > * Resource action: servant3 monitor=11000 on primary > * Resource action: servant3 monitor=10000 on secondary > * Resource action: king_resource notify on primary > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-pre_notify_promote_0 > * Pseudo action: ms_king_resource_promote_0 > * Resource action: king_resource promote on secondary > * Pseudo action: ms_king_resource_promoted_0 > * Pseudo action: ms_king_resource_post_notify_promoted_0 > * Resource action: king_resource notify on primary > * Resource action: king_resource notify on secondary > * Pseudo action: ms_king_resource_confirmed-post_notify_promoted_0 > * Resource action: king_resource monitor=11000 on primary > * Resource action: king_resource monitor=10000 on secondary > Using the original execution date of: 2019-06-29 02:33:03Z > > Revised cluster status: > Online: [ primary secondary ] > > stk_shared_ip (ocf::heartbeat:IPaddr2): Started secondary > Clone Set: ms_king_resource [king_resource] (promotable) > Masters: [ secondary ] > Slaves: [ primary ] > Clone Set: ms_servant1 [servant1] > Started: [ primary secondary ] > Clone Set: ms_servant2 [servant2] (promotable) > Masters: [ secondary ] > Slaves: [ primary ] > Clone Set: ms_servant3 [servant3] (promotable) > Masters: [ secondary ] > Slaves: [ primary ] > servant4 (lsb:servant4): Started secondary > servant5 (lsb:servant5): Started secondary > servant6 (lsb:servant6): Started secondary > servant7 (lsb:servant7): Started secondary > servant8 (lsb:servant8): Started secondary > Resource Group: servant9_active_disabled > servant9_resource1 (lsb:servant9_resource1): Started > secondary > servant9_resource2 (lsb:servant9_resource2): Started secondary > servant10 (lsb:servant10): Started secondary > servant11 (lsb:servant11): Started secondary > servant12 (lsb:servant12): Started secondary > servant13 (lsb:servant13): Started secondary > > > I don't think that there is an issue with the CIB constraints > configuration, otherwise the resources would not be able to start > upon bootup, but I'll keep digging and report back if I find any > cause. > > Thanks again, > Harvey > > ________________________________________ > From: Users <users-boun...@clusterlabs.org> on behalf of Ken Gaillot > <kgail...@redhat.com> > Sent: Saturday, 29 June 2019 3:10 a.m. > To: Cluster Labs - All topics related to open-source clustering > welcomed > Subject: EXTERNAL: Re: [ClusterLabs] Problems with master/slave > failovers > > On Fri, 2019-06-28 at 07:36 +0000, Harvey Shepherd wrote: > > Thanks for your reply Andrei. Whilst I understand what you say > > about > > the difficulties of diagnosing issues without all of the info, it's > > a > > compromise between a mailing list posting being very verbose in > > which > > case nobody wants to read it, and containing enough relevant > > information for someone to be able to help. 
With 20+ resources > > involved during a failover there are literally thousands of logs > > generated, and it would be pointless to post them all. > > > > I've tried to focus in on the king resource only to keep things > > simple, as that is the only resource that can initiate a failover. > > I > > provided the real master scores and transition decisions made by > > pacemaker at the times that I killed the king master resource by > > showing the crm_simulator output from both tests, and the CIB > > config > > is ss described. As I mentioned, migration-threshold is set to zero > > for all resources, so it shouldn't prevent a second failover. > > > > Regarding the resource agent return codes, the failure is detected > > by > > the 10s king resource master instance monitor operation, which > > returns OCF_ERR_GENERIC because the resource is expected to be > > running and isn't (the OCF resource agent developers guide states > > that monitor should only return OCF_NOT_RUNNING if there is no > > error > > condition that caused the resource to stop). > > > > What would be really helpful would be if you or someone else could > > help me decipher the crm_simulate output: > > I've been working with Pacemaker for years and still look at those > scores only after exhausting all other investigation. > > It isn't AI, but the complexity is somewhat similar in that it's not > really possible to boil down the factors that went into a decision in > a > few human-readable sentences. We do have a project planned to provide > some insight in human-readable form. > > But if you really want the headache: > > > 1. What is the difference between clone_color and native_color? > > native_color is scores added by the resource as a primitive resource, > i.e. the resource being cloned. clone_color is scores added by the > resource as a clone, i.e. the internal abstraction that allows a > primitive resource to run in multiple places. All it really means is > that different C functions added the scores, which is pretty useless > without staring at the source code of those functions. > > > 2. What is the difference between "promotion scores" and > > "allocation > > scores" and why does the output show several instances of each? > > Allocation is placement of particular resources (including individual > clone instances) to particular nodes; promotion is selecting an > instance to be master. > > The multiple occurrences are due to multiple factors going into the > final score. > > > 3. How does pacemaker use those scores to decide whether to > > failover? > > It doesn't -- it uses them to determine where to failover. Whether to > failover is determined by fail-count and resource operation history > (and affected by configured policies such as on-fail, failure- > timeout, > and migration-threshold). > > > 4. Why is there a -INFINITY score on one node? > > That sometimes requires trace-level debugging and following the path > through the source code. Which I don't recommend unless you're > wanting > to make this a full-time gig :) > > At this level of investigation, I usually start with giving > crm_simulate -VVVV, which will show up to info-level logs. If that > doesn't make it clear, add another -V for debug logs, and then > another > -V for trace logs, but that stretches the bounds of human > intelligibility. Somewhat more helpful is PCMK_trace_tags=<resource- > name> before crm_simulate, which will give some trace-level output > for > the given resource without swamping you with infinite detail. 
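For example, something along these lines (only a sketch, using the
resource name from this thread; point it at whichever pe-input file you
are investigating):

    # Trace-level detail for just king_resource while replaying a saved
    # input, without turning on full -VVVVVV tracing for everything:
    PCMK_trace_tags=king_resource crm_simulate --simulate --show-scores \
        --xml-file /var/lib/pacemaker/pengine/pe-input-10.bz2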
For > clones it's best to use PCMK_trace_tags=<resource-name>,<clone-name> > and sometimes even <resource-name>:0, etc. > > > Thanks again for your help. > > > > > > > > On 28 Jun 2019 6:46 pm, Andrei Borzenkov <arvidj...@gmail.com> > > wrote: > > > On Fri, Jun 28, 2019 at 7:24 AM Harvey Shepherd > > > <harvey.sheph...@aviatnet.com> wrote: > > > > > > > > Hi All, > > > > > > > > > > > > I'm running Pacemaker 2.0.2 on a two node cluster. It runs one > > > > > > master/slave resource (I'll refer to it as the king resource) and > > > about 20 other resources which are a mixture of: > > > > > > > > > > > > - resources that only run on the king resource master node > > > > > > (colocation constraint with a score of INFINITY) > > > > > > > > - clone resources that run on both nodes > > > > > > > > - two other master/slave resources where the masters runs on > > > > the > > > > > > same node as the king resource master (colocation constraint with > > > a > > > score of INFINITY) > > > > > > > > > > > > I'll refer to the above set of resources as servant resources. > > > > > > > > > > > > All servant resources have a resource-stickiness of zero and > > > > the > > > > > > king resource has a resource-stickiness of 100. There is an > > > ordering constraint that the king resource must start before all > > > servant resources. The king resource is controlled by an OCF > > > script > > > that uses crm_master to set the preferred master for the king > > > resource (current master has value 100, current slave is 5, > > > unassigned role or resource failure is 1) - I've verified that > > > these values are being set as expected upon > > > promotion/demotion/failure etc, via the logs. That's pretty much > > > all of the configuration - there is no configuration around node > > > preferences and migration-threshold is zero for everything. > > > > > > > > > > > > What I'm trying to achieve is fairly simple: > > > > > > > > > > > > 1. If any servant resource fails on either node, it is simply > > > > > > restarted. These resources should never failover onto the other > > > node because of colocation with the king resource, and they > > > should > > > not contribute in any way to deciding whether the king resource > > > should failover (which is why they have a resource-stickiness of > > > zero). > > > > > > > > 2. If the slave instance of the king resource fails, it should > > > > > > simply be restarted and again no failover should occur. > > > > > > > > 3. If the master instance of the king resource fails, then its > > > > > > slave instance should immediately be promoted, and the failed > > > instance should be restarted. Failover of all servant resources > > > should then occur due to the colocation dependency. > > > > > > > > > > > > It's number 3 above that I'm having trouble with. If I kill the > > > > > > master king resource instance it behaves as I expect - everything > > > fails over and the king resource is restarted on the new slave. > > > If > > > I then kill the master instance of the king resource again > > > however, > > > instead of failing back over to its original node, it restarts > > > and > > > promotes back to master on the same node. This is not what I > > > want. > > > > > > > > > > migration-threshold is the first thing that comes in mind. > > > Another > > > possibility is hard error returned by resource agent that forces > > > resource off node. 
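(For reference, the meta-attributes being discussed can be inspected
and set with crm_resource. A minimal sketch, using the clone name from
this thread; the option spellings are from a 2.0-era crm_resource, so
verify them against --help on your build:

    # What is the effective migration-threshold on the promotable clone?
    crm_resource --resource ms_king_resource --meta \
                 --get-parameter migration-threshold

    # The earlier workaround: move after a single failure, and optionally
    # let the failure record expire so the node becomes eligible again.
    # The 60s failure-timeout is purely illustrative.
    crm_resource --resource ms_king_resource --meta \
                 --set-parameter migration-threshold --parameter-value 1
    crm_resource --resource ms_king_resource --meta \
                 --set-parameter failure-timeout --parameter-value 60s

    # Clear recorded failures once the underlying problem is fixed:
    crm_resource --cleanup --resource ms_king_resource
)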
> > > > > > But please realize that without actual configuration and logs at > > > the > > > time undesired behavior happens it just becomes game of riddles. > > > > > > > > > > > The relevant output from crm_simulate for the two tests is > > > > shown > > > > > > below. Can anyone suggest what might be going wrong? Whilst I > > > really like the concept of crm_simulate, I can't find a good > > > description of how to interpret the output and I don't understand > > > the difference between clone_color and native_color, or the > > > difference between "promotion scores" and the various instances > > > of > > > "allocation scores", nor does it really tell me what is > > > contributing to the scores. Where does the -INFINITY allocation > > > score come from for example? > > > > > > > > > > > > Thanks, > > > > > > > > Harvey > > > > > > > > > > > > > > > > FIRST KING RESOURCE MASTER FAILURE (CORRECT BEHAVIOUR - MASTER > > > > > > NODE FAILOVER OCCURS) > > > > > > > > > > > > Clone Set: ms_king_resource [king_resource] (promotable) > > > > king_resource (ocf::aviat:king-resource- > > > > ocf): FAILED > > > > > > Master secondary > > > > clone_color: ms_king_resource allocation score on primary: 0 > > > > clone_color: ms_king_resource allocation score on secondary: 0 > > > > clone_color: king_resource:0 allocation score on primary: 0 > > > > clone_color: king_resource:0 allocation score on secondary: 101 > > > > clone_color: king_resource:1 allocation score on primary: 200 > > > > clone_color: king_resource:1 allocation score on secondary: 0 > > > > native_color: king_resource:1 allocation score on primary: 200 > > > > native_color: king_resource:1 allocation score on secondary: 0 > > > > native_color: king_resource:0 allocation score on primary: > > > > > > -INFINITY > > > > native_color: king_resource:0 allocation score on secondary: > > > > 101 > > > > king_resource:1 promotion score on primary: 100 > > > > king_resource:0 promotion score on secondary: 1 > > > > * Recover king_resource:0 ( Master -> Slave secondary > > > > ) > > > > * Promote king_resource:1 ( Slave -> Master primary > > > > ) > > > > * Resource action: king_resource cancel=10000 on secondary > > > > * Resource action: king_resource cancel=11000 on primary > > > > * Pseudo action: ms_king_resource_pre_notify_demote_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > pre_notify_demote_0 > > > > * Pseudo action: ms_king_resource_demote_0 > > > > * Resource action: king_resource demote on secondary > > > > * Pseudo action: ms_king_resource_demoted_0 > > > > * Pseudo action: ms_king_resource_post_notify_demoted_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_demoted_0 > > > > * Pseudo action: ms_king_resource_pre_notify_stop_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > pre_notify_stop_0 > > > > * Pseudo action: ms_king_resource_stop_0 > > > > * Resource action: king_resource stop on secondary > > > > * Pseudo action: ms_king_resource_stopped_0 > > > > * Pseudo action: ms_king_resource_post_notify_stopped_0 > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: 
ms_king_resource_confirmed- > > > > > > post_notify_stopped_0 > > > > * Pseudo action: ms_king_resource_pre_notify_start_0 > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > pre_notify_start_0 > > > > * Pseudo action: ms_king_resource_start_0 > > > > * Resource action: king_resource start on secondary > > > > * Pseudo action: ms_king_resource_running_0 > > > > * Pseudo action: ms_king_resource_post_notify_running_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_running_0 > > > > * Pseudo action: ms_king_resource_pre_notify_promote_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > pre_notify_promote_0 > > > > * Pseudo action: ms_king_resource_promote_0 > > > > * Resource action: king_resource promote on primary > > > > * Pseudo action: ms_king_resource_promoted_0 > > > > * Pseudo action: ms_king_resource_post_notify_promoted_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_promoted_0 > > > > * Resource action: king_resource monitor=11000 on secondary > > > > * Resource action: king_resource monitor=10000 on primary > > > > Clone Set: ms_king_resource [king_resource] (promotable) > > > > > > > > > > > > SECOND KING RESOURCE MASTER FAILURE (INCORRECT BEHAVIOUR - SAME > > > > > > NODE IS PROMOTED TO MASTER) > > > > > > > > > > > > Clone Set: ms_king_resource [king_resource] (promotable) > > > > king_resource (ocf::aviat:king-resource- > > > > ocf): FAILED > > > > > > Master primary > > > > clone_color: ms_king_resource allocation score on primary: 0 > > > > clone_color: ms_king_resource allocation score on secondary: 0 > > > > clone_color: king_resource:0 allocation score on primary: 0 > > > > clone_color: king_resource:0 allocation score on secondary: 200 > > > > clone_color: king_resource:1 allocation score on primary: 101 > > > > clone_color: king_resource:1 allocation score on secondary: 0 > > > > native_color: king_resource:0 allocation score on primary: 0 > > > > native_color: king_resource:0 allocation score on secondary: > > > > 200 > > > > native_color: king_resource:1 allocation score on primary: 101 > > > > native_color: king_resource:1 allocation score on secondary: > > > > > > -INFINITY > > > > king_resource:1 promotion score on primary: 1 > > > > king_resource:0 promotion score on secondary: 1 > > > > * Recover king_resource:1 ( Master primary ) > > > > * Pseudo action: ms_king_resource_pre_notify_demote_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > pre_notify_demote_0 > > > > * Pseudo action: ms_king_resource_demote_0 > > > > * Resource action: king_resource demote on primary > > > > * Pseudo action: ms_king_resource_demoted_0 > > > > * Pseudo action: ms_king_resource_post_notify_demoted_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_demoted_0 > > > > * Pseudo action: 
ms_king_resource_pre_notify_stop_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > pre_notify_stop_0 > > > > * Pseudo action: ms_king_resource_stop_0 > > > > * Resource action: king_resource stop on primary > > > > * Pseudo action: ms_king_resource_stopped_0 > > > > * Pseudo action: ms_king_resource_post_notify_stopped_0 > > > > * Resource action: king_resource notify on secondary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_stopped_0 > > > > * Pseudo action: ms_king_resource_pre_notify_start_0 > > > > * Resource action: king_resource notify on secondary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > pre_notify_start_0 > > > > * Pseudo action: ms_king_resource_start_0 > > > > * Resource action: king_resource start on primary > > > > * Pseudo action: ms_king_resource_running_0 > > > > * Pseudo action: ms_king_resource_post_notify_running_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_running_0 > > > > * Pseudo action: ms_king_resource_pre_notify_promote_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > pre_notify_promote_0 > > > > * Pseudo action: ms_king_resource_promote_0 > > > > * Resource action: king_resource promote on primary > > > > * Pseudo action: ms_king_resource_promoted_0 > > > > * Pseudo action: ms_king_resource_post_notify_promoted_0 > > > > * Resource action: king_resource notify on secondary > > > > * Resource action: king_resource notify on primary > > > > * Pseudo action: ms_king_resource_confirmed- > > > > > > post_notify_promoted_0 > > > > * Resource action: king_resource monitor=10000 on primary > > > > Clone Set: ms_king_resource [king_resource] (promotable) > > -- > Ken Gaillot <kgail...@redhat.com> > > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ > _______________________________________________ > Manage your subscription: > https://lists.clusterlabs.org/mailman/listinfo/users > > ClusterLabs home: https://www.clusterlabs.org/ -- Ken Gaillot <kgail...@redhat.com> _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/