Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Tue, 2023-12-12 at 18:08 +0300, Artem wrote:
> Hi Andrei. pingd==0 won't satisfy both statements. It would if I used
> GTE, but I used GT.
> pingd lt 1 --> [0]
> pingd gt 0 --> [1,2,3,...]

It's the "or defined pingd" part of the rule that will match pingd==0.
A value of 0 is defined.

I'm guessing you meant to use "pingd gt 0 *AND* pingd defined", but then
the "defined" part would be redundant, since any value greater than 0 is
inherently defined. So, for that rule, you only need "pingd gt 0".

> On Tue, 12 Dec 2023 at 17:21, Andrei Borzenkov wrote:
> > On Tue, Dec 12, 2023 at 4:47 PM Artem wrote:
> > > > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > > > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > > Are they really contradicting?
> >
> > Yes. pingd == 0 will satisfy both rules. My use of "always" was
> > incorrect; it does not happen for all possible values of pingd, but
> > it does happen for some.
>
> Maybe defined/not_defined should be put in front of lt/gt? It is
> possible that the VM goes down, pingd becomes not_defined, then the
> rule evaluates "lt 1" first, catches an error and doesn't evaluate the
> next part (after the OR)?

No, the order of and/or clauses doesn't matter.
--
Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
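[Editorial note] Following Ken's point that any value greater than 0 is inherently defined, the bonus rules from the thread reduce to just the numeric comparison. A sketch of the intent only, not a configuration tested against this cluster (FAKE3/FAKE4 and the score 125 are the thread's own values):

```shell
# "pingd gt 0" can only match when pingd is defined, so the
# "or defined pingd" clause (which also matched pingd==0) is dropped.
pcs constraint location FAKE3 rule score=125 pingd gt 0
pcs constraint location FAKE4 rule score=125 pingd gt 0
```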
Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Mon, 2023-12-11 at 21:05 +0300, Artem wrote:
> Hi Ken,
>
> On Mon, 11 Dec 2023 at 19:00, Ken Gaillot wrote:
> > > Question #2) I shut lustre3 VM down and leave it like that
> >
> > How did you shut it down? Outside cluster control, or with something
> > like pcs resource disable?
>
> I did it outside of the cluster to simulate a failure. I turned off
> this VM from vCenter. The cluster is unaware of anything behind the OS.

In that case, check pacemaker.log for messages around the time of the
failure. They should tell you what error originally occurred and why
the cluster is blocked on it.

> > > * FAKE3 (ocf::pacemaker:Dummy): Stopped
> > > * FAKE4 (ocf::pacemaker:Dummy): Started lustre4
> > > * Clone Set: ping-clone [ping]:
> > >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing
> > > OK for now
> > > VM boots up. pcs status:
> > > * FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> > > * Clone Set: ping-clone [ping]:
> > >   * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
> > >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> > > I checked server processes manually and found that lustre4 runs
> > > "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
> > > All is according to documentation, but the results are strange.
> > > Then I tried to add meta target-role="started" to pcs resource
> > > create ping, and this time ping started after the node rebooted.
> > > Can I expect that it was just missing from the official setup
> > > documentation, and now everything will work fine?
--
Ken Gaillot
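[Editorial note] A sketch of the kind of checks Ken suggests. These must run on a cluster node, and the log path assumes a recent default install (older setups log elsewhere); treat the exact paths and flags as assumptions to verify against your own version:

```shell
# Look for the original error around the time of the failure.
grep -iE 'error|warning|ping' /var/log/pacemaker/pacemaker.log | tail -n 50

# One-shot status including inactive resources and fail counts.
crm_mon --one-shot --inactive --failcounts

# After the cause is understood, clear the recorded failure so the
# blocked ping clone can be rescheduled.
pcs resource cleanup ping-clone
```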
Re: [ClusterLabs] ocf:pacemaker:ping works strange
Hi Andrei. pingd==0 won't satisfy both statements. It would if I used
GTE, but I used GT.
pingd lt 1 --> [0]
pingd gt 0 --> [1,2,3,...]

On Tue, 12 Dec 2023 at 17:21, Andrei Borzenkov wrote:
> On Tue, Dec 12, 2023 at 4:47 PM Artem wrote:
> > > > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > > > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > Are they really contradicting?
>
> Yes. pingd == 0 will satisfy both rules. My use of "always" was
> incorrect; it does not happen for all possible values of pingd, but it
> does happen for some.

Maybe defined/not_defined should be put in front of lt/gt? It is
possible that the VM goes down, pingd becomes not_defined, then the rule
evaluates "lt 1" first, catches an error and doesn't evaluate the next
part (after the OR)?
Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Tue, Dec 12, 2023 at 4:47 PM Artem wrote:
>
> On Tue, 12 Dec 2023 at 16:17, Andrei Borzenkov wrote:
> >
> > On Fri, Dec 8, 2023 at 5:44 PM Artem wrote:
> > > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > > pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> > > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > > pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
> >
> > These rules are contradicting. You set the score to 125 if pingd is
> > defined, and at the same time set it to 0 if the value is less than 1.
> > To be "less than 1" it must be defined to start with, so both rules
> > will always apply. I do not know how the rules are ordered. Either
> > you get random behavior, or one pair of these rules is effectively
> > ignored.
>
> "pingd lt 1 or not_defined pingd" means to me ==0 or not_defined, that
> is, ping fails to ping the GW or fails to report to
> corosync/pacemaker. Am I wrong?

That is correct (although I'd reverse the conditions out of habit; it is
meaningless to check for "less than 1" something that is not defined).

> "pingd gt 0 or defined pingd" means to me that ping gets a reply from
> the GW and reports it to the cluster.

No. As you were already told, this is true whenever pingd is defined.
The value does not matter.

> Are they really contradicting?

Yes. pingd == 0 will satisfy both rules. My use of "always" was
incorrect; it does not happen for all possible values of pingd, but it
does happen for some.
Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Tue, 12 Dec 2023 at 16:17, Andrei Borzenkov wrote:
> On Fri, Dec 8, 2023 at 5:44 PM Artem wrote:
> > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
>
> These rules are contradicting. You set the score to 125 if pingd is
> defined, and at the same time set it to 0 if the value is less than 1.
> To be "less than 1" it must be defined to start with, so both rules
> will always apply. I do not know how the rules are ordered. Either you
> get random behavior, or one pair of these rules is effectively ignored.

"pingd lt 1 or not_defined pingd" means to me ==0 or not_defined, that
is, ping fails to ping the GW or fails to report to corosync/pacemaker.
Am I wrong?
"pingd gt 0 or defined pingd" means to me that ping gets a reply from
the GW and reports it to the cluster.
Are they really contradicting?

I read this article and tried to do it in a similar way:
https://habr.com/ru/articles/118925/

> > Question #1) Why I cannot see the accumulated score from pingd in
> > crm_simulate output? Only the location score and stickiness.
> > pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> > pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> > pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> > pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> > Either when all is OK or when the VM is down, the score from pingd
> > is not added to the total score of the RA.
> >
> > Question #2) I shut the lustre3 VM down and leave it like that. pcs status:
> > * FAKE3 (ocf::pacemaker:Dummy): Stopped
> > * FAKE4 (ocf::pacemaker:Dummy): Started lustre4
> > * Clone Set: ping-clone [ping]:
> >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing
> > OK for now
> > The VM boots up. pcs status:
> > * FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> > * Clone Set: ping-clone [ping]:
> >   * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
> >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
>
> If this is the full pcs status output, I miss the stonith resource.

I have "pcs property set stonith-enabled=false" and don't plan to use
it. I want a simple active-passive cluster, like Veritas or
ServiceGuard, with most duties automated. And our production servers
have their iBMC in a locked network segment.
Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Fri, Dec 8, 2023 at 5:44 PM Artem wrote:
>
> Hello experts.
>
> I use pacemaker for a Lustre cluster. But for simplicity and
> exploration I use a Dummy resource. I didn't like how the resource
> performed failover and failback. When I shut down the VM with the
> remote agent, pacemaker tries to restart it. According to pcs status
> it marks the resource (not the RA) Online for some time while the VM
> stays down.
>
> OK, I wanted to improve its behavior and set up a ping monitor. I
> tuned the scores like this:
> pcs resource create FAKE3 ocf:pacemaker:Dummy
> pcs resource create FAKE4 ocf:pacemaker:Dummy
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110
> pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
> for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
> pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd

These rules are contradicting. You set the score to 125 if pingd is
defined, and at the same time set it to 0 if the value is less than 1.
To be "less than 1" it must be defined to start with, so both rules
will always apply. I do not know how the rules are ordered. Either you
get random behavior, or one pair of these rules is effectively ignored.

> Question #1) Why I cannot see the accumulated score from pingd in
> crm_simulate output? Only the location score and stickiness.
> pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> Either when all is OK or when the VM is down, the score from pingd is
> not added to the total score of the RA.
>
> Question #2) I shut the lustre3 VM down and leave it like that. pcs status:
> * FAKE3 (ocf::pacemaker:Dummy): Stopped
> * FAKE4 (ocf::pacemaker:Dummy): Started lustre4
> * Clone Set: ping-clone [ping]:
>   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing
> OK for now
> The VM boots up. pcs status:
> * FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> * Clone Set: ping-clone [ping]:
>   * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
>   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]

If this is the full pcs status output, I miss the stonith resource.

> I checked server processes manually and found that lustre4 runs
> "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3
> doesn't.
> All is according to documentation, but the results are strange.
> Then I tried to add meta target-role="started" to pcs resource create
> ping, and this time ping started after the node rebooted. Can I
> expect that it was just missing from the official setup
> documentation, and now everything will work fine?
Re: [ClusterLabs] ocf:pacemaker:ping works strange
Dear Ken and other experts.
How can I leverage pingd to speed up failover? Or maybe it is useless,
and we should leverage monitor/start timeouts and
migration-threshold/failure-timeout?

I have preference like this for normal operations:
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110

And I need a rule to decrease the preference score on a node where
pingd fails. I'm checking it by powering the VM off with pacemaker
unaware of it (no agents on ESXi/vCenter).

On Mon, 11 Dec 2023 at 19:00, Ken Gaillot wrote:
> > Question #1) Why I cannot see the accumulated score from pingd in
> > crm_simulate output? Only the location score and stickiness.
>
> ping scores aren't added to resource scores, they're just set as node
> attribute values. Location constraint rules map those values to
> resource scores (in this case any defined ping score gets mapped to
> 125).
> --
> Ken Gaillot
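[Editorial note] One common pattern for "decrease the preference on a node where pingd fails" is to make ping failure a strong negative rule rather than making ping success a bonus, and to tune retry behaviour separately. A hedged sketch only, not a tested answer from this thread; the -INFINITY choice and the 60s timeout are illustrative assumptions:

```shell
# Ban the resource from any node whose pingd attribute is missing or 0.
# (-INFINITY makes ping failure decisive; a finite negative score would
# merely lower that node's preference instead.)
pcs constraint location FAKE3 rule score=-INFINITY not_defined pingd or pingd lt 1
pcs constraint location FAKE4 rule score=-INFINITY not_defined pingd or pingd lt 1

# Fail over after the first monitor failure, and let the failure record
# expire so the resource may move back automatically.
pcs resource defaults update migration-threshold=1 failure-timeout=60s
```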
Re: [ClusterLabs] ocf:pacemaker:ping works strange
Hi Ken,

On Mon, 11 Dec 2023 at 19:00, Ken Gaillot wrote:
> > Question #2) I shut lustre3 VM down and leave it like that
>
> How did you shut it down? Outside cluster control, or with something
> like pcs resource disable?

I did it outside of the cluster to simulate a failure. I turned off
this VM from vCenter. The cluster is unaware of anything behind the OS.

> > * FAKE3 (ocf::pacemaker:Dummy): Stopped
> > * FAKE4 (ocf::pacemaker:Dummy): Started lustre4
> > * Clone Set: ping-clone [ping]:
> >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing
> > OK for now
> > VM boots up. pcs status:
> > * FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> > * Clone Set: ping-clone [ping]:
> >   * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
> >   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> > I checked server processes manually and found that lustre4 runs
> > "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3
> > doesn't.
> > All is according to documentation, but the results are strange.
> > Then I tried to add meta target-role="started" to pcs resource
> > create ping, and this time ping started after the node rebooted.
> > Can I expect that it was just missing from the official setup
> > documentation, and now everything will work fine?
Re: [ClusterLabs] ocf:pacemaker:ping works strange
On Fri, 2023-12-08 at 17:44 +0300, Artem wrote:
> Hello experts.
>
> I use pacemaker for a Lustre cluster. But for simplicity and
> exploration I use a Dummy resource. I didn't like how the resource
> performed failover and failback. When I shut down the VM with the
> remote agent, pacemaker tries to restart it. According to pcs status
> it marks the resource (not the RA) Online for some time while the VM
> stays down.
>
> OK, I wanted to improve its behavior and set up a ping monitor. I
> tuned the scores like this:
> pcs resource create FAKE3 ocf:pacemaker:Dummy
> pcs resource create FAKE4 ocf:pacemaker:Dummy
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110
> pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
> for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
> pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd

The "gt 0" part is redundant, since "defined pingd" matches *any* score.

> Question #1) Why I cannot see the accumulated score from pingd in
> crm_simulate output? Only the location score and stickiness.
> pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> Either when all is OK or when the VM is down, the score from pingd is
> not added to the total score of the RA.

ping scores aren't added to resource scores, they're just set as node
attribute values. Location constraint rules map those values to
resource scores (in this case any defined ping score gets mapped to
125).

> Question #2) I shut the lustre3 VM down and leave it like that. pcs
> status:

How did you shut it down? Outside cluster control, or with something
like pcs resource disable?

> * FAKE3 (ocf::pacemaker:Dummy): Stopped
> * FAKE4 (ocf::pacemaker:Dummy): Started lustre4
> * Clone Set: ping-clone [ping]:
>   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing
> OK for now
> The VM boots up. pcs status:
> * FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> * Clone Set: ping-clone [ping]:
>   * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
>   * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> I checked server processes manually and found that lustre4 runs
> "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3
> doesn't.
> All is according to documentation, but the results are strange.
> Then I tried to add meta target-role="started" to pcs resource create
> ping, and this time ping started after the node rebooted. Can I
> expect that it was just missing from the official setup
> documentation, and now everything will work fine?
--
Ken Gaillot
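[Editorial note] Ken's distinction (the ping agent only writes a node attribute; a location rule maps it to a score) can be checked directly on a node. A sketch of the inspection commands, assuming the default attribute name pingd; verify the flags against your installed versions:

```shell
# Show the raw node attribute the ping agent maintains.
attrd_updater --query --name pingd

# Show the scores the scheduler actually computed, including the
# rule-mapped contribution (-s = show scores, -L = use the live CIB).
crm_simulate -sL | grep FAKE3
```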
[ClusterLabs] ocf:pacemaker:ping works strange
Hello experts.

I use pacemaker for a Lustre cluster. But for simplicity and
exploration I use a Dummy resource. I didn't like how the resource
performed failover and failback. When I shut down the VM with the
remote agent, pacemaker tries to restart it. According to pcs status it
marks the resource (not the RA) Online for some time while the VM stays
down.

OK, I wanted to improve its behavior and set up a ping monitor. I tuned
the scores like this:

pcs resource create FAKE3 ocf:pacemaker:Dummy
pcs resource create FAKE4 ocf:pacemaker:Dummy
pcs constraint location FAKE3 prefers lustre3=100
pcs constraint location FAKE3 prefers lustre4=90
pcs constraint location FAKE4 prefers lustre3=90
pcs constraint location FAKE4 prefers lustre4=100
pcs resource defaults update resource-stickiness=110
pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd

Question #1) Why can't I see the accumulated score from pingd in the
crm_simulate output? Only the location score and stickiness.

pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210

Either when all is OK or when the VM is down, the score from pingd is
not added to the total score of the RA.

Question #2) I shut the lustre3 VM down and leave it like that.
pcs status:

* FAKE3 (ocf::pacemaker:Dummy): Stopped
* FAKE4 (ocf::pacemaker:Dummy): Started lustre4
* Clone Set: ping-clone [ping]:
  * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]  << lustre3 missing

OK for now. The VM boots up. pcs status:

* FAKE3 (ocf::pacemaker:Dummy): FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
* Clone Set: ping-clone [ping]:
  * ping (ocf::pacemaker:ping): FAILED lustre3 (blocked)  << why not started?
  * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]

I checked server processes manually and found that lustre4 runs
"/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
All is according to documentation, but the results are strange.
Then I tried to add meta target-role="started" to pcs resource create
ping, and this time ping started after the node rebooted. Can I expect
that it was just missing from the official setup documentation, and now
everything will work fine?
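[Editorial note] For comparison with the setup above, the pattern commonly shown in the Pacemaker documentation pings a real gateway rather than the literal string "local", and scales the attribute with multiplier so each reachable host adds a visible amount. A sketch only; the gateway address is a placeholder assumption:

```shell
# 192.168.1.1 is a placeholder; use your own gateway. multiplier=1000
# makes pingd = 1000 * (number of reachable hosts in host_list).
pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=192.168.1.1 op monitor interval=3s timeout=7s clone
```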