Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Ken Gaillot
On Tue, 2023-12-12 at 18:08 +0300, Artem wrote:
> Hi Andrei. pingd==0 won't satisfy both statements. It would if I used
> GTE, but I used GT.
> pingd lt 1 --> [0]
> pingd gt 0 --> [1,2,3,...]

It's the "or defined pingd" part of the rule that will match pingd==0.
A value of 0 is defined.

I'm guessing you meant to use "pingd gt 0 *AND* pingd defined", but
then the defined part would become redundant since any value greater
than 0 is inherently defined. So, for that rule, you only need "pingd
gt 0".

> 
> On Tue, 12 Dec 2023 at 17:21, Andrei Borzenkov wrote:
> > On Tue, Dec 12, 2023 at 4:47 PM Artem  wrote:
> > >> > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > >> > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > > Are they really contradictory?
> > 
> > Yes. pingd == 0 will satisfy both rules. My use of "always" was
> > incorrect: it does not happen for all possible values of pingd, but it
> > does happen for some.
> 
> Maybe defined/not_defined should be put in front of lt/gt? Is it
> possible that the VM goes down, pingd becomes not_defined, and then the
> rule evaluates "lt 1" first, hits an error, and doesn't evaluate the
> next part (after the OR)?

No, the order of and/or clauses doesn't matter.
-- 
Ken Gaillot 


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Ken Gaillot
On Mon, 2023-12-11 at 21:05 +0300, Artem wrote:
> Hi Ken,
> 
> On Mon, 11 Dec 2023 at 19:00, Ken Gaillot wrote:
> > > Question #2) I shut lustre3 VM down and leave it like that
> > How did you shut it down? Outside cluster control, or with
> > something
> > like pcs resource disable?
> > 
> 
> I did it outside of the cluster to simulate a failure. I turned off
> this VM from vCenter. The cluster is unaware of anything beyond the OS.

In that case, check pacemaker.log for messages from around the time of
the failure. They should tell you what error originally occurred and why
the cluster is blocked on it.
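
Something along these lines usually narrows it down (assuming the
default log location on EL-based distros; adjust the path for your
setup):

grep -iE "error|warning" /var/log/pacemaker/pacemaker.log
journalctl -u pacemaker --since "1 hour ago"

Once you know the original error, "pcs resource cleanup <resource>"
clears the failed state so the cluster will retry.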

> 
> > >   * FAKE3   (ocf::pacemaker:Dummy):  Stopped
> > >   * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
> > >   * Clone Set: ping-clone [ping]:
> > >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
> > > OK for now
> > > VM boots up. pcs status:
> > >   * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> > >   * Clone Set: ping-clone [ping]:
> > >     * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
> > >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> > > I checked server processes manually and found that lustre4 runs
> > > "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
> > > All is according to the documentation, but the results are strange.
> > > Then I tried to add meta target-role="started" to the pcs resource
> > > create ping command, and this time ping started after the node
> > > rebooted. Can I expect that it was just missing from the official
> > > setup documentation, and that now everything will work fine?
-- 
Ken Gaillot 


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Artem
Hi Andrei. pingd==0 won't satisfy both statements. It would if I used GTE,
but I used GT.
pingd lt 1 --> [0]
pingd gt 0 --> [1,2,3,...]

On Tue, 12 Dec 2023 at 17:21, Andrei Borzenkov  wrote:

> On Tue, Dec 12, 2023 at 4:47 PM Artem  wrote:
> >> > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> >> > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > Are they really contradictory?
>
> Yes. pingd == 0 will satisfy both rules. My use of "always" was
> incorrect: it does not happen for all possible values of pingd, but it
> does happen for some.
>

Maybe defined/not_defined should be put in front of lt/gt? Is it
possible that the VM goes down, pingd becomes not_defined, and then the
rule evaluates "lt 1" first, hits an error, and doesn't evaluate the
next part (after the OR)?


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Andrei Borzenkov
On Tue, Dec 12, 2023 at 4:47 PM Artem  wrote:
>
>
>
> On Tue, 12 Dec 2023 at 16:17, Andrei Borzenkov  wrote:
>>
>> On Fri, Dec 8, 2023 at 5:44 PM Artem  wrote:
>> > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
>> > pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
>> > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
>> > pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
>> >
>>
>> These rules are contradictory. You set the score to 125 if pingd is
>> defined, and at the same time set it to 0 if the value is less than 1.
>> To be "less than 1" it must be defined to start with, so both rules
>> will always apply. I do not know how the rules are ordered. Either you
>> get random behavior, or one pair of these rules is effectively
>> ignored.
>
>
> "pingd lt 1 or not_defined pingd" means to me ==0 or not_defined, that is 
> ping fails to ping GW or fails to report to corosync/pacemaker. Am I wrong?

That is correct (although I'd reverse the conditions out of habit; it is
meaningless to check whether something that is not defined is "less than 1").
> "pingd gt 0 or defined pingd" means to me that ping gets reply from GW and 
> reports it to cluster.

No. As you were already told, this is true whenever pingd is defined;
the value does not matter.

> Are they really contradictory?

Yes. pingd == 0 will satisfy both rules. My use of "always" was
incorrect: it does not happen for all possible values of pingd, but it
does happen for some.


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Artem
On Tue, 12 Dec 2023 at 16:17, Andrei Borzenkov  wrote:

> On Fri, Dec 8, 2023 at 5:44 PM Artem  wrote:
> > pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> > pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> > pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> > pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
> >
>
> These rules are contradictory. You set the score to 125 if pingd is
> defined, and at the same time set it to 0 if the value is less than 1.
> To be "less than 1" it must be defined to start with, so both rules
> will always apply. I do not know how the rules are ordered. Either you
> get random behavior, or one pair of these rules is effectively
> ignored.
>

"pingd lt 1 or not_defined pingd" means to me ==0 or not_defined, that is
ping fails to ping GW or fails to report to corosync/pacemaker. Am I wrong?
"pingd gt 0 or defined pingd" means to me that ping gets reply from GW and
reports it to cluster.
Are they really contradicting?
I read this article and tried to do in a similar way:
https://habr.com/ru/articles/118925/


>
> > Question #1) Why can't I see the accumulated score from pingd in
> > crm_simulate output? Only the location score and stickiness.
> > pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> > pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> > pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> > pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> > Whether all is OK or the VM is down, the score from pingd is not added
> > to the total score of the RA
> >
> >
> > Question #2) I shut the lustre3 VM down and leave it like that. pcs status:
> >   * FAKE3   (ocf::pacemaker:Dummy):  Stopped
> >   * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
> >   * Clone Set: ping-clone [ping]:
> >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
> > OK for now
> > VM boots up. pcs status:
> >   * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> >   * Clone Set: ping-clone [ping]:
> >     * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
> >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
>
> If this is the full pcs status output, the stonith resource is missing.
>
I have "pcs property set stonith-enabled=false" and don't plan to use
it. I want a simple active-passive cluster, like Veritas or ServiceGuard,
with most duties automated. And our production servers have their iBMC
in a locked network segment.


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Andrei Borzenkov
On Fri, Dec 8, 2023 at 5:44 PM Artem  wrote:
>
> Hello experts.
>
> I use pacemaker for a Lustre cluster. But for simplicity and exploration I
> use a Dummy resource. I didn't like how the resource performed failover and
> failback. When I shut down a VM with a remote agent, pacemaker tries to
> restart it. According to pcs status, it marks the resource (not the RA)
> Online for some time while the VM stays down.
>
> OK, I wanted to improve its behavior and set up a ping monitor. I tuned the 
> scores like this:
> pcs resource create FAKE3 ocf:pacemaker:Dummy
> pcs resource create FAKE4 ocf:pacemaker:Dummy
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110
> pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
> for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
> pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
>

These rules are contradictory. You set the score to 125 if pingd is
defined, and at the same time set it to 0 if the value is less than 1.
To be "less than 1" it must be defined to start with, so both rules
will always apply. I do not know how the rules are ordered. Either you
get random behavior, or one pair of these rules is effectively
ignored.
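
A sketch of what I suspect you meant (untested), which avoids the
overlap at pingd == 0:

pcs constraint location FAKE3 rule score=0 not_defined pingd or pingd lt 1
pcs constraint location FAKE3 rule score=125 pingd gt 0

With "pingd gt 0" instead of "or defined pingd", a node with pingd == 0
matches only the first rule, so the two scores can no longer both apply
to the same node.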

>
> Question #1) Why can't I see the accumulated score from pingd in
> crm_simulate output? Only the location score and stickiness.
> pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> Whether all is OK or the VM is down, the score from pingd is not added
> to the total score of the RA
>
>
> Question #2) I shut the lustre3 VM down and leave it like that. pcs status:
>   * FAKE3   (ocf::pacemaker:Dummy):  Stopped
>   * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
>   * Clone Set: ping-clone [ping]:
>     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
> OK for now
> VM boots up. pcs status:
>   * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
>   * Clone Set: ping-clone [ping]:
>     * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
>     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]

If this is the full pcs status output, the stonith resource is missing.

> I checked server processes manually and found that lustre4 runs
> "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
> All is according to the documentation, but the results are strange.
> Then I tried to add meta target-role="started" to the pcs resource create
> ping command, and this time ping started after the node rebooted. Can I
> expect that it was just missing from the official setup documentation, and
> that now everything will work fine?


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-12 Thread Artem
Dear Ken and other experts.

How can I leverage pingd to speed up failover? Or maybe it is useless,
and we should rely on monitor/start timeouts and
migration-threshold/failure-timeout?

I have preference like this for normal operations:
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110
And I need a rule to decrease the preference score on a node where pingd
fails. I'm testing this by powering the VM off with Pacemaker unaware of
it (no agents on ESXi/vCenter).
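
One pattern I'm considering (a sketch adapted from the connectivity
example in the Pacemaker documentation; untested in my setup):

pcs constraint location FAKE3 rule score=-INFINITY not_defined pingd or pingd lt 1
pcs constraint location FAKE4 rule score=-INFINITY not_defined pingd or pingd lt 1

i.e. ban a resource from any node that lost connectivity instead of
merely lowering its score, so failover starts as soon as the attribute
is updated (after the dampen interval).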

On Mon, 11 Dec 2023 at 19:00, Ken Gaillot  wrote:

> > Question #1) Why can't I see the accumulated score from pingd in
> > crm_simulate output? Only the location score and stickiness.
>
> ping scores aren't added to resource scores, they're just set as node
> attribute values. Location constraint rules map those values to
> resource scores (in this case any defined ping score gets mapped to
> 125).
> --
> Ken Gaillot 
>
>


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-11 Thread Artem
Hi Ken,

On Mon, 11 Dec 2023 at 19:00, Ken Gaillot  wrote:

> > Question #2) I shut the lustre3 VM down and leave it like that

> How did you shut it down? Outside cluster control, or with something
> like pcs resource disable?

I did it outside of the cluster to simulate a failure. I turned off this
VM from vCenter. The cluster is unaware of anything beyond the OS.


> >   * FAKE3   (ocf::pacemaker:Dummy):  Stopped
> >   * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
> >   * Clone Set: ping-clone [ping]:
> >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
> > OK for now
> > VM boots up. pcs status:
> >   * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
> >   * Clone Set: ping-clone [ping]:
> >     * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
> >     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> > I checked server processes manually and found that lustre4 runs
> > "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
> > All is according to the documentation, but the results are strange.
> > Then I tried to add meta target-role="started" to the pcs resource create
> > ping command, and this time ping started after the node rebooted. Can I
> > expect that it was just missing from the official setup documentation, and
> > that now everything will work fine?
>
>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] ocf:pacemaker:ping works strange

2023-12-11 Thread Ken Gaillot
On Fri, 2023-12-08 at 17:44 +0300, Artem wrote:
> Hello experts.
> 
> I use pacemaker for a Lustre cluster. But for simplicity and
> exploration I use a Dummy resource. I didn't like how the resource
> performed failover and failback. When I shut down a VM with a remote
> agent, pacemaker tries to restart it. According to pcs status, it
> marks the resource (not the RA) Online for some time while the VM
> stays down.
> 
> OK, I wanted to improve its behavior and set up a ping monitor. I
> tuned the scores like this:
> pcs resource create FAKE3 ocf:pacemaker:Dummy
> pcs resource create FAKE4 ocf:pacemaker:Dummy
> pcs constraint location FAKE3 prefers lustre3=100
> pcs constraint location FAKE3 prefers lustre4=90
> pcs constraint location FAKE4 prefers lustre3=90
> pcs constraint location FAKE4 prefers lustre4=100
> pcs resource defaults update resource-stickiness=110
> pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
> for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
> pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
> pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
> pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd

The gt 0 part is redundant since "defined pingd" matches *any* value.
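
As written, each of those rules therefore behaves the same as (sketch):

pcs constraint location FAKE3 rule score=125 defined pingd

Decide which behavior you actually want ("any node where ping is
running" versus "any node that can actually reach the ping targets")
and keep just that one expression: "defined pingd" for the former,
"pingd gt 0" for the latter.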

> 
> 
> Question #1) Why can't I see the accumulated score from pingd in
> crm_simulate output? Only the location score and stickiness.
> pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
> pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
> pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
> Whether all is OK or the VM is down, the score from pingd is not added
> to the total score of the RA

ping scores aren't added to resource scores, they're just set as node
attribute values. Location constraint rules map those values to
resource scores (in this case any defined ping score gets mapped to
125).
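
You can inspect the raw attribute directly, for example (standard
Pacemaker CLI tools; exact options may vary by version):

attrd_updater --query --name pingd --node lustre3
crm_mon -1A

crm_simulate then only shows the score that results after your location
rules translate the attribute value.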

> 
> 
> Question #2) I shut the lustre3 VM down and leave it like that. pcs status:

How did you shut it down? Outside cluster control, or with something
like pcs resource disable?

>   * FAKE3   (ocf::pacemaker:Dummy):  Stopped
>   * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
>   * Clone Set: ping-clone [ping]:
>     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
> OK for now
> VM boots up. pcs status:
>   * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
>   * Clone Set: ping-clone [ping]:
>     * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
>     * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
> I checked server processes manually and found that lustre4 runs
> "/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
> All is according to the documentation, but the results are strange.
> Then I tried to add meta target-role="started" to the pcs resource create
> ping command, and this time ping started after the node rebooted. Can I
> expect that it was just missing from the official setup documentation,
> and that now everything will work fine?
-- 
Ken Gaillot 



[ClusterLabs] ocf:pacemaker:ping works strange

2023-12-08 Thread Artem
Hello experts.

I use pacemaker for a Lustre cluster. But for simplicity and exploration I
use a Dummy resource. I didn't like how the resource performed failover and
failback. When I shut down a VM with a remote agent, pacemaker tries to
restart it. According to pcs status, it marks the resource (not the RA)
Online for some time while the VM stays down.

OK, I wanted to improve its behavior and set up a ping monitor. I tuned the
scores like this:
pcs resource create FAKE3 ocf:pacemaker:Dummy
pcs resource create FAKE4 ocf:pacemaker:Dummy
pcs constraint location FAKE3 prefers lustre3=100
pcs constraint location FAKE3 prefers lustre4=90
pcs constraint location FAKE4 prefers lustre3=90
pcs constraint location FAKE4 prefers lustre4=100
pcs resource defaults update resource-stickiness=110
pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local op monitor interval=3s timeout=7s clone meta target-role="started"
for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
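
To verify what actually got configured, I check the resulting rules with:

pcs constraint --full

(it lists every constraint with its id and score; the exact output
format differs a bit between pcs versions).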


Question #1) Why can't I see the accumulated score from pingd in
crm_simulate output? Only the location score and stickiness.
pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
Whether all is OK or the VM is down, the score from pingd is not added to
the total score of the RA.


Question #2) I shut the lustre3 VM down and leave it like that. pcs status:
  * FAKE3   (ocf::pacemaker:Dummy):  Stopped
  * FAKE4   (ocf::pacemaker:Dummy):  Started lustre4
  * Clone Set: ping-clone [ping]:
    * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
OK for now
VM boots up. pcs status:
  * FAKE3   (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
  * Clone Set: ping-clone [ping]:
    * ping  (ocf::pacemaker:ping):   FAILED lustre3 (blocked)  << why not started?
    * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
I checked server processes manually and found that lustre4 runs
"/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
All is according to the documentation, but the results are strange.
Then I tried to add meta target-role="started" to the pcs resource create
ping command, and this time ping started after the node rebooted. Can I
expect that it was just missing from the official setup documentation, and
that now everything will work fine?