[ClusterLabs] Antw: [EXT] Re: Q: rule-based operation pause/freeze?

2020-03-05 Thread Ulrich Windl
>>> Ondrej wrote on 06.03.2020 at 01:45 in message
<7499_1583455563_5E619D4B_7499_1105_1_2a18c389-059e-cf6f-a840-dec26437fdd1@famer.cz>:
> On 3/5/20 9:24 PM, Ulrich Windl wrote:
>> Hi!
>> 
>> I'm wondering whether it's possible to pause/freeze specific resource
>> operations through rules.
>> The idea is something like this: If your monitor operation needs (e.g.)
>> some external NFS server, and that NFS server is known to be down, it seems
>> better to delay the monitor operation until NFS is up again, rather than
>> forcing a monitor timeout that will most likely be followed by a stop
>> operation that will also time out, eventually killing the node (which has
>> no problem itself).
>> 
>> As I guess it's not possible right now, what would be needed to make this
>> work?
>> In case it's possible, what would an example scenario look like?
>> 
>> Regards,
>> Ulrich
>> 
> 
> Hi Ulrich,
> 
> For the 'monitor' operation you can disable it with the approach described
> here at
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disabling_a_monitor_operation.html
> 
>  > "followed by a stop operation that will also time out, eventually 
> killing the node (which has no problem itself)"
> This sounds to me like a resource agent "feature", and I would expect
> that different resource agents would have different behavior when
> something is lost/not present.

Of course. Some RAs are "slim", while others are really "fat" (e.g. calling a
command that uses a REST API to query a Java server that runs a command which
finally checks the status of the service. Maybe even worse.).

> 
> To me the idea here looks like "maintenance period" for some resource.

No, it's to avoid an "error cascade".

> Is your expectation that cluster would not for some time do anything 
> with some resources?
> (In such a case I would consider 'is-managed'=false + disabling the monitor)
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html#_resource_meta_attributes

Your suggestion would require modifying multiple operations in multiple
resources every time it's needed, while my idea was to "flag" the
corresponding operations once and let some rule decide what to do. Agreed,
the rule would eventually do the same thing from a higher perspective, but
the "configuration" would not change every time.

> 
> To determine _when_ this state should be enabled and disabled would be a 
> different story.

For the moment let's assume I know it ;-) ping-node, maybe.
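
(A minimal sketch of that ping-node idea, assuming pcs; the address and the
resource name are placeholders, not part of any existing setup:)

  # A cloned ocf:pacemaker:ping resource maintains the "pingd" node attribute,
  # which a rule could later test; 192.0.2.20 stands in for the NFS server.
  pcs resource create ping-nfs ocf:pacemaker:ping \
      host_list="192.0.2.20" dampen=30s multiplier=1000 \
      op monitor interval=15s
  pcs resource clone ping-nfs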

Regards,
Ulrich

> 
> ‑‑
> Ondrej Famera
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: rule-based operation pause/freeze?

2020-03-05 Thread Ondrej

On 3/5/20 9:24 PM, Ulrich Windl wrote:

Hi!

I'm wondering whether it's possible to pause/freeze specific resource 
operations through rules.
The idea is something like this: If your monitor operation needs (e.g.) some 
external NFS server, and that NFS server is known to be down, it seems better 
to delay the monitor operation until NFS is up again, rather than forcing a 
monitor timeout that will most likely be followed by a stop operation that will 
also time out, eventually killing the node (which has no problem itself).

As I guess it's not possible right now, what would be needed to make this work?
In case it's possible, what would an example scenario look like?

Regards,
Ulrich



Hi Ulrich,

For the 'monitor' operation you can disable it with the approach described
here at
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disabling_a_monitor_operation.html
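
(For illustration, a minimal sketch of what that documentation describes; the
resource and operation IDs here are made up:)

  <!-- Setting enabled="false" on the recurring op tells Pacemaker to stop
       scheduling it; remove the attribute or set "true" to re-enable it. -->
  <op id="myres-monitor-interval-30s" name="monitor" interval="30s"
      timeout="20s" enabled="false"/>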


> "followed by a stop operation that will also time out, eventually 
killing the node (which has no problem itself)"
This sounds to me as the resource agent "feature" and I would expect 
that different resources agents would have different behavior when 
something is lost/not present.


To me the idea here looks like a "maintenance period" for some resource.
Is your expectation that the cluster would not do anything with some
resources for some time?

(In such a case I would consider 'is-managed'=false + disabling the monitor)
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html#_resource_meta_attributes
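
(A hedged sketch of that combination; "myres" is a placeholder resource name:)

  # Tell Pacemaker to stop managing the resource (sets is-managed=false),
  # and re-enable management once the maintenance window is over:
  pcs resource unmanage myres
  pcs resource manage myres
  # The recurring monitor can be disabled with enabled="false" on its op,
  # as in the documentation linked in the other reply.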

To determine _when_ this state should be enabled and disabled would be a 
different story.


--
Ondrej Famera
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Valentin Vidić
On Wed, Mar 04, 2020 at 10:05:50AM +0200, Strahil Nikolov wrote:
> Maybe I will be unsubscribed every 10th email instead of every 5th one.

In the default Mailman config the unsubscribe (bounce) score threshold seems
to be 5.0, but you can only accumulate 1.0 per day on which there are bounces.

Also, the score is reset to 0 if there are no bounces for 7 days.

https://www.gnu.org/software/mailman/mailman-admin/node25.html
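
(For reference, the per-list Mailman 2 bounce-processing settings behind this
behaviour; the names are from the Mailman 2.1 admin interface, the values are
the defaults described above:)

  bounce_score_threshold = 5.0   # member is disabled/unsubscribed at this score
  bounce_info_stale_after = 7    # days without bounces before the score resets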

-- 
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Valentin Vidić
On Thu, Mar 05, 2020 at 11:07:04PM +0200, Strahil Nikolov wrote:
> After a random number of e-mails, I got a notification that I'm
> unsubscribed due to the maximum amount of bounces being reached, but I got
> no e-mail about that from yahoo.
> 
> Actually I have no clue about the reason.

Yep, you probably did not get my reply either, so I'm Cc'ing you now to
prevent a split-brain situation :)
https://lists.clusterlabs.org/pipermail/users/2020-March/026939.html

After N people send DKIM-signed mails to the list, these produce N bounces
from yahoo and Mailman removes you from the list.

-- 
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Valentin Vidić
On Thu, Mar 05, 2020 at 11:46:16AM -0600, Ken Gaillot wrote:
> Hmm, not sure what the best approach is. I think some people like
> having the [ClusterLabs] tag in the subject line. If anyone has
> suggested config changes for mailman 2, I can take a look.

In that case it would be best to rewrite the From header to use
the list address, and the rest can probably stay as is. More info
here:

  https://wiki.list.org/DEV/DMARC
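
(In Mailman 2.1.18 and later these are per-list settings; a sketch, with the
exact choice depending on the installed version:)

  from_is_list = Munge From      # rewrite From: to the list address for all posts
  # or, only for senders whose domain publishes a restrictive DMARC policy:
  dmarc_moderation_action = Munge From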

-- 
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Valentin Vidić
On Thu, Mar 05, 2020 at 11:44:55AM -0600, Ken Gaillot wrote:
> What sort of issue are you seeing exactly? Is your account being
> unsubscribed from the list automatically, or are you not receiving some
> of the emails sent by the list?

He is on yahoo and based on this Mailman page it seems yahoo rejects
messages with invalid signatures:

  https://wiki.list.org/DEV/DMARC

If there are a lot of these rejections for a subscriber, Mailman probably
decides to remove them from the list.

This is also in line with the report I get from yahoo:

  <record>
    <row>
      <source_ip>78.46.95.29</source_ip>
      <count>2</count>
      <policy_evaluated>
        <disposition>reject</disposition>
        <dkim>fail</dkim>
        <spf>fail</spf>
      </policy_evaluated>
    </row>
    <identifiers>
      <header_from>valentin-vidic.from.hr</header_from>
    </identifiers>
    <auth_results>
      <dkim>
        <domain>valentin-vidic.from.hr</domain>
        <result>permerror</result>
      </dkim>
      <spf>
        <domain>clusterlabs.org</domain>
        <result>none</result>
      </spf>
    </auth_results>
  </record>

-- 
Valentin
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Ken Gaillot
On Wed, 2020-03-04 at 10:44 +0100, Valentin Vidić wrote:
> AFAICT from the reports, the mail I send to the list might not get
> delivered, perhaps this is causing the unsubscribe too:
> 
>   <record>
>     <row>
>       <source_ip>78.46.95.29</source_ip>
>       <count>2</count>
>       <policy_evaluated>
>         <disposition>reject</disposition>
>         <dkim>fail</dkim>
>         <spf>fail</spf>
>       </policy_evaluated>
>     </row>
>     <identifiers>
>       <header_from>valentin-vidic.from.hr</header_from>
>     </identifiers>
>     <auth_results>
>       <dkim>
>         <domain>valentin-vidic.from.hr</domain>
>         <result>permerror</result>
>       </dkim>
>       <spf>
>         <domain>clusterlabs.org</domain>
>         <result>none</result>
>       </spf>
>     </auth_results>
>   </record>
> 
> For DKIM the problem is that the list modifies the Subject and body, so
> the signature is not valid anymore. The list would need to remove the
> DKIM headers, change the From field to the list address and perhaps
> add a DKIM signature of its own. Another option is for the list
> to stop modifying messages:
> https://begriffs.com/posts/2018-09-18-dmarc-mailing-list.html

Hmm, not sure what the best approach is. I think some people like
having the [ClusterLabs] tag in the subject line. If anyone has
suggested config changes for mailman 2, I can take a look.

> For SPF it would be good to add SPF records into DNS for the
> clusterlabs.org domain.

We definitely should add SPF records. That might help the "not being
delivered" issue, if mail servers are doing a "SPF or DKIM must pass"
test.
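
(A hedged sketch of what such a record could look like; the mechanisms listed
are illustrative guesses, not the actual clusterlabs.org mail setup:)

  ; DNS TXT record naming the hosts allowed to send mail for the domain
  clusterlabs.org.  IN  TXT  "v=spf1 mx a:lists.clusterlabs.org ~all"
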
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done

2020-03-05 Thread Ken Gaillot
On Wed, 2020-03-04 at 10:05 +0200, Strahil Nikolov wrote:
> Maybe I will be unsubscribed every 10th email instead of every 5th
> one.

Hi Strahil,

What sort of issue are you seeing exactly? Is your account being
unsubscribed from the list automatically, or are you not receiving some
of the emails sent by the list?
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Debian 10 pacemaker - CIB did not pass schema validation

2020-03-05 Thread Ken Gaillot
On Thu, 2020-03-05 at 11:44 +, Bala Mutyam wrote:
> Hi Strahil,
> 
> Apologies for my delay. I've attached the config below.
> 
> Here is the new error:
> 
> crm_verify --verbose --xml-file=/tmp/ansible.yJMg2z.xml
> /tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity
> error : Invalid sequence in interleave
> /tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity
> error : Element primitive failed to validate content
> /tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
> Invalid sequence in interleave
> /tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
> Element clone failed to validate content
> /tmp/ansible.yJMg2z.xml:19: element primitive: Relax-NG validity
> error : Element resources has extra content: primitive
> (main)  error: CIB did not pass schema validation
> Errors found during check: config not valid

The attached config doesn't have any clone elements, so I'm guessing
it's not the /tmp/ansible.yJMg2z.xml mentioned above? The syntax in
that tmp file is not valid (somewhere in the <clone> and <primitive>
tags).
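
(For comparison, a minimal sketch of schema-valid <resources> content with a
primitive properly nested; the IDs, the group and the systemd squid unit are
illustrative assumptions, not the poster's actual configuration:)

  <resources>
    <group id="proxy-group">
      <primitive id="vip1" class="ocf" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="vip1-attrs">
          <nvpair id="vip1-ip" name="ip" value="192.0.2.10"/>
        </instance_attributes>
        <operations>
          <op id="vip1-monitor-30s" name="monitor" interval="30s"/>
        </operations>
      </primitive>
      <primitive id="squid" class="systemd" type="squid"/>
    </group>
  </resources>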

> 
> Thanks
> Bala
> 
> 
> On Mon, Mar 2, 2020 at 5:26 PM Strahil Nikolov  > wrote:
> > On March 2, 2020 1:22:55 PM GMT+02:00, Bala Mutyam <
> > koti.reddy...@gmail.com> wrote:
> > >Hi All,
> > >
> > >I'm trying to setup Pacemaker cluster with 2 VIPs and a group with
> > the
> > >VIPs
> > >and service for squid proxy. But the CIB verification is failing
> > with
> > >below
> > >errors. Could someone help me with this please?
> > >
> > >Errors:
> > >
> > >crm_verify --verbose --xml-file=/tmp/ansible.oGK0ye.xml
> > >/tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity
> > error
> > >:
> > >Invalid sequence in interleave
> > >/tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity
> > error
> > >:
> > >Element primitive failed to validate content
> > >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error
> > :
> > >Invalid sequence in interleave
> > >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error
> > :
> > >Element group failed to validate content
> > >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error
> > :
> > >Element resources has extra content: group
> > >(main)  error: CIB did not pass schema validation
> > >Errors found during check: config not valid
> > 
> > And your config is ?
> > 
> > Best Regards,
> > Strahil Nikolov
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Resource monitors crash, restart, leave core files

2020-03-05 Thread Ken Gaillot
On Thu, 2020-03-05 at 13:14 +, Jaap Winius wrote:
> Hi folks,
> 
> My test system, which includes support for a filesystem resource  
> called 'mount', works fine otherwise, but every day or so I see  
> monitor errors like the following when I run 'pcs status':
> 
>Failed Resource Actions:
>* mount_monitor_2 on bd3c7 'unknown error' (1): call=23,  
> status=Error, exitreason='',
> last-rc-change='Thu Mar  5 04:57:55 2020', queued=0ms,
> exec=0ms
> 
> The corosync.log shows some more information (see log fragments  
> below), but I'm unable to identify a cause. The resource monitor
> bombs  
> out, produces a core dump and then starts up again about 2 seconds  
> later. I've also seen this happen with the monitor for my nfsserver  
> resource. Other than that it stops for a few seconds, the other  
> problem is that this will eventually cause the filesystem with the  
> ./pacemaker/cores/ directory to fill up with core files (so far,
> each  
> is less than 1MB).
> 
> Could this be a bug, or is my software not configured correctly
> (see  
> cfg below)?
> 
> Thanks,
> 
> Jaap
> 
> PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20,
> PCS  
> 0.9.167 and DRBD 9.10.0.
> 
> # corosync.log #
> 
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl   lrmd:error:  
> child_waitpid:  Managed process 22553 (mount_monitor_2)
> dumped  
> core

This would have to be a bug in the resource agent. I'd build it with
debug symbols to get a backtrace from the core.
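
(A hedged sketch of doing that; the core file name is an assumption, and the
default dump directory in this setup is typically /var/lib/pacemaker/cores:)

  # Identify which executable actually dumped core, then pull a backtrace
  file /var/lib/pacemaker/cores/core.22553
  gdb <executable-reported-by-file> /var/lib/pacemaker/cores/core.22553
  (gdb) bt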

> Mar 05 04:57:55 [15652] bd3c7.umrk.nl   lrmd:  warning:  
> operation_finished: mount_monitor_2:22553 - terminated with
> signal  
> 11
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd:error:  
> process_lrm_event:  Result of monitor operation for mount on bd3c7:  
> Error | call=23 key=mount_monitor_2 confirmed=false status=4  
> cib-update=143
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd: info:  
> abort_transition_graph: Transition aborted by operation  
> mount_monitor_2 'create' on bd3c7: Old event |  
> magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 cib=0.22.62  
> source=process_graph_event:499 complete=true
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd: info:  
> process_graph_event:Detected action (2.40)  
> mount_monitor_2.23=unknown error: failed
> ...
> Mar 05 04:57:56 [15652] bd3c7.umrk.nl   lrmd: info:  
> cancel_recurring_action:Cancelling ocf operation
> mount_monitor_2
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd:   notice:  
> te_rsc_command: Initiating monitor operation
> mount_monitor_2  
> locally on bd3c7 | action 1
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
> do_lrm_rsc_op:  Performing  
> key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953
> op=mount_monitor_2
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
> cib_perform_op: +   
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:  @transition-
> key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @transition-magic=-
> 1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=-1, @rc-
> code=193, @op-status=-1, @last-rc-change=1583380677,  
> @exec-time=0
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
> process_lrm_event:  Result of monitor operation for mount on bd3c7:
> 0  
> (ok) | call=51 key=mount_monitor_2 confirmed=false cib-update=159
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
> cib_perform_op: +   
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resour
> ce[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:  @transition-
> magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=51,
> @rc-code=0, @op-status=0,  
> @exec-time=70
> Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
> cib_process_request:Completed cib_modify operation for
> section  
> status: OK (rc=0, origin=bd3c7/crmd/159, version=0.22.77)
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
> match_graph_event:  Action mount_monitor_2 (1) confirmed on
> bd3c7  
> (rc=0)
> 
> 
> 
> # Pacemaker cfg 
> 
> ~# pcs resource defaults resource-stickiness=100 ; \
>pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op  
> monitor interval=60s ; \
>pcs resource master drbd master-max=1 master-node-max=1  
> clone-max=2 clone-node-max=1 notify=true ; \
>pcs resource create mount Filesystem device="/dev/drbd0"  
> directory="/data" fstype="ext4" ; \
>pcs constraint colocation add mount with drbd-master
> INFINITY  
> with-rsc-role=Master ; \
>pcs constraint order promote drbd-master then mount ; \
>pcs resource create vip ocf:heartbeat:IPaddr2
> ip=192.168.2.73  
> cidr_netmask=24 op 

Re: [ClusterLabs] Debian 10 pacemaker - CIB did not pass schema validation

2020-03-05 Thread Bala Mutyam
Hi Strahil,

Apologies for my delay. I've attached the config below.

Here is the new error:

crm_verify --verbose --xml-file=/tmp/ansible.yJMg2z.xml
/tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
Invalid sequence in interleave
/tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
Element primitive failed to validate content
/tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
Invalid sequence in interleave
/tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
Element clone failed to validate content
/tmp/ansible.yJMg2z.xml:19: element primitive: Relax-NG validity error :
Element resources has extra content: primitive
(main)  error: CIB did not pass schema validation
Errors found during check: config not valid

Thanks
Bala


On Mon, Mar 2, 2020 at 5:26 PM Strahil Nikolov 
wrote:

> On March 2, 2020 1:22:55 PM GMT+02:00, Bala Mutyam <
> koti.reddy...@gmail.com> wrote:
> >Hi All,
> >
> >I'm trying to setup Pacemaker cluster with 2 VIPs and a group with the
> >VIPs
> >and service for squid proxy. But the CIB verification is failing with
> >below
> >errors. Could someone help me with this please?
> >
> >Errors:
> >
> >crm_verify --verbose --xml-file=/tmp/ansible.oGK0ye.xml
> >/tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error
> >:
> >Invalid sequence in interleave
> >/tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error
> >:
> >Element primitive failed to validate content
> >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> >Invalid sequence in interleave
> >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> >Element group failed to validate content
> >/tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> >Element resources has extra content: group
> >(main)  error: CIB did not pass schema validation
> >Errors found during check: config not valid
>
> And your config is ?
>
> Best Regards,
> Strahil Nikolov
>


-- 
Thanks
Bala


config
Description: Binary data
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Q: rule-based operation pause/freeze?

2020-03-05 Thread Strahil Nikolov
Hi Ulrich,

For HA NFS, you should expect no more than 90s (after the failover is
complete) for NFSv4 clients to recover. Because of that, I think all
resources that depend on it (in the same cluster or another one) should have
a longer monitoring interval. Maybe something like 179s.

Of course, if your NFS will be down for a longer period, you can set
"on-fail=ignore" on all HA resources that depend on it, and remove that again
once the maintenance is over.
After all, what you want is for the cluster not to react during that specific
time, but you should keep track of such changes, as it is easy to forget such
a setting.
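
(A hedged sketch; the resource name and monitor interval are placeholders:)

  # Temporarily ignore monitor failures on a dependent resource during the outage
  pcs resource update myres op monitor interval=30s on-fail=ignore
  # Revert to the default behaviour once the NFS maintenance is over
  pcs resource update myres op monitor interval=30s on-fail=restart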

Another approach is to leave the monitoring interval high enough that the
cluster won't catch the downtime. But imagine that the NFS downtime has to be
extended: do you believe that you will be able to change all affected
resources in time?


Best Regards,
Strahil Nikolov






On Thursday, 5 March 2020, 14:25:36 GMT+2, Ulrich Windl wrote:





Hi!

I'm wondering whether it's possible to pause/freeze specific resource 
operations through rules.
The idea is something like this: If your monitor operation needs (e.g.) some 
external NFS server, and that NFS server is known to be down, it seems better 
to delay the monitor operation until NFS is up again, rather than forcing a 
monitor timeout that will most likely be followed by a stop operation that will 
also time out, eventually killing the node (which has no problem itself).

As I guess it's not possible right now, what would be needed to make this work?
In case it's possible, what would an example scenario look like?

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Resource monitors crash, restart, leave core files

2020-03-05 Thread Jaap Winius



Hi folks,

My test system, which includes support for a filesystem resource  
called 'mount', works fine otherwise, but every day or so I see  
monitor errors like the following when I run 'pcs status':


  Failed Resource Actions:
  * mount_monitor_2 on bd3c7 'unknown error' (1): call=23,  
status=Error, exitreason='',

   last-rc-change='Thu Mar  5 04:57:55 2020', queued=0ms, exec=0ms

The corosync.log shows some more information (see log fragments  
below), but I'm unable to identify a cause. The resource monitor bombs  
out, produces a core dump and then starts up again about 2 seconds  
later. I've also seen this happen with the monitor for my nfsserver  
resource. Apart from the monitor stopping for a few seconds, the other
problem is that this will eventually cause the filesystem holding the
./pacemaker/cores/ directory to fill up with core files (so far, each
is less than 1MB).


Could this be a bug, or is my software not configured correctly (see  
cfg below)?


Thanks,

Jaap

PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20, PCS  
0.9.167 and DRBD 9.10.0.


# corosync.log #

Mar 05 04:57:55 [15652] bd3c7.umrk.nl   lrmd:error:  
child_waitpid:  Managed process 22553 (mount_monitor_2) dumped  
core
Mar 05 04:57:55 [15652] bd3c7.umrk.nl   lrmd:  warning:  
operation_finished: mount_monitor_2:22553 - terminated with signal  
11
Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd:error:  
process_lrm_event:  Result of monitor operation for mount on bd3c7:  
Error | call=23 key=mount_monitor_2 confirmed=false status=4  
cib-update=143

...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd: info:  
abort_transition_graph: Transition aborted by operation  
mount_monitor_2 'create' on bd3c7: Old event |  
magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 cib=0.22.62  
source=process_graph_event:499 complete=true

...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl   crmd: info:  
process_graph_event:Detected action (2.40)  
mount_monitor_2.23=unknown error: failed

...
Mar 05 04:57:56 [15652] bd3c7.umrk.nl   lrmd: info:  
cancel_recurring_action:Cancelling ocf operation mount_monitor_2

...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd:   notice:  
te_rsc_command: Initiating monitor operation mount_monitor_2  
locally on bd3c7 | action 1
Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
do_lrm_rsc_op:  Performing  
key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 op=mount_monitor_2

...
Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
cib_perform_op: +   
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:  @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677,  
@exec-time=0

...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
process_lrm_event:  Result of monitor operation for mount on bd3c7: 0  
(ok) | call=51 key=mount_monitor_2 confirmed=false cib-update=159

...
Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
cib_perform_op: +   
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:  @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=51, @rc-code=0, @op-status=0,  
@exec-time=70
Mar 05 04:57:57 [15650] bd3c7.umrk.nlcib: info:  
cib_process_request:Completed cib_modify operation for section  
status: OK (rc=0, origin=bd3c7/crmd/159, version=0.22.77)
Mar 05 04:57:57 [15655] bd3c7.umrk.nl   crmd: info:  
match_graph_event:  Action mount_monitor_2 (1) confirmed on bd3c7  
(rc=0)




# Pacemaker cfg 

   ~# pcs resource defaults resource-stickiness=100 ; \
  pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op  
monitor interval=60s ; \
  pcs resource master drbd master-max=1 master-node-max=1  
clone-max=2 clone-node-max=1 notify=true ; \
  pcs resource create mount Filesystem device="/dev/drbd0"  
directory="/data" fstype="ext4" ; \
  pcs constraint colocation add mount with drbd-master INFINITY  
with-rsc-role=Master ; \

  pcs constraint order promote drbd-master then mount ; \
  pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73  
cidr_netmask=24 op monitor interval=30s ; \
  pcs constraint colocation add vip with drbd-master INFINITY  
with-rsc-role=Master ; \

  pcs constraint order mount then vip ; \
  pcs resource create nfsd nfsserver nfs_shared_infodir=/data ; \
  pcs resource create nfscfg exportfs clientspec="192.168.2.55"  
options=rw,no_subtree_check,no_root_squash directory=/data fsid=0 ; \

  pcs constraint colocation add nfsd 

[ClusterLabs] Q: rule-based operation pause/freeze?

2020-03-05 Thread Ulrich Windl
Hi!

I'm wondering whether it's possible to pause/freeze specific resource 
operations through rules.
The idea is something like this: If your monitor operation needs (e.g.) some 
external NFS server, and that NFS server is known to be down, it seems better 
to delay the monitor operation until NFS is up again, rather than forcing a 
monitor timeout that will most likely be followed by a stop operation that will 
also time out, eventually killing the node (which has no problem itself).

As I guess it's not possible right now, what would be needed to make this work?
In case it's possible, what would an example scenario look like?

Regards,
Ulrich



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems

2020-03-05 Thread Jehan-Guillaume de Rorthais
Hello,

On Thu, 5 Mar 2020 12:21:14 +0100
Aleksandra C  wrote:
[...]
> I would be very happy to use some help from you.
> 
> I have configured PostgreSQL cluster with Pacemaker+PAF. The pacemaker
> configuration is the following (from
> https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html)
> 
> # pgsqld
> pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
> bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \
> op start timeout=60s \
> op stop timeout=60s  \
> op promote timeout=30s   \
> op demote timeout=120s   \
> op monitor interval=15s timeout=10s role="Master"\
> op monitor interval=16s timeout=10s role="Slave" \
> op notify timeout=60s

If you can, I would recommend using PostgreSQL v11 or v12. Support for v12 is in
PAF 2.3rc2 which is supposed to be released next week.


[...]
> The cluster is behaving in strange way. When I manually fence the master
> node (or ungracefully shutdown), after unfencing/starting, the node has
> status Failed/blocked and the node is constantly fenced(restarted) by the
> fencing agent. Should the fencing recover the cluster as Master/Slave
> without problem?

I suppose a failover occurred after the ungraceful shutdown? The old primary
is probably seen as crashed from PAF's point of view.

Could you share pgsqlms detailed log?

[...]
> Is this a cluster misconfiguration? Any idea would be greatly appreciated.

I don't think so. Make sure to look at
https://clusterlabs.github.io/PAF/administration.html#failover
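
(A hedged sketch of the usual recovery sequence described on that page; the
data directory matches the quoted configuration, while the connection string
pointing at the master VIP is only an assumption:)

  # 1. Rebuild the crashed former primary as a standby, e.g. with pg_rewind
  pg_rewind --target-pgdata=/var/lib/pgsql/9.6/data \
            --source-server="host=192.168.122.50 user=postgres"
  # 2. Then clear the failed demote/stop history so Pacemaker can start it again
  pcs resource cleanup pgsql-ha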

Regards,
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems

2020-03-05 Thread Aleksandra C
Hello community,

I would be very happy to get some help from you.

I have configured PostgreSQL cluster with Pacemaker+PAF. The pacemaker
configuration is the following (from
https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html)

# pgsqld
pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \
op start timeout=60s \
op stop timeout=60s  \
op promote timeout=30s   \
op demote timeout=120s   \
op monitor interval=15s timeout=10s role="Master"\
op monitor interval=16s timeout=10s role="Slave" \
op notify timeout=60s

# pgsql-ha
pcs -f cluster1.xml resource master pgsql-ha pgsqld notify=true

pcs -f cluster1.xml resource create pgsql-master-ip ocf:heartbeat:IPaddr2 \
ip=192.168.122.50 cidr_netmask=24 op monitor interval=10s

pcs -f cluster1.xml constraint colocation add pgsql-master-ip with
master pgsql-ha INFINITY
pcs -f cluster1.xml constraint order promote pgsql-ha then start
pgsql-master-ip symmetrical=false kind=Mandatory
pcs -f cluster1.xml constraint order demote pgsql-ha then stop
pgsql-master-ip symmetrical=false kind=Mandatory

I use fence_xvm fencing agent, with the following configuration:

pcs -f cluster1.xml stonith create fence1 fence_xvm
pcmk_host_check="static-list" pcmk_host_list="srv1" port="srv-m1"
multicast_address=224.0.0.2
pcs -f cluster1.xml stonith create fence2 fence_xvm
pcmk_host_check="static-list" pcmk_host_list="srv2" port="srv-m2"
multicast_address=224.0.0.2

pcs -f cluster1.xml constraint location fence1 avoids srv1=INFINITY
pcs -f cluster1.xml constraint location fence2 avoids srv2=INFINITY

The cluster is behaving in a strange way. When I manually fence the master
node (or shut it down ungracefully), after unfencing/starting, the node has
the status Failed/blocked and is constantly fenced (restarted) by the
fencing agent. Should fencing recover the cluster as Master/Slave
without problems? The error log says that the demote action on the node has
failed:

warning: Action 10 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc:
1): Error
warning: Processing failed op demote for pgsqld:1 on server1: unknown error
(1)
warning: Forcing pgsqld:1 to stop after a failed demote action


Is this a cluster misconfiguration? Any idea would be greatly appreciated.

Thank you in advance,

Aleksandra
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/