[ClusterLabs] Antw: [EXT] Re: Q: rule-based operation pause/freeze?
>>> Ondrej wrote on 06.03.2020 at 01:45 in message
<7499_1583455563_5E619D4B_7499_1105_1_2a18c389-059e-cf6f-a840-dec26437fdd1@famer.cz>:
> On 3/5/20 9:24 PM, Ulrich Windl wrote:
>> Hi!
>>
>> I'm wondering whether it's possible to pause/freeze specific resource
>> operations through rules.
>> The idea is something like this: If your monitor operation needs (e.g.)
>> some external NFS server, and that NFS server is known to be down, it
>> seems better to delay the monitor operation until NFS is up again, rather
>> than forcing a monitor timeout that will most likely be followed by a stop
>> operation that will also time out, eventually killing the node (which has
>> no problem itself).
>>
>> As I guess it's not possible right now, what would be needed to make this
>> work?
>> In case it is possible, what would an example scenario look like?
>>
>> Regards,
>> Ulrich
>
> Hi Ulrich,
>
> For the 'monitor' operation you can disable it with the approach described
> here:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disabling_a_monitor_operation.html
>
> > "followed by a stop operation that will also time out, eventually
> > killing the node (which has no problem itself)"
> This sounds to me like a resource agent "feature", and I would expect
> different resource agents to behave differently when something is
> lost/not present.

Of course. Some RAs are "slim", while others are really "fat" (like calling a
command that uses a REST API to query a Java server that runs a command which
finally checks the status of the service. Maybe even worse.).

> To me the idea here looks like a "maintenance period" for some resource.

No, it's to avoid an "error cascade".

> Is your expectation that the cluster would not do anything with some
> resources for some time?
> (In such a case I would consider 'is-managed'=false + disabling the
> monitor.)
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html#_resource_meta_attributes

Your suggestion would require modifying multiple operations in multiple
resources every time it's needed, while my idea was to "flag" the
corresponding operations once and let some rule decide what to do. Agreed,
the rule would eventually do the same from a higher perspective, but the
configuration would not change every time.

> To determine _when_ this state should be enabled and disabled would be a
> different story.

For the moment let's assume I know it ;-) ping-node, maybe.

Regards,
Ulrich

> --
> Ondrej Famera

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
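For reference, the "disabling a monitor operation" approach linked above works by setting enabled="false" on the op definition. A minimal CIB sketch (the resource name and ids here are illustrative, not taken from this thread):

```xml
<primitive id="mount" class="ocf" provider="heartbeat" type="Filesystem">
  <operations>
    <!-- enabled="false" makes Pacemaker skip scheduling this monitor -->
    <op id="mount-monitor-60s" name="monitor" interval="60s" enabled="false"/>
  </operations>
</primitive>
```

Setting the value back to "true" (or removing the attribute) re-enables the monitor, which is exactly why this has to be repeated per operation and per resource, as Ulrich points out.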
Re: [ClusterLabs] Q: rule-based operation pause/freeze?
On 3/5/20 9:24 PM, Ulrich Windl wrote:
> Hi!
>
> I'm wondering whether it's possible to pause/freeze specific resource
> operations through rules.
> The idea is something like this: If your monitor operation needs (e.g.)
> some external NFS server, and that NFS server is known to be down, it
> seems better to delay the monitor operation until NFS is up again, rather
> than forcing a monitor timeout that will most likely be followed by a stop
> operation that will also time out, eventually killing the node (which has
> no problem itself).
>
> As I guess it's not possible right now, what would be needed to make this
> work?
> In case it is possible, what would an example scenario look like?
>
> Regards,
> Ulrich

Hi Ulrich,

For the 'monitor' operation you can disable it with the approach described
here:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disabling_a_monitor_operation.html

> "followed by a stop operation that will also time out, eventually
> killing the node (which has no problem itself)"

This sounds to me like a resource agent "feature", and I would expect
different resource agents to behave differently when something is
lost/not present.

To me the idea here looks like a "maintenance period" for some resource.
Is your expectation that the cluster would not do anything with some
resources for some time?
(In such a case I would consider 'is-managed'=false + disabling the monitor.)
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html#_resource_meta_attributes

To determine _when_ this state should be enabled and disabled would be a
different story.

--
Ondrej Famera
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Wed, Mar 04, 2020 at 10:05:50AM +0200, Strahil Nikolov wrote:
> Maybe I will be unsubscribed every 10th email instead of every 5th one.

In the default Mailman config the unsubscribe score seems to be 5.0, but you
can only accumulate 1.0 per day, no matter how many bounces there are. Also,
the score is reset to 0 if there are no bounces for 7 days.

https://www.gnu.org/software/mailman/mailman-admin/node25.html

--
Valentin
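The bounce policy described above can be sketched as a toy model. This is only an illustration of the documented behaviour, not Mailman's actual code; the defaults of 5.0 and 7 days come from the linked admin page:

```python
def days_to_unsubscribe(bounce_days, threshold=5.0, reset_after=7):
    """Toy model of Mailman's bounce processing: each day with at least
    one bounce adds 1.0 to the member's score; `reset_after` consecutive
    bounce-free days reset the score to 0; the member is unsubscribed
    once the score reaches `threshold`.

    `bounce_days` is a list of booleans, one per day (True = bounced).
    Returns the day number of the unsubscribe, or None."""
    score, quiet = 0.0, 0
    for day, bounced in enumerate(bounce_days, start=1):
        if bounced:
            score += 1.0       # at most 1.0 per day, regardless of bounce count
            quiet = 0
        else:
            quiet += 1
            if quiet >= reset_after:
                score = 0.0    # a quiet week wipes the slate clean
        if score >= threshold:
            return day
    return None

# Five consecutive bounce days hit the default threshold of 5.0:
print(days_to_unsubscribe([True] * 5))  # 5
```

This also shows why a quiet week matters: four bounce days followed by seven clean days start the count over from zero.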
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Thu, Mar 05, 2020 at 11:07:04PM +0200, Strahil Nikolov wrote:
> After a random number of e-mails, I got a notification that I was
> unsubscribed due to the maximum amount of bounces being reached, but I
> got no e-mail about that from yahoo.
>
> Actually I have no clue about the reason.

Yep, you probably did not get my reply either, so I'm Cc'ing you now to
prevent the split-brain situation :)

https://lists.clusterlabs.org/pipermail/users/2020-March/026939.html

After N people send DKIM-signed mails to the list, they produce N bounces
from yahoo, and Mailman removes you from the list.

--
Valentin
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Thu, Mar 05, 2020 at 11:46:16AM -0600, Ken Gaillot wrote:
> Hmm, not sure what the best approach is. I think some people like
> having the [ClusterLabs] tag in the subject line. If anyone has
> suggested config changes for mailman 2, I can take a look.

In that case it would be best to rewrite the From header to use the list
address; the rest can probably stay as is. More info here:

https://wiki.list.org/DEV/DMARC

--
Valentin
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Thu, Mar 05, 2020 at 11:44:55AM -0600, Ken Gaillot wrote:
> What sort of issue are you seeing exactly? Is your account being
> unsubscribed from the list automatically, or are you not receiving some
> of the emails sent by the list?

He is on yahoo, and based on this Mailman page it seems yahoo rejects
messages with invalid signatures:

https://wiki.list.org/DEV/DMARC

If there are a lot of these rejections for a subscriber, Mailman probably
decides to remove him from the list. This is also in line with the report I
get from yahoo:

  78.46.95.29
  2
  reject
  fail
  fail
  valentin-vidic.from.hr
  valentin-vidic.from.hr permerror
  clusterlabs.org none

--
Valentin
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Wed, 2020-03-04 at 10:44 +0100, Valentin Vidić wrote:
> AFAICT from the reports, the mail I send to the list might not get
> delivered; perhaps this is causing the unsubscribe too:
>
>   78.46.95.29
>   2
>   reject
>   fail
>   fail
>   valentin-vidic.from.hr
>   valentin-vidic.from.hr permerror
>   clusterlabs.org none
>
> For DKIM the problem is that the list modifies the Subject and body, so
> the signature is not valid anymore. The list would need to remove the
> DKIM headers, change the From field to the list address, and perhaps
> add a DKIM signature of its own. Another option is for the list to stop
> modifying messages:
> https://begriffs.com/posts/2018-09-18-dmarc-mailing-list.html

Hmm, not sure what the best approach is. I think some people like having
the [ClusterLabs] tag in the subject line. If anyone has suggested config
changes for mailman 2, I can take a look.

> For SPF it would be good to add SPF records into DNS for the
> clusterlabs.org domain.

We definitely should add SPF records. That might help the "not being
delivered" issue, if mail servers are doing a "SPF or DKIM must pass" test.

--
Ken Gaillot
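The DKIM breakage described here can be demonstrated with a stand-in signature. An HMAC stands in for DKIM's real RSA/Ed25519 signing, and no header canonicalization is modeled; the subject and footer text are illustrative:

```python
import hashlib
import hmac

SECRET = b"sender-dkim-key"  # stand-in for the sender's private signing key

def sign(headers: str, body: str) -> str:
    # DKIM really signs canonicalized headers plus a body hash with an
    # asymmetric key; an HMAC over the same inputs illustrates the point.
    msg = (headers + "\n" + body).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

original_subject = "Subject: clusterlabs.org upgrade done"
body = "Upgrade finished.\n"
sig = sign(original_subject, body)

# The list rewrites the Subject and appends a footer...
tagged_subject = "Subject: [ClusterLabs] clusterlabs.org upgrade done"
tagged_body = body + "___\nManage your subscription: ...\n"

# ...so verification of the sender's signature fails at the receiver:
print(sign(tagged_subject, tagged_body) == sig)  # False
```

This is why the options are: stop modifying messages, or rewrite From and re-sign as the list domain.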
Re: [ClusterLabs] Antw: [EXT] Re: clusterlabs.org upgrade done
On Wed, 2020-03-04 at 10:05 +0200, Strahil Nikolov wrote:
> Maybe I will be unsubscribed every 10th email instead of every 5th one.

Hi Strahil,

What sort of issue are you seeing exactly? Is your account being
unsubscribed from the list automatically, or are you not receiving some of
the emails sent by the list?

--
Ken Gaillot
Re: [ClusterLabs] Debian 10 pacemaker - CIB did not pass schema validation
On Thu, 2020-03-05 at 11:44, Bala Mutyam wrote:
> Hi Strahil,
>
> Apologies for my delay. I've attached the config below.
>
> Here is the new error:
>
> crm_verify --verbose --xml-file=/tmp/ansible.yJMg2z.xml
> /tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
> Invalid sequence in interleave
> /tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
> Element primitive failed to validate content
> /tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
> Invalid sequence in interleave
> /tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
> Element clone failed to validate content
> /tmp/ansible.yJMg2z.xml:19: element primitive: Relax-NG validity error :
> Element resources has extra content: primitive
> (main) error: CIB did not pass schema validation
> Errors found during check: config not valid

The attached config doesn't have any clone elements, so I'm guessing it's
not the /tmp/ansible.yJMg2z.xml mentioned above? The syntax in that tmp file
is not valid (somewhere in the and tags).

> Thanks
> Bala
>
> On Mon, Mar 2, 2020 at 5:26 PM Strahil Nikolov wrote:
> > On March 2, 2020 1:22:55 PM GMT+02:00, Bala Mutyam
> > <koti.reddy...@gmail.com> wrote:
> > > Hi All,
> > >
> > > I'm trying to set up a Pacemaker cluster with 2 VIPs and a group with
> > > the VIPs and a service for squid proxy. But the CIB verification is
> > > failing with the below errors. Could someone help me with this please?
> > >
> > > Errors:
> > >
> > > crm_verify --verbose --xml-file=/tmp/ansible.oGK0ye.xml
> > > /tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error :
> > > Invalid sequence in interleave
> > > /tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error :
> > > Element primitive failed to validate content
> > > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > > Invalid sequence in interleave
> > > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > > Element group failed to validate content
> > > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > > Element resources has extra content: group
> > > (main) error: CIB did not pass schema validation
> > > Errors found during check: config not valid
> >
> > And your config is ?
> >
> > Best Regards,
> > Strahil Nikolov

--
Ken Gaillot
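A side note on reading these errors: the Relax-NG messages mean the file is well-formed XML but violates the pacemaker schema at the indicated lines. A quick sketch of that distinction (the snippet below is illustrative, not Bala's actual config; full schema validation needs the pacemaker schema files and, e.g., lxml's RelaxNG support, which crm_verify applies for you):

```python
import xml.etree.ElementTree as ET

# Illustrative resources section; well-formed XML parses fine even in
# cases where the element nesting would fail crm_verify's Relax-NG check.
snippet = """
<resources>
  <group id="proxy-group">
    <primitive id="vip1" class="ocf" provider="heartbeat" type="IPaddr2"/>
  </group>
</resources>
"""

root = ET.fromstring(snippet)  # raises ParseError only on malformed XML
print(root.tag)        # resources
print(root[0][0].tag)  # primitive
```

So "Invalid sequence in interleave" is a schema complaint, not a syntax one: some element sits where the schema does not allow it.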
Re: [ClusterLabs] Resource monitors crash, restart, leave core files
On Thu, 2020-03-05 at 13:14, Jaap Winius wrote:
> Hi folks,
>
> My test system, which includes support for a filesystem resource called
> 'mount', works fine otherwise, but every day or so I see monitor errors
> like the following when I run 'pcs status':
>
>   Failed Resource Actions:
>   * mount_monitor_2 on bd3c7 'unknown error' (1): call=23, status=Error,
>     exitreason='', last-rc-change='Thu Mar 5 04:57:55 2020', queued=0ms,
>     exec=0ms
>
> The corosync.log shows some more information (see log fragments below),
> but I'm unable to identify a cause. The resource monitor bombs out,
> produces a core dump and then starts up again about 2 seconds later.
> I've also seen this happen with the monitor for my nfsserver resource.
> Other than it stopping for a few seconds, the other problem is that this
> will eventually cause the filesystem with the ./pacemaker/cores/
> directory to fill up with core files (so far, each is less than 1MB).
>
> Could this be a bug, or is my software not configured correctly (see
> cfg below)?
>
> Thanks,
>
> Jaap
>
> PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20,
> PCS 0.9.167 and DRBD 9.10.0.
>
> # corosync.log #
>
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: error: child_waitpid:
>   Managed process 22553 (mount_monitor_2) dumped core

This would have to be a bug in the resource agent. I'd build it with debug
symbols to get a backtrace from the core.

> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: warning: operation_finished:
>   mount_monitor_2:22553 - terminated with signal 11
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: error: process_lrm_event:
>   Result of monitor operation for mount on bd3c7: Error | call=23
>   key=mount_monitor_2 confirmed=false status=4 cib-update=143
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: abort_transition_graph:
>   Transition aborted by operation mount_monitor_2 'create' on bd3c7:
>   Old event | magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953
>   cib=0.22.62 source=process_graph_event:499 complete=true
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: process_graph_event:
>   Detected action (2.40) mount_monitor_2.23=unknown error: failed
> ...
> Mar 05 04:57:56 [15652] bd3c7.umrk.nl lrmd: info: cancel_recurring_action:
>   Cancelling ocf operation mount_monitor_2
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: notice: te_rsc_command:
>   Initiating monitor operation mount_monitor_2 locally on bd3c7 | action 1
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: do_lrm_rsc_op:
>   Performing key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953
>   op=mount_monitor_2
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: +
>   /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:
>   @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
>   @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
>   @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677,
>   @exec-time=0
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: process_lrm_event:
>   Result of monitor operation for mount on bd3c7: 0 (ok) | call=51
>   key=mount_monitor_2 confirmed=false cib-update=159
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: +
>   /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:
>   @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
>   @call-id=51, @rc-code=0, @op-status=0, @exec-time=70
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_process_request:
>   Completed cib_modify operation for section status: OK (rc=0,
>   origin=bd3c7/crmd/159, version=0.22.77)
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: match_graph_event:
>   Action mount_monitor_2 (1) confirmed on bd3c7 (rc=0)
>
> # Pacemaker cfg
>
> ~# pcs resource defaults resource-stickiness=100 ; \
>    pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s ; \
>    pcs resource master drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true ; \
>    pcs resource create mount Filesystem device="/dev/drbd0" directory="/data" fstype="ext4" ; \
>    pcs constraint colocation add mount with drbd-master INFINITY with-rsc-role=Master ; \
>    pcs constraint order promote drbd-master then mount ; \
>    pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73 cidr_netmask=24 op
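Acting on the backtrace suggestion means pairing the newest core with the executable that actually crashed. A small helper that only builds the gdb invocation without running it (the default binary path is an assumption based on typical CentOS 7 pacemaker packaging; run `file` on the core first to confirm which executable dumped it, and install the matching debuginfo):

```python
from pathlib import Path

def gdb_backtrace_cmd(cores_dir, binary="/usr/libexec/pacemaker/lrmd"):
    """Return the gdb argv for the newest core file under cores_dir,
    or None if there is none. The command is only constructed here,
    not executed."""
    cores = sorted(Path(cores_dir).glob("**/core*"),
                   key=lambda p: p.stat().st_mtime)
    if not cores:
        return None
    # -batch -ex 'bt full' prints a full backtrace and exits
    return ["gdb", binary, str(cores[-1]), "-batch", "-ex", "bt full"]
```

With a real core this would be run as, e.g., `gdb /usr/libexec/pacemaker/lrmd core.22553 -batch -ex 'bt full'`.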
Re: [ClusterLabs] Debian 10 pacemaker - CIB did not pass schema validation
Hi Strahil,

Apologies for my delay. I've attached the config below.

Here is the new error:

crm_verify --verbose --xml-file=/tmp/ansible.yJMg2z.xml
/tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
Invalid sequence in interleave
/tmp/ansible.yJMg2z.xml:28: element primitive: Relax-NG validity error :
Element primitive failed to validate content
/tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
Invalid sequence in interleave
/tmp/ansible.yJMg2z.xml:28: element clone: Relax-NG validity error :
Element clone failed to validate content
/tmp/ansible.yJMg2z.xml:19: element primitive: Relax-NG validity error :
Element resources has extra content: primitive
(main) error: CIB did not pass schema validation
Errors found during check: config not valid

Thanks
Bala

On Mon, Mar 2, 2020 at 5:26 PM Strahil Nikolov wrote:
> On March 2, 2020 1:22:55 PM GMT+02:00, Bala Mutyam
> <koti.reddy...@gmail.com> wrote:
> > Hi All,
> >
> > I'm trying to set up a Pacemaker cluster with 2 VIPs and a group with
> > the VIPs and a service for squid proxy. But the CIB verification is
> > failing with the below errors. Could someone help me with this please?
> >
> > Errors:
> >
> > crm_verify --verbose --xml-file=/tmp/ansible.oGK0ye.xml
> > /tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error :
> > Invalid sequence in interleave
> > /tmp/ansible.oGK0ye.xml:17: element primitive: Relax-NG validity error :
> > Element primitive failed to validate content
> > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > Invalid sequence in interleave
> > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > Element group failed to validate content
> > /tmp/ansible.oGK0ye.xml:17: element group: Relax-NG validity error :
> > Element resources has extra content: group
> > (main) error: CIB did not pass schema validation
> > Errors found during check: config not valid
>
> And your config is ?
>
> Best Regards,
> Strahil Nikolov

--
Thanks
Bala

(Attachment: config, binary data)
Re: [ClusterLabs] Q: rule-based operation pause/freeze?
Hi Ulrich,

For HA NFS, you should expect no more than 90s (after the failover is
complete) for NFSv4 clients to recover. Because of that, I think all
resources (in the same cluster or another one) that depend on it should use
a longer monitoring interval. Maybe something like 179s.

Of course, if your NFS will be down for a longer period, you can set all HA
resources that depend on it with "on-fail=ignore", and remove that setting
once the maintenance is over. After all, you want the cluster not to react
for that specific time, but you should keep track of such changes, as it is
easy to forget a setting like this.

Another approach is to leave the monitoring interval high enough that the
cluster won't catch the downtime. But imagine that the downtime of the NFS
has to be extended: do you believe you will be able to change all affected
resources in time?

Best Regards,
Strahil Nikolov

On Thursday, March 5, 2020, 14:25:36 GMT+2, Ulrich Windl wrote:

> Hi!
>
> I'm wondering whether it's possible to pause/freeze specific resource
> operations through rules.
> The idea is something like this: If your monitor operation needs (e.g.)
> some external NFS server, and that NFS server is known to be down, it
> seems better to delay the monitor operation until NFS is up again, rather
> than forcing a monitor timeout that will most likely be followed by a stop
> operation that will also time out, eventually killing the node (which has
> no problem itself).
>
> As I guess it's not possible right now, what would be needed to make this
> work?
> In case it is possible, what would an example scenario look like?
>
> Regards,
> Ulrich
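The "on-fail=ignore" suggestion above maps to the on-fail attribute of the op element in the CIB. A minimal sketch (the op id, the 179s interval, and the timeout are illustrative):

```xml
<!-- With on-fail="ignore" the cluster records the monitor failure but
     takes no recovery action; remember to remove it after maintenance. -->
<op id="app-monitor-179s" name="monitor" interval="179s" timeout="60s"
    on-fail="ignore"/>
```

The trade-off is exactly the one raised in the reply: the cluster stays quiet during the outage, but nothing reminds you to restore normal failure handling afterwards.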
[ClusterLabs] Resource monitors crash, restart, leave core files
Hi folks,

My test system, which includes support for a filesystem resource called
'mount', works fine otherwise, but every day or so I see monitor errors like
the following when I run 'pcs status':

  Failed Resource Actions:
  * mount_monitor_2 on bd3c7 'unknown error' (1): call=23, status=Error,
    exitreason='', last-rc-change='Thu Mar 5 04:57:55 2020', queued=0ms,
    exec=0ms

The corosync.log shows some more information (see log fragments below), but
I'm unable to identify a cause. The resource monitor bombs out, produces a
core dump and then starts up again about 2 seconds later. I've also seen
this happen with the monitor for my nfsserver resource. Other than it
stopping for a few seconds, the other problem is that this will eventually
cause the filesystem with the ./pacemaker/cores/ directory to fill up with
core files (so far, each is less than 1MB).

Could this be a bug, or is my software not configured correctly (see cfg
below)?

Thanks,

Jaap

PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20,
PCS 0.9.167 and DRBD 9.10.0.

# corosync.log #

Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: error: child_waitpid:
  Managed process 22553 (mount_monitor_2) dumped core
Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: warning: operation_finished:
  mount_monitor_2:22553 - terminated with signal 11
Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: error: process_lrm_event:
  Result of monitor operation for mount on bd3c7: Error | call=23
  key=mount_monitor_2 confirmed=false status=4 cib-update=143
...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: abort_transition_graph:
  Transition aborted by operation mount_monitor_2 'create' on bd3c7:
  Old event | magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953
  cib=0.22.62 source=process_graph_event:499 complete=true
...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: process_graph_event:
  Detected action (2.40) mount_monitor_2.23=unknown error: failed
...
Mar 05 04:57:56 [15652] bd3c7.umrk.nl lrmd: info: cancel_recurring_action:
  Cancelling ocf operation mount_monitor_2
...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: notice: te_rsc_command:
  Initiating monitor operation mount_monitor_2 locally on bd3c7 | action 1
Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: do_lrm_rsc_op:
  Performing key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953
  op=mount_monitor_2
...
Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: +
  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:
  @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
  @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
  @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677,
  @exec-time=0
...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: process_lrm_event:
  Result of monitor operation for mount on bd3c7: 0 (ok) | call=51
  key=mount_monitor_2 confirmed=false cib-update=159
...
Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: +
  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_2']:
  @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953,
  @call-id=51, @rc-code=0, @op-status=0, @exec-time=70
Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_process_request:
  Completed cib_modify operation for section status: OK (rc=0,
  origin=bd3c7/crmd/159, version=0.22.77)
Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: match_graph_event:
  Action mount_monitor_2 (1) confirmed on bd3c7 (rc=0)

# Pacemaker cfg

~# pcs resource defaults resource-stickiness=100 ; \
   pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s ; \
   pcs resource master drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true ; \
   pcs resource create mount Filesystem device="/dev/drbd0" directory="/data" fstype="ext4" ; \
   pcs constraint colocation add mount with drbd-master INFINITY with-rsc-role=Master ; \
   pcs constraint order promote drbd-master then mount ; \
   pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73 cidr_netmask=24 op monitor interval=30s ; \
   pcs constraint colocation add vip with drbd-master INFINITY with-rsc-role=Master ; \
   pcs constraint order mount then vip ; \
   pcs resource create nfsd nfsserver nfs_shared_infodir=/data ; \
   pcs resource create nfscfg exportfs clientspec="192.168.2.55" options=rw,no_subtree_check,no_root_squash directory=/data fsid=0 ; \
   pcs constraint colocation add nfsd
[ClusterLabs] Q: rule-based operation pause/freeze?
Hi!

I'm wondering whether it's possible to pause/freeze specific resource
operations through rules.

The idea is something like this: If your monitor operation needs (e.g.) some
external NFS server, and that NFS server is known to be down, it seems
better to delay the monitor operation until NFS is up again, rather than
forcing a monitor timeout that will most likely be followed by a stop
operation that will also time out, eventually killing the node (which has no
problem itself).

As I guess it's not possible right now, what would be needed to make this
work? In case it is possible, what would an example scenario look like?

Regards,
Ulrich
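To make the request concrete, here is a sketch of what such a rule-gated operation *might* look like if it existed. This is hypothetical syntax that Pacemaker does not accept; it only illustrates the idea of flagging an operation once and letting a rule (driven by a node attribute such as one maintained by ocf:pacemaker:ping or attrd_updater; the attribute name here is made up) decide whether the monitor runs:

```xml
<!-- HYPOTHETICAL: Pacemaker does not honor rules here; illustration only. -->
<op id="mount-monitor-60s" name="monitor" interval="60s">
  <instance_attributes id="mount-monitor-gate">
    <rule id="mount-monitor-nfs-up" score="0">
      <expression id="mount-monitor-nfs-up-expr"
                  attribute="nfs-reachable" operation="eq" value="1"/>
    </rule>
    <nvpair id="mount-monitor-enabled" name="enabled" value="true"/>
  </instance_attributes>
</op>
```

The rule and expression elements follow the shape Pacemaker already uses elsewhere (e.g. in location constraints); only their placement inside an op is the speculative part.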
Re: [ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems
Hello,

On Thu, 5 Mar 2020 12:21:14 +0100, Aleksandra C wrote:

[...]
> I would be very happy to use some help from you.
>
> I have configured a PostgreSQL cluster with Pacemaker+PAF. The pacemaker
> configuration is the following (from
> https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html)
>
> # pgsqld
> pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
>   bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \
>   op start timeout=60s \
>   op stop timeout=60s \
>   op promote timeout=30s \
>   op demote timeout=120s \
>   op monitor interval=15s timeout=10s role="Master" \
>   op monitor interval=16s timeout=10s role="Slave" \
>   op notify timeout=60s

If you can, I would recommend using PostgreSQL v11 or v12. Support for v12
is in PAF 2.3rc2, which is supposed to be released next week.

[...]
> The cluster is behaving in a strange way. When I manually fence the master
> node (or ungracefully shut it down), after unfencing/starting, the node
> has status Failed/blocked and the node is constantly fenced (restarted) by
> the fencing agent. Should fencing recover the cluster as Master/Slave
> without problems?

I suppose a failover occurred after the ungraceful shutdown? The old primary
is probably seen as crashed from PAF's point of view. Could you share the
detailed pgsqlms log?

[...]
> Is this a cluster misconfiguration? Any idea would be greatly appreciated.

I don't think so. Make sure to look at
https://clusterlabs.github.io/PAF/administration.html#failover

Regards,
[ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems
Hello community,

I would be very happy to use some help from you.

I have configured a PostgreSQL cluster with Pacemaker+PAF. The pacemaker
configuration is the following (from
https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html):

# pgsqld
pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
  bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \
  op start timeout=60s \
  op stop timeout=60s \
  op promote timeout=30s \
  op demote timeout=120s \
  op monitor interval=15s timeout=10s role="Master" \
  op monitor interval=16s timeout=10s role="Slave" \
  op notify timeout=60s

# pgsql-ha
pcs -f cluster1.xml resource master pgsql-ha pgsqld notify=true

pcs -f cluster1.xml resource create pgsql-master-ip ocf:heartbeat:IPaddr2 \
  ip=192.168.122.50 cidr_netmask=24 op monitor interval=10s

pcs -f cluster1.xml constraint colocation add pgsql-master-ip with master pgsql-ha INFINITY
pcs -f cluster1.xml constraint order promote pgsql-ha then start pgsql-master-ip symmetrical=false kind=Mandatory
pcs -f cluster1.xml constraint order demote pgsql-ha then stop pgsql-master-ip symmetrical=false kind=Mandatory

I use the fence_xvm fencing agent, with the following configuration:

pcs -f cluster1.xml stonith create fence1 fence_xvm pcmk_host_check="static-list" pcmk_host_list="srv1" port="srv-m1" multicast_address=224.0.0.2
pcs -f cluster1.xml stonith create fence2 fence_xvm pcmk_host_check="static-list" pcmk_host_list="srv2" port="srv-m2" multicast_address=224.0.0.2
pcs -f cluster1.xml constraint location fence1 avoids srv1=INFINITY
pcs -f cluster1.xml constraint location fence2 avoids srv2=INFINITY

The cluster is behaving in a strange way. When I manually fence the master
node (or ungracefully shut it down), after unfencing/starting, the node has
status Failed/blocked and the node is constantly fenced (restarted) by the
fencing agent. Should fencing recover the cluster as Master/Slave without
problems?

The error log says that the demote action on the node has failed:

warning: Action 10 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
warning: Processing failed op demote for pgsqld:1 on server1: unknown error (1)
warning: Forcing pgsqld:1 to stop after a failed demote action

Is this a cluster misconfiguration? Any idea would be greatly appreciated.

Thank you in advance,
Aleksandra