Re: [ClusterLabs] Limit of concurrent resources to start?
On 13.06.2018 at 16:18, Ken Gaillot wrote:
> On Wed, 2018-06-13 at 14:25 +0200, Michael Schwartzkopff wrote:
>> On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff" wrote:
>>
>>> Hi,
>>>
>>> we have a cluster with several IP addresses that can start after
>>> another resource. In the logs we see that only 2 IP addresses start
>>> in parallel, not all. Can anyone please explain why not all IP
>>> addresses start in parallel?
>>>
>>> Config:
>>> primitive resProc ocf:myprovider:Proc
>>> (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
>>> order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
>>> collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
>>>
>>> No batch-limit in properties.
>>> Any ideas? Thanks.
>>>
>>> Michael
>
> Each node has a limit of how many jobs it can execute in parallel. In
> order of most preferred to least, it will be:
>
> * The value of the (undocumented) PCMK_node_action_limit environment
>   variable on that node (no limit if not set)
>
> * The value of the (also undocumented) node-action-limit cluster
>   property (defaulting to 0, meaning no limit)
>
> * Twice the node's number of CPU cores (as reported by /proc/stat)
>
> Also, the cluster will auto-calculate a cluster-wide batch-limit if
> high load is observed on any node.
>
> So, you could mostly override throttling by setting a high
> node-action-limit.
>
>> Hi,
>>
>> additional remark:
>>
>> With some tweaks I made my cluster start two resources (i.e. IP1 and
>> IP2) at the same time. But it takes about 4 seconds until the
>> cluster starts the next resources (i.e. IP3 and IP4).
>>
>> Did anybody see this behaviour before?
>>
>> Why does my cluster not start all "parallel" resources together?
>>
>> Michael.
>
> Ken Gaillot

Thanks for this clarification.

Kind regards,

--
[*] sys4 AG
https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München
Registered office: München, Amtsgericht München: HRB 199263
Management board: Patrick Ben Koetter, Marc Schiffbauer, Wolfgang Stief
Chairman of the supervisory board: Florian Kirstein
Re: [ClusterLabs] Limit of concurrent resources to start?
On Wed, 2018-06-13 at 14:25 +0200, Michael Schwartzkopff wrote:
> On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff" wrote:
>
>> Hi,
>>
>> we have a cluster with several IP addresses that can start after
>> another resource. In the logs we see that only 2 IP addresses start
>> in parallel, not all. Can anyone please explain why not all IP
>> addresses start in parallel?
>>
>> Config:
>> primitive resProc ocf:myprovider:Proc
>> (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
>> order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
>> collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
>>
>> No batch-limit in properties.
>> Any ideas? Thanks.
>>
>> Michael

Each node has a limit of how many jobs it can execute in parallel. In
order of most preferred to least, it will be:

* The value of the (undocumented) PCMK_node_action_limit environment
  variable on that node (no limit if not set)

* The value of the (also undocumented) node-action-limit cluster
  property (defaulting to 0, meaning no limit)

* Twice the node's number of CPU cores (as reported by /proc/stat)

Also, the cluster will auto-calculate a cluster-wide batch-limit if
high load is observed on any node.

So, you could mostly override throttling by setting a high
node-action-limit.

> Hi,
>
> additional remark:
>
> With some tweaks I made my cluster start two resources (i.e. IP1 and
> IP2) at the same time. But it takes about 4 seconds until the
> cluster starts the next resources (i.e. IP3 and IP4).
>
> Did anybody see this behaviour before?
>
> Why does my cluster not start all "parallel" resources together?
>
> Michael.

--
Ken Gaillot
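For reference, a minimal sketch of how the knobs described above can be raised. The value 20 is purely illustrative, the crmsh and pcs lines are alternatives, a pcs version that does not know the property may need --force, and the sysconfig path is the usual RHEL-style location:

  # cluster-wide property (crmsh or pcs)
  crm configure property node-action-limit=20
  pcs property set node-action-limit=20

  # per-node override, set in /etc/sysconfig/pacemaker on that node
  PCMK_node_action_limit=20

  # explicit transition-wide cap; 0 lets the cluster calculate it
  crm configure property batch-limit=0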
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
>
> Thanks for the reply.
>
> Recently, I ran the following command:
> (clustera) # crm_simulate --xml-file pe-warn.last
>
> it returns the following results:
> error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
> error: crm_element_value: Couldn't find validate-with in NULL

It looks like pe-warn.last somehow got corrupted. It appears not to be a
full CIB file. If the original was compressed (.gz/.bz2 extension) and
you didn't uncompress it, re-add the extension -- that's how pacemaker
knows to uncompress it.

> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
> error: crm_element_value: Couldn't find validate-with in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_element_value: Couldn't find ignore-dtd in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
> error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
> Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
>
> Could anyone help me how to read those messages and what's going on
> with my server?
>
> Thanks a lot..
>
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:
> > On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > > Hi Andrei,
> > >
> > > Thanks for your quick reply. Still need help as below:
> > >
> > > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > > 06.06.2018 04:27, Albert Weng wrote:
> > > > > Hi All,
> > > > >
> > > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > > >
> > > > > Here is my environment:
> > > > > clustera : 192.168.11.1 (passive)
> > > > > clusterb : 192.168.11.2 (master)
> > > > > clustera-ilo4 : 192.168.11.10
> > > > > clusterb-ilo4 : 192.168.11.11
> > > > >
> > > > > cluster resource status :
> > > > > cluster_fs      started on clusterb
> > > > > cluster_vip     started on clusterb
> > > > > cluster_sid     started on clusterb
> > > > > cluster_listnr  started on clusterb
> > > > >
> > > > > Both cluster nodes are in online status.
> > > > >
> > > > > I found my corosync.log contains many records like below:
> > > > >
> > > > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > > > clustera pengine: info: determine_online_status: Node clustera is online
> > > > >
> > > > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How to fix it?*
> > > >
> > > > pacemaker does not have a concept of "passive" or "master" node - it
> > > > is up to you to decide when you configure resource placement.
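A minimal sketch of Ken's suggestion above, with illustrative file names (the point being that crm_simulate relies on the .bz2/.gz extension to know it must decompress the saved policy-engine input):

  # keep (or restore) the compression extension before feeding the file to crm_simulate
  cp /var/lib/pacemaker/pengine/pe-warn-0.bz2 /tmp/pe-warn.last.bz2
  crm_simulate --xml-file /tmp/pe-warn.last.bz2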
Re: [ClusterLabs] Limit of concurrent resources to start?
On Wednesday, June 13, 2018 10:01 CEST, "Michael Schwartzkopff" wrote:

> Hi,
>
> we have a cluster with several IP addresses that can start after another
> resource. In the logs we see that only 2 IP addresses start in parallel,
> not all. Can anyone please explain why not all IP addresses start in
> parallel?
>
> Config:
> primitive resProc ocf:myprovider:Proc
> (ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
> order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
> collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc
>
> No batch-limit in properties.
> Any ideas? Thanks.
>
> Michael

Hi,

additional remark:

With some tweaks I made my cluster start two resources (i.e. IP1 and IP2)
at the same time. But it takes about 4 seconds until the cluster starts
the next resources (i.e. IP3 and IP4).

Did anybody see this behaviour before?

Why does my cluster not start all "parallel" resources together?

Michael.
Re: [ClusterLabs] Questions about SBD behavior
On 06/13/2018 10:58 AM, 井上 和徳 wrote:
> Thanks for the response.
>
> As of v1.3.1 and later, I recognized that real quorum is necessary.
> I also read this:
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery
>
> As related to this specification, in order to use pacemaker-2.0,
> we are confirming the following known issue.
>
> * When SIGSTOP is sent to the pacemaker process, no failure of the
>   resource will be detected.
>   https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
>   https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html
>
> I expected that it was being handled by SBD, but no one detected that
> the following processes were frozen. Therefore, no failure of the
> resource was detected either.
> - pacemaker-based
> - pacemaker-execd
> - pacemaker-attrd
> - pacemaker-schedulerd
> - pacemaker-controld
>
> I confirmed this, but I couldn't read about the correspondence
> situation.
> https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf

You are right. The issue was already known when I created these slides,
so a plan for improving the observation of the pacemaker daemons should
probably have gone into them. Thanks for bringing this to the table.
I guess the issue got a little bit neglected recently.

> As a result of our discussion, we want SBD to detect it and reset the
> machine.

Implementation-wise I would go for some kind of a split solution between
Pacemaker and SBD: Pacemaker observing its sub-daemons by itself, while
there would be some kind of a heartbeat (implicitly via corosync or
explicitly) between Pacemaker and SBD that assures this internal
observation is doing its job properly.

> Also, for users who do not have a shared disk or qdevice,
> we need an option to work even without real quorum.
> (we plan to avoid fence races with the delay attribute:
> https://access.redhat.com/solutions/91653
> https://access.redhat.com/solutions/1293523)

I'm not sure if I get your point here. Watchdog fencing on a 2-node
cluster without an additional qdevice or shared disk is like denying the
laws of physics, in my mind. At the moment I don't see why
auto_tie_breaker wouldn't work on a 4-node and up cluster here.

Regards,
Klaus

> Best Regards,
> Kazunori INOUE
>
>> -----Original Message-----
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus Wenninger
>> Sent: Friday, May 25, 2018 4:08 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Questions about SBD behavior
>>
>> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
>>> Hi,
>>>
>>> I am checking the watchdog function of SBD (without a shared block device).
>>> In a two-node cluster, if one node is stopped, the watchdog is triggered on
>>> the remaining node.
>>> Is this the designed behavior?
>>
>> SBD without a shared block device doesn't really make sense on
>> a two-node cluster.
>> The basic idea is - e.g. in a case of a networking problem -
>> that a cluster splits up into a quorate and a non-quorate partition.
>> The quorate partition stays up while SBD guarantees a
>> reliable watchdog-based self-fencing of the non-quorate partition
>> within a defined timeout.
>> This idea of course doesn't work with just 2 nodes.
>> Taking quorum info from the 2-node feature of corosync (automatically
>> switching on wait-for-all) doesn't help in this case but instead
>> would lead to split-brain.
>> What you can do - and what e.g. pcs does automatically - is enable
>> the auto_tie_breaker instead of two_node in corosync. But that
>> still doesn't give you a higher availability than that of the
>> winner of auto_tie_breaker. (Maybe interesting if you are going
>> for a load-balancing scenario that doesn't affect availability, or
>> for a transient state while setting up a cluster node-by-node ...)
>> What you can do though is use qdevice to still have 'real quorum'
>> info with just 2 full cluster nodes.
>>
>> There was quite a lot of discussion around this topic on this
>> list previously if you search the history.
>>
>> Regards,
>> Klaus
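A minimal corosync.conf quorum sketch of the auto_tie_breaker setup Klaus describes; the values are illustrative and corosync needs to be restarted (or its configuration reloaded) on all nodes afterwards:

  quorum {
      provider: corosync_votequorum
      # instead of: two_node: 1
      auto_tie_breaker: 1
      # on an even split, the partition containing the lowest node id survives
      auto_tie_breaker_node: lowest
  }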
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
Hi All,

Thanks for the reply.

Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last

it returns the following results:
error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_element_value: Couldn't find ignore-dtd in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me how to read those messages and what's going on with
my server?

Thanks a lot..

On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:
> On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > Hi Andrei,
> >
> > Thanks for your quick reply. Still need help as below:
> >
> > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> > > 06.06.2018 04:27, Albert Weng wrote:
> > > > Hi All,
> > > >
> > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > >
> > > > Here is my environment:
> > > > clustera : 192.168.11.1 (passive)
> > > > clusterb : 192.168.11.2 (master)
> > > > clustera-ilo4 : 192.168.11.10
> > > > clusterb-ilo4 : 192.168.11.11
> > > >
> > > > cluster resource status :
> > > > cluster_fs      started on clusterb
> > > > cluster_vip     started on clusterb
> > > > cluster_sid     started on clusterb
> > > > cluster_listnr  started on clusterb
> > > >
> > > > Both cluster nodes are in online status.
> > > >
> > > > I found my corosync.log contains many records like below:
> > > >
> > > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > > clustera pengine: info: determine_online_status: Node clustera is online
> > > >
> > > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How to fix it?*
> > >
> > > pacemaker does not have a concept of "passive" or "master" node - it is
> > > up to you to decide when you configure resource placement. By default
> > > pacemaker will attempt to spread resources across all eligible nodes.
> > > You can influence node selection by using constraints. See
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > > for details.
> > >
> > > But in any case - all your resources MUST be capable of running on both
> > > nodes, otherwise the cluster makes no sense. If one resource A depends
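Regarding the constraints Andrei points to above, a minimal sketch in pcs syntax, using the resource and node names from this thread (the score of 100 is illustrative; unlike INFINITY it still allows failover to the other node):

  # prefer clusterb for cluster_sid without forbidding clustera
  pcs constraint location cluster_sid prefers clusterb=100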
Re: [ClusterLabs] Questions about SBD behavior
Thanks for the response.

As of v1.3.1 and later, I recognized that real quorum is necessary.
I also read this:
https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker#Watchdog-based_self-fencing_with_resource_recovery

As related to this specification, in order to use pacemaker-2.0, we are
confirming the following known issue.

* When SIGSTOP is sent to the pacemaker process, no failure of the
  resource will be detected.
  https://lists.clusterlabs.org/pipermail/users/2016-September/011146.html
  https://lists.clusterlabs.org/pipermail/users/2016-October/011429.html

I expected that it was being handled by SBD, but no one detected that the
following processes were frozen. Therefore, no failure of the resource
was detected either.
- pacemaker-based
- pacemaker-execd
- pacemaker-attrd
- pacemaker-schedulerd
- pacemaker-controld

I confirmed this, but I couldn't read about the correspondence situation.
https://wiki.clusterlabs.org/w/images/1/1a/Recent_Work_and_Future_Plans_for_SBD_1.1.pdf

As a result of our discussion, we want SBD to detect it and reset the
machine.

Also, for users who do not have a shared disk or qdevice, we need an
option to work even without real quorum.
(we plan to avoid fence races with the delay attribute:
https://access.redhat.com/solutions/91653
https://access.redhat.com/solutions/1293523)

Best Regards,
Kazunori INOUE

> -----Original Message-----
> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Klaus Wenninger
> Sent: Friday, May 25, 2018 4:08 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Questions about SBD behavior
>
> On 05/25/2018 07:31 AM, 井上 和徳 wrote:
>> Hi,
>>
>> I am checking the watchdog function of SBD (without a shared block device).
>> In a two-node cluster, if one node is stopped, the watchdog is triggered on
>> the remaining node.
>> Is this the designed behavior?
>
> SBD without a shared block device doesn't really make sense on
> a two-node cluster.
> The basic idea is - e.g. in a case of a networking problem -
> that a cluster splits up into a quorate and a non-quorate partition.
> The quorate partition stays up while SBD guarantees a
> reliable watchdog-based self-fencing of the non-quorate partition
> within a defined timeout.
> This idea of course doesn't work with just 2 nodes.
> Taking quorum info from the 2-node feature of corosync (automatically
> switching on wait-for-all) doesn't help in this case but instead
> would lead to split-brain.
> What you can do - and what e.g. pcs does automatically - is enable
> the auto_tie_breaker instead of two_node in corosync. But that
> still doesn't give you a higher availability than that of the
> winner of auto_tie_breaker. (Maybe interesting if you are going
> for a load-balancing scenario that doesn't affect availability, or
> for a transient state while setting up a cluster node-by-node ...)
> What you can do though is use qdevice to still have 'real quorum'
> info with just 2 full cluster nodes.
>
> There was quite a lot of discussion around this topic on this
> list previously if you search the history.
>
> Regards,
> Klaus
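For the fence-race avoidance mentioned above, a minimal sketch in pcs syntax. The device name and the 10-second value are illustrative and assume the fence agent supports a delay parameter; putting the delay on the device that targets the node you want to survive means that node is fenced last and therefore wins the race:

  # delay fencing of clusterb so it can fence clustera first in a race
  pcs stonith update fence-clusterb-ilo4 delay=10

Newer Pacemaker versions also provide pcmk_delay_base / pcmk_delay_max on the stonith resource for the same purpose.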
[ClusterLabs] Limit of concurrent resources to start?
Hi,

we have a cluster with several IP addresses that can start after another
resource. In the logs we see that only 2 IP addresses start in parallel,
not all. Can anyone please explain why not all IP addresses start in
parallel?

Config:
primitive resProc ocf:myprovider:Proc
(ten times:) primitive resIP1 ocf:heartbeat:IPaddr2 params ip="192.168.100.1"
order ord_Proc_IP Mandatory: resProc ( resIP1 resIP2 ... )
collocation col_IP_Proc inf: (resIP1 resIP2 ...) resProc

No batch-limit in properties.
Any ideas? Thanks.

Michael

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org