Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
>
> Thanks for the reply.
>
> Recently, I ran the following command:
> (clustera) # crm_simulate --xml-file pe-warn.last
>
> It returns the following results:
> error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
> error: crm_element_value: Couldn't find validate-with in NULL

It looks like pe-warn.last somehow got corrupted; it appears not to be a full CIB file. If the original was compressed (.gz/.bz2 extension) and you didn't uncompress it, re-add the extension -- that's how pacemaker knows to uncompress it.

> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
> error: crm_element_value: Couldn't find validate-with in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_element_value: Couldn't find ignore-dtd in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> [the validate_with assert above repeats 17 times]
> error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
> error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
> Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
>
> Could anyone help me read those messages and figure out what's going on with my server?
>
> Thanks a lot..
>
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:
> > [...]
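Ken's advice about re-adding the compression extension can be sketched as follows. This is only an illustration: the file name comes from the thread, and whether the file is actually bzip2-compressed is an assumption (Pacemaker normally saves policy-engine inputs compressed, e.g. pe-warn-0.bz2).

```shell
# Assumption: pe-warn.last is a bzip2-compressed PE input that lost its extension.
mv pe-warn.last pe-warn.last.bz2   # re-add the extension so crm_simulate uncompresses it

# Replay the transition and show the resulting cluster status:
crm_simulate --xml-file pe-warn.last.bz2 --simulate

# Alternatively, uncompress a copy yourself and check it is a complete CIB:
bzip2 -d -k pe-warn.last.bz2
xmllint --noout pe-warn.last       # a truncated/corrupted file will fail to parse
```

If the file was never compressed in the first place, `bzip2` will report "not a bzip2 file", which itself tells you the file is plain (and possibly truncated) XML.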
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
Hi All,

Thanks for the reply.

Recently, I ran the following command:
(clustera) # crm_simulate --xml-file pe-warn.last

It returns the following results:
error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
error: crm_element_value: Couldn't find validate-with in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_element_value: Couldn't find ignore-dtd in NULL
error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
[the validate_with assert above repeats 17 times]
error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
Could not create '/var/lib/pacemaker/cib/shadow.20008': Success

Could anyone help me read those messages and figure out what's going on with my server?

Thanks a lot..

On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot wrote:
> [...]
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> [...]
> How can I find out the root cause of the 100 failures? Which log will
> contain the error message?

As an aside, 1,000,000 is "infinity" to pacemaker. It could mean 1,000,000 actual failures, or a "fatal" failure that causes pacemaker to set the fail count to infinity.

The most recent failure of each resource will be shown in the status display (crm_mon, pcs status, etc.). It will have a basic exit code (which you can use to distinguish a timeout from an error returned by the agent) and, if the agent provided one, an "exit-reason". That's the first place to look.

Failures will remain in the status display, and affect the placement of resources, until one of two things happens: you manually clean up the failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if you configured a failure-timeout for the resource, that much time has passed with no more failures.

For deeper investigation, check the system log (wherever it's kept on your distro). You can use the timestamp from the failure in the status display to know where to look. For even more detail, you can look at pacemaker's detail log (the one you posted excerpts from). This will have additional messages beyond the system log, but they are harder to follow.
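The inspect-then-cleanup workflow described above can be sketched with pcs commands. The resource and node names are taken from this thread; the failure-timeout value is purely illustrative.

```shell
# One-shot status snapshot, including recent failed actions
# (exit code and exit-reason, when the agent provided one):
crm_mon --one-shot

# Show the fail count that is forcing cluster_sid away from clustera:
pcs resource failcount show cluster_sid clustera

# Once the root cause is fixed, clear the failure so clustera is eligible again:
pcs resource cleanup cluster_sid
# (equivalently: crm_resource --cleanup --resource cluster_sid)

# Optionally, expire failures automatically after a quiet period
# (10min is an arbitrary example value):
pcs resource update cluster_sid meta failure-timeout=10min
```

Note that cleaning up only removes the symptom; until the underlying start failure on clustera is fixed, the fail count will climb back to infinity.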
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
Hi Andrei,

Thanks for your quick reply. I still need help, as below:

On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov wrote:
> 06.06.2018 04:27, Albert Weng wrote:
> > Hi All,
> >
> > I have created an active/passive pacemaker cluster on RHEL 7.
> > [...]
> > *clustera pengine: warning: common_apply_stickiness: Forcing
> > cluster_sid away from clustera after 100 failures (max=100)*
> > *=> Question: did too many start attempts result in forbidding the
> > resource to start on clustera?*
>
> Yes.

How can I find out the root cause of the 100 failures? Which log will contain the error message?

> > A couple of days ago, clusterb was fenced (stonith) for an unknown
> > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > successfully, while "cluster_sid" and "cluster_listnr" went to "STOP"
> > status. Per the messages below, is it related to "op start for
> > cluster_sid on clustera..."?
>
> Yes. Node clustera is now marked as being incapable of running the
> resource, so if node clusterb fails, the resource cannot be started
> anywhere.

How could I fix it? I need some hints for troubleshooting.

> > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op
> > start for cluster_sid on clustera: unknown error (1)
> > [...]
[ClusterLabs] pengine always trying to start the resource on the standby node.
Hi All,

I have created an active/passive pacemaker cluster on RHEL 7.

Here is my environment:
clustera : 192.168.11.1 (passive)
clusterb : 192.168.11.2 (master)
clustera-ilo4 : 192.168.11.10
clusterb-ilo4 : 192.168.11.11

cluster resource status:
cluster_fs      started on clusterb
cluster_vip     started on clusterb
cluster_sid     started on clusterb
cluster_listnr  started on clusterb

Both cluster nodes are in online status.

I found my corosync.log contains many records like the below:

clustera pengine: info: determine_online_status_fencing: Node clusterb is active
clustera pengine: info: determine_online_status: Node clusterb is online
clustera pengine: info: determine_online_status_fencing: Node clustera is active
clustera pengine: info: determine_online_status: Node clustera is online

*clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
*=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*

clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
clustera pengine: info: group_print: Resource Group: cluster
clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera

*clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)*
*=> Question: did too many start attempts result in forbidding the resource to start on clustera?*

A couple of days ago, clusterb was fenced (stonith) for an unknown reason, but only "cluster_fs" and "cluster_vip" moved to clustera successfully, while "cluster_sid" and "cluster_listnr" went to "STOP" status.
Per the messages below, is it related to "op start for cluster_sid on clustera..."?

clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
clustera pengine: info: group_print: Resource Group: cluster
clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 100 failures (max=100)
clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
clustera pengine: warning: custom_action: [...]
Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.
06.06.2018 04:27, Albert Weng wrote:
> Hi All,
>
> I have created an active/passive pacemaker cluster on RHEL 7.
> [...]
> *clustera pengine: warning: unpack_rsc_op_failure: Processing
> failed op start for cluster_sid on clustera: unknown error (1)*
> *=> Question: Why is pengine always trying to start cluster_sid on the
> passive node? How do I fix it?*

Pacemaker does not have a concept of a "passive" or "master" node -- it is up to you to decide resource placement when you configure it. By default pacemaker will attempt to spread resources across all eligible nodes. You can influence node selection by using constraints. See https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html for details.

But in any case, all your resources MUST be capable of running on both nodes, otherwise the cluster makes no sense. If one resource A depends on something that another resource B provides, and can be started only together with resource B (and after it is ready), you must tell pacemaker that by using resource colocation and ordering constraints. See the same document for details.

> clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: group_print: Resource Group: cluster
> clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
>
> *clustera pengine: warning: common_apply_stickiness: Forcing
> cluster_sid away from clustera after 100 failures (max=100)*
> *=> Question: did too many start attempts result in forbidding the
> resource to start on clustera?*

Yes.

> A couple of days ago, clusterb was fenced (stonith) for an unknown
> reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> successfully, while "cluster_sid" and "cluster_listnr" went to "STOP"
> status. Per the messages below, is it related to "op start for
> cluster_sid on clustera..."?

Yes. Node clustera is now marked as being incapable of running the resource, so if node clusterb fails, the resource cannot be started anywhere.

> clustera pengine: warning: unpack_rsc_op_failure: Processing failed op
> start for cluster_sid on clustera: unknown error (1)
> [...]
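Constraints like the ones described above could look like the following pcs commands. This is a sketch using the resource and node names from this thread; the location score is an arbitrary illustrative value, and note that the group "cluster" shown in the logs already gives its members implicit colocation and ordering.

```shell
# Prefer running the group on clusterb without pinning it there
# (a finite score lets it still fail over to clustera):
pcs constraint location cluster prefers clusterb=50

# Explicit colocation/ordering between two standalone resources,
# e.g. start the Oracle instance only where, and after, the filesystem is up.
# Resources inside a group such as "cluster" get this behavior implicitly.
pcs constraint colocation add cluster_sid with cluster_fs INFINITY
pcs constraint order cluster_fs then cluster_sid
```

None of this, however, removes the INFINITY fail count on clustera; the start failure itself still has to be diagnosed and cleaned up before cluster_sid can run there.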