On Wed, 2018-06-13 at 17:09 +0800, Albert Weng wrote:
> Hi All,
>
> Thanks for the reply.
>
> Recently, I ran the following command:
>
> (clustera) # crm_simulate --xml-file pe-warn.last
>
> It returned the following results:
>
> error: crm_abort: xpath_search: Triggered assert at xpath.c:153 : xml_top != NULL
> error: crm_element_value: Couldn't find validate-with in NULL
It looks like pe-warn.last somehow got corrupted; it doesn't appear to be a full CIB file. If the original was compressed (.gz/.bz2 extension) and you didn't uncompress it, re-add the extension -- that's how pacemaker knows to uncompress it. (There's a concrete example at the bottom of this mail.)

> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> Configuration validation is currently disabled. It is highly encouraged and prevents many common cluster issues.
> error: crm_element_value: Couldn't find validate-with in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_element_value: Couldn't find ignore-dtd in NULL
> error: crm_abort: crm_element_value: Triggered assert at xml.c:5135 : data != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: validate_with: Triggered assert at schemas.c:522 : xml != NULL
> error: crm_abort: crm_xml_add: Triggered assert at xml.c:2494 : node != NULL
> error: write_xml_stream: Cannot write NULL to /var/lib/pacemaker/cib/shadow.20008
> Could not create '/var/lib/pacemaker/cib/shadow.20008': Success
>
> Could anyone help me understand how to read those messages and what's going on with my server?
>
> Thanks a lot.
>
> On Fri, Jun 8, 2018 at 4:49 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> > On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> > > Hi Andrei,
> > >
> > > Thanks for your quick reply. I still need help, as below:
> > >
> > > On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > > > 06.06.2018 04:27, Albert Weng writes:
> > > > > Hi All,
> > > > >
> > > > > I have created an active/passive pacemaker cluster on RHEL 7.
> > > > >
> > > > > Here is my environment:
> > > > > clustera : 192.168.11.1 (passive)
> > > > > clusterb : 192.168.11.2 (master)
> > > > > clustera-ilo4 : 192.168.11.10
> > > > > clusterb-ilo4 : 192.168.11.11
> > > > >
> > > > > Cluster resource status:
> > > > > cluster_fs started on clusterb
> > > > > cluster_vip started on clusterb
> > > > > cluster_sid started on clusterb
> > > > > cluster_listnr started on clusterb
> > > > >
> > > > > Both cluster nodes are in online status.
> > > > >
> > > > > I found that my corosync.log contains many records like the ones below:
> > > > >
> > > > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > > > clustera pengine: info: determine_online_status: Node clustera is online
> > > > >
> > > > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > > > *=> Question: Why is pengine always trying to start cluster_sid on the passive node? How do I fix it?*
> > > > >
> > > > Pacemaker does not have a concept of a "passive" or "master" node - it is up to you to decide placement when you configure resources. By default pacemaker will attempt to spread resources across all eligible nodes. You can influence node selection by using constraints. See https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html for details.
> > > >
> > > > But in any case - all your resources MUST be capable of running on both nodes, otherwise the cluster makes no sense. If one resource A depends on something that another resource B provides and can be started only together with resource B (and after it is ready), you must tell pacemaker that by using resource colocation and ordering constraints. See the same document for details.
> > > >
> > > > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > > > clustera pengine: info: group_print: Resource Group: cluster
> > > > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> > > > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> > > > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> > > > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> > > > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > > >
> > > > > *clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > > > *=> Question: Does too much retrying result in the resource being forbidden to start on clustera?*
> > > > >
> > > > Yes.
> > >
> > > How can I find out the root cause of the 1000000 failures? Which log will contain the error message?
> >
> > As an aside, 1,000,000 is "infinity" to pacemaker. It could mean 1,000,000 actual failures, or a "fatal" failure that causes pacemaker to set the fail count to infinity.
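In case a concrete command helps here: you can check the current fail count directly, e.g. with pcs (double-check the syntax against the pcs version you have installed):

    # pcs resource failcount show cluster_sid

That should confirm whether clustera really is at INFINITY for cluster_sid; crm_failcount can show the same thing if you prefer the lower-level tools.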
> >
> > The most recent failure of each resource will be shown in the status display (crm_mon, pcs status, etc.). They will have a basic exit code (which you can use to distinguish a timeout from an error received from the agent), and if the agent provided one, an "exit-reason". That's the first place to look.
> >
> > Failures will remain in the status display, and affect the placement of resources, until one of two things happen: you manually clean up the failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if you configured a failure-timeout for the resource, that much time has passed with no more failures.
> >
> > For deeper investigation, check the system log (wherever it's kept on your distro). You can use the timestamp from the failure in the status to know where to look.
> >
> > For even more detail, you can look at pacemaker's detail log (the one you posted excerpts from). This will have additional messages beyond the system log, but they are harder to follow and more intended for developers and advanced troubleshooting.
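Putting rough commands to the two options I mentioned above (syntax from memory -- verify against your installed pcs):

    # pcs resource cleanup cluster_sid                       (clear the failure by hand)
    # crm_resource --cleanup --resource cluster_sid          (equivalent lower-level form)

    # pcs resource meta cluster_sid failure-timeout=10min    (optional automatic expiry)

The 10-minute value is only an example; failure-timeout just makes old failures stop affecting placement after that much quiet time, it doesn't do anything about whatever made the start fail in the first place.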
> > > > >
> > > > > A couple of days ago, clusterb was fenced (STONITH) for an unknown reason, but only "cluster_fs" and "cluster_vip" moved to clustera successfully; "cluster_sid" and "cluster_listnr" went to "STOP" status.
> > > > > As in the messages below - is it related to "op start for cluster_sid on clustera..."?
> > > > >
> > > > Yes. Node clustera is now marked as being incapable of running the resource, so if node clusterb fails, the resource cannot be started anywhere.
> > > > >
> > > > > How could I fix it? I need some hints for troubleshooting.
> > > > >
> > > > > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > > > clustera pengine: info: group_print: Resource Group: cluster
> > > > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > > > clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > > > clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
> > > > > clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
> > > > > clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
> > > > > clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
> > > > > clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
> > > > > clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
> > > > > clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
> > > > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > > > clustera pengine: warning: stage6: Scheduling Node clusterb for STONITH
> > > > > clustera pengine: info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > > > clustera pengine: info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > > > clustera pengine: info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > > > clustera pengine: info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > > > clustera pengine: info: LogActions: Leave ipmi-fence-db01 (Started clustera)
> > > > > clustera pengine: info: LogActions: Leave ipmi-fence-db02 (Started clustera)
> > > > > clustera pengine: notice: LogActions: Move cluster_fs (Started clusterb -> clustera)
> > > > > clustera pengine: notice: LogActions: Move cluster_vip (Started clusterb -> clustera)
> > > > > clustera pengine: notice: LogActions: Stop cluster_sid (clusterb)
> > > > > clustera pengine: notice: LogActions: Stop cluster_listnr (clusterb)
> > > > > clustera pengine: warning: process_pe_message: Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > > clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > > clustera crmd: info: do_te_invoke: Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > > > clustera crmd: notice: te_fence_node: Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > > > >
> > > > > Thanks ~~~~
> >
> > --
> > Ken Gaillot <kgail...@redhat.com>
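To make the pe-warn.last suggestion at the top of this mail concrete: the policy engine inputs are normally saved bzip2-compressed (note the /var/lib/pacemaker/pengine/pe-warn-7.bz2 path in your log), so if your copy was renamed without the extension, something like this should work:

    # file pe-warn.last
    # mv pe-warn.last pe-warn.last.bz2
    # crm_simulate --xml-file pe-warn.last.bz2

The first command is just a sanity check: if it reports bzip2 data, renaming and re-running as above should be enough; if it reports neither XML nor bzip2 data, the copy really is corrupted and you'll want to take a fresh one from /var/lib/pacemaker/pengine/ on the node (or point crm_simulate directly at the original file there).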
--
Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org