On Thu, 2018-06-07 at 08:37 +0800, Albert Weng wrote:
> Hi Andrei,
> 
> Thanks for your quick reply. I still need help, as below:
> 
> On Wed, Jun 6, 2018 at 11:58 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > On 06.06.2018 04:27, Albert Weng wrote:
> > > Hi All,
> > >
> > > I have created an active/passive pacemaker cluster on RHEL 7.
> > >
> > > Here is my environment:
> > > clustera : 192.168.11.1 (passive)
> > > clusterb : 192.168.11.2 (master)
> > > clustera-ilo4 : 192.168.11.10
> > > clusterb-ilo4 : 192.168.11.11
> > >
> > > Cluster resource status:
> > > cluster_fs       started on clusterb
> > > cluster_vip      started on clusterb
> > > cluster_sid      started on clusterb
> > > cluster_listnr   started on clusterb
> > >
> > > Both cluster nodes are online.
> > >
> > > I found my corosync.log contains many records like the ones below:
> > >
> > > clustera pengine: info: determine_online_status_fencing: Node clusterb is active
> > > clustera pengine: info: determine_online_status: Node clusterb is online
> > > clustera pengine: info: determine_online_status_fencing: Node clustera is active
> > > clustera pengine: info: determine_online_status: Node clustera is online
> > >
> > > *clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)*
> > > *=> Question: Why does pengine always try to start cluster_sid on the passive node? How can I fix it?*
> >
> > Pacemaker does not have a concept of a "passive" or "master" node - it
> > is up to you to decide placement when you configure resources. By
> > default, pacemaker will attempt to spread resources across all eligible
> > nodes. You can influence node selection by using constraints. See
> > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
> > for details.
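For illustration only (not from the original messages), a node preference like the one described above might look like this with pcs, using the resource group and node names from this thread; the score of 50 is an arbitrary example:

```shell
# Sketch: prefer clusterb for the "cluster" resource group.
# A finite score (50) expresses a preference but still allows failover;
# INFINITY would pin the group to clusterb and prevent failover entirely.
pcs constraint location cluster prefers clusterb=50
```

A resource group already implies colocation and ordering among its members; explicit `pcs constraint colocation add` / `pcs constraint order` commands are only needed for standalone resources.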
> > But in any case - all your resources MUST be capable of running on
> > both nodes, otherwise the cluster makes no sense. If one resource A
> > depends on something that another resource B provides, and can be
> > started only together with resource B (and after it is ready), you
> > must tell pacemaker by using resource colocation and ordering
> > constraints. See the same document for details.
> >
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > >
> > > *clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)*
> > > *=> Question: did too many attempts result in the resource being forbidden to start on clustera?*
> >
> > Yes.
> 
> How do I find out the root cause of the 1000000 failures? Which log
> will contain the error message?
As an aside, 1,000,000 is "infinity" to pacemaker. It could mean 1,000,000
actual failures, or a "fatal" failure that causes pacemaker to set the fail
count to infinity.

The most recent failure of each resource will be shown in the status display
(crm_mon, pcs status, etc.). It will have a basic exit code (which you can
use to distinguish a timeout from an error returned by the agent), and, if
the agent provided one, an "exit-reason". That's the first place to look.

Failures will remain in the status display, and affect the placement of
resources, until one of two things happens: you manually clean up the
failure (crm_resource --cleanup, pcs resource cleanup, etc.), or, if you
configured a failure-timeout for the resource, that much time has passed
with no more failures.

For deeper investigation, check the system log (wherever it's kept on your
distro). You can use the timestamp of the failure shown in the status to
know where to look. For even more detail, look at pacemaker's detail log
(the one you posted excerpts from). It will have additional messages beyond
the system log, but they are harder to follow and more intended for
developers and advanced troubleshooting.

> > > A couple of days ago, clusterb was fenced (stonith) for an unknown
> > > reason, but only "cluster_fs" and "cluster_vip" moved to clustera
> > > successfully; "cluster_sid" and "cluster_listnr" went to "Stopped"
> > > status. See the messages below - is it related to "op start for
> > > cluster_sid on clustera..."?
> >
> > Yes. Node clustera is now marked as being incapable of running the
> > resource, so if node clusterb fails, the resource cannot be started
> > anywhere.
> 
> How could I fix it? I need some hints for troubleshooting.
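To make that concrete, the commands could look like the sketch below (resource names taken from this thread; the failure-timeout value and log paths are example assumptions - on RHEL 7 the detail log is typically /var/log/cluster/corosync.log):

```shell
# Show current status, including recent failed actions with their
# exit codes and exit-reasons:
crm_mon -1          # or: pcs status

# Clean up the recorded failure so clustera becomes eligible again:
pcs resource cleanup cluster_sid
# equivalently: crm_resource --cleanup --resource cluster_sid

# Optionally let failures expire on their own after 10 minutes
# with no new failures:
pcs resource meta cluster_sid failure-timeout=10min

# Dig into the logs around the failure's timestamp:
journalctl -u pacemaker --since "1 hour ago"     # system log (systemd distros)
grep cluster_sid /var/log/cluster/corosync.log   # pacemaker detail log
```

Note that cleaning up the failure only makes clustera eligible again; if the start keeps failing there, the fail count will go right back to INFINITY until the underlying cause is fixed.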
> > > clustera pengine: warning: unpack_rsc_op_failure: Processing failed op start for cluster_sid on clustera: unknown error (1)
> > > clustera pengine: info: native_print: ipmi-fence-clustera (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: native_print: ipmi-fence-clusterb (stonith:fence_ipmilan): Started clustera
> > > clustera pengine: info: group_print: Resource Group: cluster
> > > clustera pengine: info: native_print: cluster_fs (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_vip (ocf::heartbeat:IPaddr2): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_sid (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> > > clustera pengine: info: native_print: cluster_listnr (ocf::heartbeat:oralsnr): Started clusterb (UNCLEAN)
> > > clustera pengine: info: get_failcount_full: cluster_sid has failed INFINITY times on clustera
> > > clustera pengine: warning: common_apply_stickiness: Forcing cluster_sid away from clustera after 1000000 failures (max=1000000)
> > > clustera pengine: info: rsc_merge_weights: cluster_fs: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_vip: Rolling back scores from cluster_sid
> > > clustera pengine: info: rsc_merge_weights: cluster_sid: Rolling back scores from cluster_listnr
> > > clustera pengine: info: native_color: Resource cluster_sid cannot run anywhere
> > > clustera pengine: info: native_color: Resource cluster_listnr cannot run anywhere
> > > clustera pengine: warning: custom_action: Action cluster_fs_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (20s) for cluster_fs on clustera
> > > clustera pengine: warning: custom_action: Action cluster_vip_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: info: RecurringOp: Start recurring monitor (10s) for cluster_vip on clustera
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_sid_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: custom_action: Action cluster_listnr_stop_0 on clusterb is unrunnable (offline)
> > > clustera pengine: warning: stage6: Scheduling Node clusterb for STONITH
> > > clustera pengine: info: native_stop_constraints: cluster_fs_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_vip_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_sid_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: native_stop_constraints: cluster_listnr_stop_0 is implicit after clusterb is fenced
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db01 (Started clustera)
> > > clustera pengine: info: LogActions: Leave ipmi-fence-db02 (Started clustera)
> > > clustera pengine: notice: LogActions: Move cluster_fs (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Move cluster_vip (Started clusterb -> clustera)
> > > clustera pengine: notice: LogActions: Stop cluster_sid (clusterb)
> > > clustera pengine: notice: LogActions: Stop cluster_listnr (clusterb)
> > > clustera pengine: warning: process_pe_message: Calculated Transition 26821: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > clustera crmd: info: do_te_invoke: Processing graph 26821 (ref=pe_calc-dc-1526868653-26882) derived from /var/lib/pacemaker/pengine/pe-warn-7.bz2
> > > clustera crmd: notice: te_fence_node: Executing reboot fencing operation (23) on clusterb (timeout=60000)
> > >
> > > Thanks ~~~~

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org