>>> Philipp Achmüller <philipp.achmuel...@arz.at> schrieb am 16.07.2018 um 11:44 in Nachricht <ofe07b0df6.7da543a3-onc12582cc.0031fedf-c12582cc.00358...@arz.at>: > hi! > > Thank you for comment. > Unfortunatly it is not obvious for me - the "grep fence" is attached in my > original message.
Hi! OK, seems I missed finding the needle in all the hay... Anyway I think the problem is "Cluster node siteb-1 will be fenced: peer is no longer part of the cluster". Looks as if the cluster noticed an unclean shutdown of node siteb-1. The message "Stonith/shutdown of siteb-1 not matched" seems to confirm that. When shutting down two nodes, did you wait until shutdown of the first node succeeded before shutting down the second? Regards, Ulrich > > i searched older logs with activated dubug information for dlm: - this is > the sequence from syslog (from another timeframe): > > ------------------- > Node: siteb-2 (DC): > ............ > 2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]: notice: State > transition S_IDLE -> S_POLICY_ENGINE > 2018-06-28T09:02:23.279028+02:00 siteb-2 pengine[189259]: notice: > Watchdog will be used via SBD if fencing is required > 2018-06-28T09:02:23.279214+02:00 siteb-2 pengine[189259]: notice: On > loss of CCM Quorum: Ignore > 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]: notice: Move > stonith-sbd#011(Started sitea-1 -> siteb-1) > 2018-06-28T09:02:23.282249+02:00 siteb-2 pengine[189259]: notice: Move > cl-info#011(Started sitea-1 -> siteb-2) > 2018-06-28T09:02:23.282338+02:00 siteb-2 pengine[189259]: notice: Move > k45RG#011(Started sitea-2 -> siteb-1) > 2018-06-28T09:02:23.282422+02:00 siteb-2 pengine[189259]: notice: Stop > dlm:0#011(sitea-1) > 2018-06-28T09:02:23.282505+02:00 siteb-2 pengine[189259]: notice: Stop > clvm:0#011(sitea-1) > 2018-06-28T09:02:23.282588+02:00 siteb-2 pengine[189259]: notice: Stop > vg1:0#011(sitea-1) > 2018-06-28T09:02:23.282670+02:00 siteb-2 pengine[189259]: notice: Stop > dlm:3#011(sitea-2) > 2018-06-28T09:02:23.282752+02:00 siteb-2 pengine[189259]: notice: Stop > clvm:3#011(sitea-2) > 2018-06-28T09:02:23.282833+02:00 siteb-2 pengine[189259]: notice: Stop > vg1:3#011(sitea-2) > 2018-06-28T09:02:23.282916+02:00 siteb-2 pengine[189259]: notice: Stop > sysinfo:0#011(sitea-1) > 2018-06-28T09:02:23.283001+02:00 siteb-2 pengine[189259]: notice: Stop > sysinfo:3#011(sitea-2) > 2018-06-28T09:02:23.283978+02:00 siteb-2 pengine[189259]: notice: > Calculated transition 1056, saving inputs in > /var/lib/pacemaker/pengine/pe-input-2321.bz2 > 2018-06-28T09:02:23.284428+02:00 siteb-2 crmd[189260]: notice: > Processing graph 1056 (ref=pe_calc-dc-1530169343-1339) derived from > /var/lib/pacemaker/pengine/pe-input-2321.bz2 > 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation stonith-sbd_stop_0 on sitea-1 > 2018-06-28T09:02:23.284659+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation cl-info_stop_0 on sitea-1 > 2018-06-28T09:02:23.284742+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation k45RG_stop_0 on sitea-2 > 2018-06-28T09:02:23.284824+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation vg1_stop_0 on sitea-1 > 2018-06-28T09:02:23.284908+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation vg1_stop_0 on sitea-2 > 2018-06-28T09:02:23.284990+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation sysinfo_stop_0 on sitea-1 > 2018-06-28T09:02:23.285072+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation sysinfo_stop_0 on sitea-2 > 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]: notice: > Initiating start operation stonith-sbd_start_0 on siteb-1 > 2018-06-28T09:02:23.298867+02:00 siteb-2 crmd[189260]: notice: > Initiating start operation cl-info_start_0 locally on siteb-2 > 2018-06-28T09:02:23.309272+02:00 siteb-2 lrmd[189257]: notice: executing > - rsc:cl-info action:start call_id:105 > 2018-06-28T09:02:23.384074+02:00 siteb-2 lrmd[189257]: notice: finished > - rsc:cl-info action:start call_id:105 pid:253747 exit-code:0 > exec-time:75ms queue-time:0ms > 2018-06-28T09:02:23.393759+02:00 siteb-2 crmd[189260]: notice: Result of > start operation for cl-info on siteb-2: 0 (ok) > 2018-06-28T09:02:23.395594+02:00 siteb-2 crmd[189260]: notice: > Initiating monitor operation cl-info_monitor_60000 locally on siteb-2 > 2018-06-28T09:02:24.159586+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation clvm_stop_0 on sitea-2 > 2018-06-28T09:02:24.193317+02:00 siteb-2 kernel: [80844.122213] dlm: > clvmd: dlm_recover 5 > 2018-06-28T09:02:24.193349+02:00 siteb-2 kernel: [80844.122240] dlm: > clvmd: dlm_clear_toss 1 done > 2018-06-28T09:02:24.193351+02:00 siteb-2 kernel: [80844.122251] dlm: > clvmd: remove member 3 > 2018-06-28T09:02:24.193352+02:00 siteb-2 kernel: [80844.122579] dlm: > clvmd: dlm_recover_members 3 nodes > 2018-06-28T09:02:24.993269+02:00 siteb-2 kernel: [80844.920751] dlm: > clvmd: generation 7 slots 3 1:2 2:4 3:1 > 2018-06-28T09:02:24.993283+02:00 siteb-2 kernel: [80844.920755] dlm: > clvmd: dlm_recover_directory > 2018-06-28T09:02:25.189265+02:00 siteb-2 kernel: [80845.118173] dlm: > clvmd: dlm_recover_directory 1 in 0 new > 2018-06-28T09:02:25.604661+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation clvm_stop_0 on sitea-1 > 2018-06-28T09:02:26.189277+02:00 siteb-2 kernel: [80846.116630] dlm: > clvmd: dlm_recover_directory 0 out 2 messages > 2018-06-28T09:02:26.189291+02:00 siteb-2 kernel: [80846.116635] dlm: > clvmd: dlm_recover_masters > 2018-06-28T09:02:26.189292+02:00 siteb-2 kernel: [80846.116648] dlm: > clvmd: dlm_recover_masters 0 of 2 > 2018-06-28T09:02:26.189293+02:00 siteb-2 kernel: [80846.116649] dlm: > clvmd: dlm_recover_locks 0 out > 2018-06-28T09:02:26.208948+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation dlm_stop_0 on sitea-2 > 2018-06-28T09:02:26.789266+02:00 siteb-2 kernel: [80846.716532] dlm: > clvmd: dlm_recover_locks 0 in > 2018-06-28T09:02:27.013264+02:00 siteb-2 kernel: [80846.940071] dlm: > clvmd: dlm_recover_done_wait error -4 > 2018-06-28T09:02:27.013276+02:00 siteb-2 kernel: [80846.940076] dlm: > clvmd: dlm_recover 5 error -4 > 2018-06-28T09:02:27.093263+02:00 siteb-2 kernel: [80847.022308] dlm: > clvmd: dlm_recover 7 > 2018-06-28T09:02:27.093273+02:00 siteb-2 kernel: [80847.022372] dlm: > clvmd: prev removed member 3 > 2018-06-28T09:02:27.093274+02:00 siteb-2 kernel: [80847.022375] dlm: > clvmd: remove member 1 > 2018-06-28T09:02:27.493266+02:00 siteb-2 kernel: [80847.420198] dlm: > clvmd: dlm_recover_members 2 nodes > 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]: [TOTEM ] A > processor failed, forming new configuration. > 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]: [TOTEM ] A > new membership (192.168.121.55:2012) was formed. Members left: 2 > 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]: [TOTEM ] > Failed to receive the leave message. failed: 2 > 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.083533+02:00 siteb-2 attrd[189258]: notice: Removing > all siteb-1 attributes for peer loss > 2018-06-28T09:02:52.083617+02:00 siteb-2 attrd[189258]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.083700+02:00 siteb-2 cib[189255]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.083793+02:00 siteb-2 cib[189255]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.083875+02:00 siteb-2 stonith-ng[189256]: notice: > Node siteb-1 state is now lost > 2018-06-28T09:02:52.083968+02:00 siteb-2 stonith-ng[189256]: notice: > Purged 1 peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]: warning: No reason > to expect node 2 to be down > 2018-06-28T09:02:52.084156+02:00 siteb-2 crmd[189260]: notice: > Stonith/shutdown of siteb-1 not matched > 2018-06-28T09:02:52.084242+02:00 siteb-2 crmd[189260]: notice: > Transition aborted: Node failure > 2018-06-28T09:02:52.084327+02:00 siteb-2 crmd[189260]: notice: Action 17 > (stonith-sbd_start_0) was pending on 2 (offline) > 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]: [QUORUM] > Members[3]: 1 3 4 > 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]: [MAIN ] > Completed service synchronization, ready to provide service. > 2018-06-28T09:02:52.084574+02:00 siteb-2 pacemakerd[189252]: notice: > Node siteb-1 state is now lost > 2018-06-28T09:02:52.084667+02:00 siteb-2 crmd[189260]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.084749+02:00 siteb-2 crmd[189260]: warning: No reason > to expect node 2 to be down > 2018-06-28T09:02:52.084831+02:00 siteb-2 crmd[189260]: notice: > Stonith/shutdown of siteb-1 not matched > 2018-06-28T09:02:52.085203+02:00 siteb-2 kernel: [80872.011754] dlm: > clvmd: dlm_recover_members_wait error -4 > 2018-06-28T09:02:52.085209+02:00 siteb-2 kernel: [80872.011757] dlm: > clvmd: dlm_recover 7 error -4 > 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: > closing connection to node 2 > 2018-06-28T09:02:52.086174+02:00 siteb-2 crmd[189260]: notice: > Transition 1056 (Complete=18, Pending=0, Fired=0, Skipped=4, Incomplete=4, > Source=/var/lib/pacemaker/pengine/pe-input-2321.bz2): Stopped > 2018-06-28T09:02:53.095172+02:00 siteb-2 pengine[189259]: notice: > Watchdog will be used via SBD if fencing is required > 2018-06-28T09:02:53.095348+02:00 siteb-2 pengine[189259]: notice: On > loss of CCM Quorum: Ignore > 2018-06-28T09:02:53.095584+02:00 siteb-2 pengine[189259]: warning: > Cluster node siteb-1 will be fenced: peer is no longer part of the cluster > 2018-06-28T09:02:53.095673+02:00 siteb-2 pengine[189259]: warning: Node > siteb-1 is unclean > 2018-06-28T09:02:53.097943+02:00 siteb-2 pengine[189259]: warning: Action > stonith-sbd_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098028+02:00 siteb-2 pengine[189259]: warning: Action > dlm:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098110+02:00 siteb-2 pengine[189259]: warning: Action > dlm:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098191+02:00 siteb-2 pengine[189259]: warning: Action > clvm:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098273+02:00 siteb-2 pengine[189259]: warning: Action > clvm:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098354+02:00 siteb-2 pengine[189259]: warning: Action > vg1:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098436+02:00 siteb-2 pengine[189259]: warning: Action > vg1:2_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098521+02:00 siteb-2 pengine[189259]: warning: Action > sysinfo:1_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098603+02:00 siteb-2 pengine[189259]: warning: Action > sysinfo:1_stop_0 on siteb-1 is unrunnable (offline) > 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]: warning: > Scheduling Node siteb-1 for STONITH > 2018-06-28T09:02:53.098764+02:00 siteb-2 pengine[189259]: notice: * > Fence siteb-1 > 2018-06-28T09:02:53.098846+02:00 siteb-2 pengine[189259]: notice: Move > stonith-sbd#011(Started siteb-1 -> siteb-2) > 2018-06-28T09:02:53.098928+02:00 siteb-2 pengine[189259]: notice: Stop > dlm:0#011(sitea-1) > 2018-06-28T09:02:53.099010+02:00 siteb-2 pengine[189259]: notice: Stop > dlm:2#011(siteb-1) > 2018-06-28T09:02:53.099094+02:00 siteb-2 pengine[189259]: notice: Stop > clvm:2#011(siteb-1) > 2018-06-28T09:02:53.099177+02:00 siteb-2 pengine[189259]: notice: Stop > vg1:2#011(siteb-1) > 2018-06-28T09:02:53.099258+02:00 siteb-2 pengine[189259]: notice: Stop > sysinfo:1#011(siteb-1) > 2018-06-28T09:02:53.099339+02:00 siteb-2 pengine[189259]: notice: Start > k45RG#011(siteb-2) > 2018-06-28T09:02:53.099715+02:00 siteb-2 pengine[189259]: warning: > Calculated transition 1057 (with warnings), saving inputs in > /var/lib/pacemaker/pengine/pe-warn-6.bz2 > 2018-06-28T09:02:53.100282+02:00 siteb-2 crmd[189260]: notice: > Processing graph 1057 (ref=pe_calc-dc-1530169373-1354) derived from > /var/lib/pacemaker/pengine/pe-warn-6.bz2 > 2018-06-28T09:02:53.100407+02:00 siteb-2 crmd[189260]: notice: > Initiating start operation stonith-sbd_start_0 locally on siteb-2 > 2018-06-28T09:02:53.112921+02:00 siteb-2 crmd[189260]: notice: > Requesting fencing (reboot) of node siteb-1 > 2018-06-28T09:02:53.113113+02:00 siteb-2 stonith-ng[189256]: notice: > Client crmd.189260.7d286b46 wants to fence (reboot) 'siteb-1' with device > '(any)' > 2018-06-28T09:02:53.113308+02:00 siteb-2 stonith-ng[189256]: notice: > Merging stonith action reboot for node siteb-1 originating from client > crmd.189260.7b806ca3 with identical request from > stonith-api.171087@sitea-1.9fe08723 (72s) > 2018-06-28T09:03:29.097551+02:00 siteb-2 stonith-ng[189256]: notice: > Operation reboot of siteb-1 by sitea-2 for > stonith-api.171087@sitea-1.9fe08723: OK > 2018-06-28T09:03:29.097669+02:00 siteb-2 stonith-ng[189256]: notice: > Operation reboot of siteb-1 by sitea-2 for crmd.189260@siteb-2.7b806ca3: > OK > 2018-06-28T09:03:29.097756+02:00 siteb-2 crmd[189260]: notice: Peer > siteb-1 was terminated (reboot) by sitea-2 on behalf of > stonith-api.171087: OK > 2018-06-28T09:03:29.097863+02:00 siteb-2 crmd[189260]: notice: > Transition aborted: External Fencing Operation > 2018-06-28T09:03:29.097948+02:00 siteb-2 crmd[189260]: notice: Stonith > operation 2/51:1057:0:1e551e09-fe55-4e9f-9447-a5199f92bf8c: OK (0) > 2018-06-28T09:03:29.098036+02:00 siteb-2 crmd[189260]: notice: Peer > siteb-1 was terminated (reboot) by sitea-2 on behalf of crmd.189260: OK > 2018-06-28T09:03:29.098196+02:00 siteb-2 crmd[189260]: notice: > Transition 1057 (Complete=14, Pending=0, Fired=0, Skipped=2, Incomplete=6, > Source=/var/lib/pacemaker/pengine/pe-warn-6.bz2): Stopped > 2018-06-28T09:03:30.100556+02:00 siteb-2 dlm_controld[205418]: 80917 fence > status 2 receive 0 from 1 walltime 1530169410 local 80917 > 2018-06-28T09:03:30.101210+02:00 siteb-2 kernel: [80910.025422] dlm: > clvmd: dlm_recover 9 > 2018-06-28T09:03:30.101216+02:00 siteb-2 kernel: [80910.025444] dlm: > clvmd: prev removed member 1 > 2018-06-28T09:03:30.101216+02:00 siteb-2 kernel: [80910.025446] dlm: > clvmd: prev removed member 3 > 2018-06-28T09:03:30.101217+02:00 siteb-2 kernel: [80910.025446] dlm: > clvmd: remove member 2 > 2018-06-28T09:03:30.101217+02:00 siteb-2 kernel: [80910.025448] dlm: > clvmd: dlm_recover_members 1 nodes > 2018-06-28T09:03:30.101219+02:00 siteb-2 kernel: [80910.025450] dlm: > clvmd: generation 8 slots 1 2:4 > 2018-06-28T09:03:30.101222+02:00 siteb-2 kernel: [80910.025452] dlm: > clvmd: dlm_recover_directory > 2018-06-28T09:03:30.101223+02:00 siteb-2 kernel: [80910.025452] dlm: > clvmd: dlm_recover_directory 0 in 0 new > 2018-06-28T09:03:30.101223+02:00 siteb-2 kernel: [80910.025454] dlm: > clvmd: dlm_recover_directory 0 out 0 messages > 2018-06-28T09:03:30.101223+02:00 siteb-2 kernel: [80910.025455] dlm: > clvmd: dlm_recover_masters > 2018-06-28T09:03:30.101224+02:00 siteb-2 kernel: [80910.025456] dlm: > clvmd: dlm_recover_masters 2 of 2 > 2018-06-28T09:03:30.101224+02:00 siteb-2 kernel: [80910.025457] dlm: > clvmd: dlm_recover_locks 0 out > 2018-06-28T09:03:30.101226+02:00 siteb-2 kernel: [80910.025458] dlm: > clvmd: dlm_recover_locks 0 in > 2018-06-28T09:03:30.101227+02:00 siteb-2 kernel: [80910.025459] dlm: > clvmd: dlm_recover_rsbs 2 done > 2018-06-28T09:03:30.101227+02:00 siteb-2 kernel: [80910.025470] dlm: > clvmd: dlm_recover 9 generation 8 done: 0 ms > 2018-06-28T09:03:30.103542+02:00 siteb-2 pengine[189259]: notice: > Watchdog will be used via SBD if fencing is required > 2018-06-28T09:03:30.103660+02:00 siteb-2 pengine[189259]: notice: On > loss of CCM Quorum: Ignore > 2018-06-28T09:03:30.106439+02:00 siteb-2 pengine[189259]: notice: Stop > dlm:0#011(sitea-1) > 2018-06-28T09:03:30.106528+02:00 siteb-2 pengine[189259]: notice: Start > k45RG#011(siteb-2) > 2018-06-28T09:03:30.107002+02:00 siteb-2 pengine[189259]: notice: > Calculated transition 1058, saving inputs in > /var/lib/pacemaker/pengine/pe-input-2322.bz2 > 2018-06-28T09:03:30.107292+02:00 siteb-2 crmd[189260]: notice: > Processing graph 1058 (ref=pe_calc-dc-1530169410-1356) derived from > /var/lib/pacemaker/pengine/pe-input-2322.bz2 > 2018-06-28T09:03:30.107384+02:00 siteb-2 crmd[189260]: notice: > Initiating start operation k45RG_start_0 locally on siteb-2 > 2018-06-28T09:03:30.117125+02:00 siteb-2 lrmd[189257]: notice: executing > - rsc:k45RG action:start call_id:108 > 2018-06-28T09:03:30.117427+02:00 siteb-2 crmd[189260]: notice: > Initiating stop operation dlm_stop_0 on sitea-1 > 2018-06-28T09:03:44.279235+02:00 siteb-2 lrmd[189257]: notice: finished > - rsc:k45RG action:start call_id:108 pid:254501 exit-code:0 > exec-time:14162ms queue-time:0ms > 2018-06-28T09:03:44.287804+02:00 siteb-2 crmd[189260]: notice: Result of > start operation for k45RG on siteb-2: 0 (ok) > 2018-06-28T09:03:44.289761+02:00 siteb-2 crmd[189260]: notice: > Initiating monitor operation k45RG_monitor_300000 locally on siteb-2 > 2018-06-28T09:03:45.952776+02:00 siteb-2 crmd[189260]: notice: > Transition 1058 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-2322.bz2): Complete > 2018-06-28T09:03:45.952976+02:00 siteb-2 crmd[189260]: notice: State > transition S_TRANSITION_ENGINE -> S_IDLE > ------------------- > Node sitea-1: > ..... > 2018-06-28T09:02:23.504692+02:00 sitea-1 LVM(vg1)[170755]: INFO: > Deactivating volume group clustervg > 2018-06-28T09:02:23.937007+02:00 sitea-1 multipathd[3567]: dm-12: remove > map (uevent) > 2018-06-28T09:02:23.937371+02:00 sitea-1 multipathd[3567]: dm-12: devmap > not registered, can't remove > 2018-06-28T09:02:24.193367+02:00 sitea-1 kernel: [59480.678902] dlm: > clvmd: dlm_recover 13 > 2018-06-28T09:02:24.193379+02:00 sitea-1 kernel: [59480.678938] dlm: > clvmd: remove member 3 > 2018-06-28T09:02:24.393411+02:00 sitea-1 kernel: [59480.878296] dlm: > clvmd: dlm_recover_members 3 nodes > 2018-06-28T09:02:24.593412+02:00 sitea-1 kernel: [59481.078227] dlm: > clvmd: generation 7 slots 3 1:2 2:4 3:1 > 2018-06-28T09:02:24.593423+02:00 sitea-1 kernel: [59481.078232] dlm: > clvmd: dlm_recover_directory > 2018-06-28T09:02:24.793410+02:00 sitea-1 kernel: [59481.278263] dlm: > clvmd: dlm_recover_directory 0 in 0 new > 2018-06-28T09:02:25.017465+02:00 sitea-1 multipathd[3567]: dm-12: remove > map (uevent) > 2018-06-28T09:02:25.017639+02:00 sitea-1 multipathd[3567]: dm-13: remove > map (uevent) > 2018-06-28T09:02:25.017724+02:00 sitea-1 multipathd[3567]: dm-13: devmap > not registered, can't remove > 2018-06-28T09:02:25.017801+02:00 sitea-1 multipathd[3567]: dm-13: remove > map (uevent) > 2018-06-28T09:02:25.589352+02:00 sitea-1 kernel: [59482.074179] dlm: > clvmd: dlm_recover_directory 1 out 2 messages > 2018-06-28T09:02:25.589357+02:00 sitea-1 kernel: [59482.074192] dlm: > clvmd: dlm_recover_masters > 2018-06-28T09:02:25.589357+02:00 sitea-1 kernel: [59482.074201] dlm: > clvmd: dlm_recover_masters 1 of 3 > 2018-06-28T09:02:25.591207+02:00 sitea-1 LVM(vg1)[170755]: INFO: 0 > logical volume(s) in volume group "clustervg" now active > 2018-06-28T09:02:25.596699+02:00 sitea-1 LVM(vg1)[170755]: INFO: LVM > Volume clustervg is not available (stopped) > 2018-06-28T09:02:25.600665+02:00 sitea-1 lrmd[9025]: notice: finished - > rsc:vg1 action:stop call_id:80 pid:170755 exit-code:0 exec-time:2315ms > queue-time:0ms > 2018-06-28T09:02:25.600897+02:00 sitea-1 crmd[9031]: notice: Result of > stop operation for vg1 on sitea-1: 0 (ok) > 2018-06-28T09:02:25.604928+02:00 sitea-1 lrmd[9025]: notice: executing - > rsc:clvm action:stop call_id:83 > 2018-06-28T09:02:25.617437+02:00 sitea-1 clvmd(clvm)[171004]: INFO: > Stopping clvm > 2018-06-28T09:02:25.626327+02:00 sitea-1 clvmd(clvm)[171004]: INFO: > Stopping clvmd > 2018-06-28T09:02:25.745414+02:00 sitea-1 kernel: [59482.230710] dlm: > clvmd: dlm_recover_locks 0 out > 2018-06-28T09:02:26.389417+02:00 sitea-1 kernel: [59482.874050] dlm: > clvmd: dlm_recover_locks 0 in > 2018-06-28T09:02:26.389429+02:00 sitea-1 kernel: [59482.874056] dlm: > clvmd: dlm_recover_rsbs 3 done > 2018-06-28T09:02:26.988133+02:00 sitea-1 kernel: [59483.474046] dlm: > clvmd: dlm_recover 13 generation 7 done: 2396 ms > 2018-06-28T09:02:26.988142+02:00 sitea-1 kernel: [59483.474240] dlm: > clvmd: leaving the lockspace group... > 2018-06-28T09:02:26.988142+02:00 sitea-1 kernel: [59483.475630] dlm: > clvmd: group event done 0 0 > 2018-06-28T09:02:26.988143+02:00 sitea-1 kernel: [59483.476068] dlm: > clvmd: release_lockspace final free > 2018-06-28T09:02:27.652248+02:00 sitea-1 clvmd(clvm)[171004]: INFO: > Stopping cmirrord > 2018-06-28T09:02:28.666056+02:00 sitea-1 lrmd[9025]: notice: finished - > rsc:clvm action:stop call_id:83 pid:171004 exit-code:0 exec-time:3061ms > queue-time:0ms > 2018-06-28T09:02:28.666439+02:00 sitea-1 crmd[9031]: notice: Result of > stop operation for clvm on sitea-1: 0 (ok) > 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]: [TOTEM ] A > processor failed, forming new configuration. > 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]: [TOTEM ] A new > membership (192.168.121.55:2012) was formed. Members left: 2 > 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]: [TOTEM ] Failed > to receive the leave message. failed: 2 > 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]: notice: Node siteb-1 > state is now lost > 2018-06-28T09:02:52.082790+02:00 sitea-1 cib[9021]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.082895+02:00 sitea-1 attrd[9026]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.082993+02:00 sitea-1 attrd[9026]: notice: Removing > all siteb-1 attributes for peer loss > 2018-06-28T09:02:52.083075+02:00 sitea-1 attrd[9026]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.083156+02:00 sitea-1 stonith-ng[9022]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.083253+02:00 sitea-1 stonith-ng[9022]: notice: > Purged 1 peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.083335+02:00 sitea-1 pacemakerd[8546]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]: [QUORUM] > Members[3]: 1 3 4 > 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]: [MAIN ] > Completed service synchronization, ready to provide service. > 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence > request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith > 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: > closing connection to node 2 > 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: > Found 0 entries for 2/(null): 0 in progress, 0 completed > 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]: notice: > Client stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device > '(any)' > 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]: notice: > Requesting peer fencing (reboot) of siteb-1 > 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]: notice: > Operation reboot of siteb-1 by sitea-2 for > stonith-api.171087@sitea-1.9fe08723: OK > 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: > stonith_api_kick: Node 2/(null) kicked: reboot > 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]: notice: Peer > lnx0361b was terminated (reboot) by sitea-2 on behalf of > stonith-api.171087: OK > 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence > result 2 pid 171087 result 0 exit status > 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence > status 2 receive 0 from 1 walltime 1530169410 local 59552 > 2018-06-28T09:03:30.118114+02:00 sitea-1 lrmd[9025]: notice: executing - > rsc:dlm action:stop call_id:85 > 2018-06-28T09:03:30.149362+02:00 sitea-1 kernel: [59546.630487] dlm: > closing connection to node 4 > 2018-06-28T09:03:30.149370+02:00 sitea-1 kernel: [59546.630496] dlm: > closing connection to node 3 > 2018-06-28T09:03:30.149371+02:00 sitea-1 kernel: [59546.630501] dlm: > closing connection to node 1 > 2018-06-28T09:03:31.168356+02:00 sitea-1 lrmd[9025]: notice: finished - > rsc:dlm action:stop call_id:85 pid:171235 exit-code:0 exec-time:1050ms > queue-time:0ms > 2018-06-28T09:03:31.168768+02:00 sitea-1 crmd[9031]: notice: Result of > stop operation for dlm on lnx0361a: 0 (ok) > > ------------------- > Node sitea-2: > ... > 2018-06-28T09:02:23.285402+02:00 sitea-2 lrmd[9657]: notice: executing - > rsc:vg1 action:stop call_id:78 > 2018-06-28T09:02:23.286028+02:00 sitea-2 lrmd[9657]: notice: executing - > rsc:sysinfo action:stop call_id:80 > 2018-06-28T09:02:23.293918+02:00 sitea-2 lrmd[9657]: notice: finished - > rsc:sysinfo action:stop call_id:80 pid:228481 exit-code:0 exec-time:8ms > queue-time:0ms > 2018-06-28T09:02:23.294070+02:00 sitea-2 crmd[9660]: notice: Result of > stop operation for sysinfo on sitea-2: 0 (ok) > 2018-06-28T09:02:23.448868+02:00 sitea-2 LVM(vg1)[228480]: INFO: > Deactivating volume group clustervg > 2018-06-28T09:02:23.494608+02:00 sitea-2 multipathd[3459]: dm-12: remove > map (uevent) > 2018-06-28T09:02:23.494774+02:00 sitea-2 multipathd[3459]: dm-12: devmap > not registered, can't remove > 2018-06-28T09:02:24.147057+02:00 sitea-2 LVM(vg1)[228480]: INFO: 0 > logical volume(s) in volume group "clustervg" now active > 2018-06-28T09:02:24.152209+02:00 sitea-2 LVM(vg1)[228480]: INFO: LVM > Volume clustervg is not available (stopped) > 2018-06-28T09:02:24.155824+02:00 sitea-2 lrmd[9657]: notice: finished - > rsc:vg1 action:stop call_id:78 pid:228480 exit-code:0 exec-time:871ms > queue-time:0ms > 2018-06-28T09:02:24.156133+02:00 sitea-2 crmd[9660]: notice: Result of > stop operation for vg1 on sitea-2: 0 (ok) > 2018-06-28T09:02:24.159361+02:00 sitea-2 lrmd[9657]: notice: executing - > rsc:clvm action:stop call_id:81 > 2018-06-28T09:02:24.169269+02:00 sitea-2 clvmd(clvm)[229478]: INFO: > Stopping clvm > 2018-06-28T09:02:24.177998+02:00 sitea-2 clvmd(clvm)[229478]: INFO: > Stopping clvmd > 2018-06-28T09:02:24.187070+02:00 sitea-2 kernel: [59548.133661] dlm: > clvmd: leaving the lockspace group... > 2018-06-28T09:02:24.187078+02:00 sitea-2 kernel: [59548.134502] dlm: > clvmd: group event done 0 0 > 2018-06-28T09:02:24.187078+02:00 sitea-2 kernel: [59548.134863] dlm: > clvmd: release_lockspace final free > 2018-06-28T09:02:24.707189+02:00 sitea-2 multipathd[3459]: dm-12: remove > map (uevent) > 2018-06-28T09:02:24.707584+02:00 sitea-2 multipathd[3459]: dm-13: remove > map (uevent) > 2018-06-28T09:02:24.707672+02:00 sitea-2 multipathd[3459]: dm-13: devmap > not registered, can't remove > 2018-06-28T09:02:24.707754+02:00 sitea-2 multipathd[3459]: dm-13: remove > map (uevent) > 2018-06-28T09:02:25.193751+02:00 sitea-2 clvmd(clvm)[229478]: INFO: > Stopping cmirrord > 2018-06-28T09:02:26.204708+02:00 sitea-2 lrmd[9657]: notice: finished - > rsc:clvm action:stop call_id:81 pid:229478 exit-code:0 exec-time:2046ms > queue-time:0ms > 2018-06-28T09:02:26.204942+02:00 sitea-2 crmd[9660]: notice: Result of > stop operation for clvm on sitea-2: 0 (ok) > 2018-06-28T09:02:26.208588+02:00 sitea-2 lrmd[9657]: notice: executing - > rsc:dlm action:stop call_id:83 > 2018-06-28T09:02:26.236329+02:00 sitea-2 kernel: [59550.182284] dlm: > closing connection to node 4 > 2018-06-28T09:02:26.236336+02:00 sitea-2 kernel: [59550.182293] dlm: > closing connection to node 3 > 2018-06-28T09:02:26.236337+02:00 sitea-2 kernel: [59550.182299] dlm: > closing connection to node 2 > 2018-06-28T09:02:26.236337+02:00 sitea-2 kernel: [59550.182306] dlm: > closing connection to node 1 > 2018-06-28T09:02:27.248204+02:00 sitea-2 lrmd[9657]: notice: finished - > rsc:dlm action:stop call_id:83 pid:230674 exit-code:0 exec-time:1039ms > queue-time:0ms > 2018-06-28T09:02:27.248462+02:00 sitea-2 crmd[9660]: notice: Result of > stop operation for dlm on sitea-2: 0 (ok) > 2018-06-28T09:02:30.454215+02:00 sitea-2 lrmd[9657]: notice: finished - > rsc:k45RG action:stop call_id:76 pid:228479 exit-code:0 exec-time:7169ms > queue-time:0ms > 2018-06-28T09:02:30.454595+02:00 sitea-2 crmd[9660]: notice: Result of > stop operation for k45RG on sitea-2: 0 (ok) > 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]: [TOTEM ] A > processor failed, forming new configuration. > 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]: [TOTEM ] A new > membership (192.168.121.55:2012) was formed. Members left: 2 > 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]: [TOTEM ] Failed > to receive the leave message. failed: 2 > 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]: notice: Node siteb-1 > state is now lost > 2018-06-28T09:02:52.082048+02:00 sitea-2 attrd[9658]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.082138+02:00 sitea-2 cib[9655]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.082220+02:00 sitea-2 attrd[9658]: notice: Removing > all siteb-1 attributes for peer loss > 2018-06-28T09:02:52.082300+02:00 sitea-2 stonith-ng[9656]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.082391+02:00 sitea-2 stonith-ng[9656]: notice: > Purged 1 peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.082470+02:00 sitea-2 attrd[9658]: notice: Purged 1 > peer with id=2 and/or uname=siteb-1 from the membership cache > 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]: [QUORUM] > Members[3]: 1 3 4 > 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]: [MAIN ] > Completed service synchronization, ready to provide service. > 2018-06-28T09:02:52.082819+02:00 sitea-2 pacemakerd[9653]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.082908+02:00 sitea-2 crmd[9660]: notice: Node > siteb-1 state is now lost > 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]: notice: > stonith-sbd can fence (reboot) siteb-1: dynamic-list > 2018-06-28T09:02:52.495902+02:00 sitea-2 stonith-ng[9656]: notice: > Delaying reboot on stonith-sbd for 25358ms (timeout=300s) > 2018-06-28T09:03:01.357937+02:00 sitea-2 cron[231205]: > pam_unix(crond:session): session opened for user root by (uid=0) > 2018-06-28T09:03:29.093957+02:00 sitea-2 stonith-ng[9656]: notice: > Operation 'reboot' [231293] (call 2 from stonith-api.171087) for host > 'siteb-1' with device 'stonith-sbd' returned: 0 (OK) > 2018-06-28T09:03:29.096254+02:00 sitea-2 stonith-ng[9656]: notice: > Operation reboot of siteb-1 by sitea-2 for > stonith-api.171087@sitea-1.9fe08723: OK > 2018-06-28T09:03:29.096769+02:00 sitea-2 crmd[9660]: notice: Peer > siteb-1 was terminated (reboot) by sitea-2 on behalf of > stonith-api.171087: OK > ------------------- > Node sideb-1 has no entries during this timeframe > > during standby corosync should be up/running - so may the "Failed to > receive the leave message" will be a Problem? > > regards > Philipp > >> Von: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> >> An: <users@clusterlabs.org> >> Datum: 16.07.2018 08:33 >> Betreff: [ClusterLabs] Antw: corosync/dlm fencing? >> Gesendet von: "Users" <users-boun...@clusterlabs.org> >> >> Hi! >> >> I don't run SLES 12, but in SLES 11 some RAs tried to unload kernel > modules >> during stop, and when you had installed kernel updates before, the > operation >> failed, and the cluster paniced (i.e.: started to fence the node). >> In your case >> it should be obvious from the logs why the cluster wants to fence nodes. > Maybe >> "grep -i fence /var/log/messages" (or in pacemaker's log) >> >> Regards, >> Ulrich >> >> >>> Philipp Achmüller <philipp.achmuel...@arz.at> schrieb am 15.07.2018 > um >> 15:39 in >> Nachricht > <of678c848f.04d4334f-onc12582cb.00473374-c12582cb.004af...@arz.at>: >> > hi! >> > >> > i have a 4 node cluster running on SLES12 SP3 >> > ‑ pacemaker‑1.1.16‑4.8.x86_64 >> > ‑ corosync‑2.3.6‑9.5.1.x86_64 >> > >> > following configuration: >> > >> > Stack: corosync >> > Current DC: sitea‑2 (version 1.1.16‑4.8‑77ea74d) ‑ partition with > quorum >> > Last updated: Sun Jul 15 15:00:55 2018 >> > Last change: Sat Jul 14 18:54:50 2018 by root via crm_resource on > sitea‑1 >> > >> > 4 nodes configured >> > 23 resources configured >> > >> > Node sitea‑1: online >> > 1 (ocf::pacemaker:controld): Active >> > 1 (ocf::lvm2:clvmd): Active >> > 1 (ocf::pacemaker:SysInfo): Active >> > 5 (ocf::heartbeat:VirtualDomain): Active >> > 1 (ocf::heartbeat:LVM): Active >> > Node siteb‑1: online >> > 1 (ocf::pacemaker:controld): Active >> > 1 (ocf::lvm2:clvmd): Active >> > 1 (ocf::pacemaker:SysInfo): Active >> > 1 (ocf::heartbeat:VirtualDomain): Active >> > 1 (ocf::heartbeat:LVM): Active >> > Node sitea‑2: online >> > 1 (ocf::pacemaker:controld): Active >> > 1 (ocf::lvm2:clvmd): Active >> > 1 (ocf::pacemaker:SysInfo): Active >> > 3 (ocf::heartbeat:VirtualDomain): Active >> > 1 (ocf::heartbeat:LVM): Active >> > Node siteb‑2: online >> > 1 (ocf::pacemaker:ClusterMon): Active >> > 3 (ocf::heartbeat:VirtualDomain): Active >> > 1 (ocf::pacemaker:SysInfo): Active >> > 1 (stonith:external/sbd): Active >> > 1 (ocf::lvm2:clvmd): Active >> > 1 (ocf::heartbeat:LVM): Active >> > 1 (ocf::pacemaker:controld): Active >> > ‑‑‑‑ >> > and these ordering: >> > ... >> > group base‑group dlm clvm vg1 >> > clone base‑clone base‑group \ >> > meta interleave=true target‑role=Started ordered=true >> > colocation colocation‑VM‑base‑clone‑INFINITY inf: VM base‑clone >> > order order‑base‑clone‑VM‑mandatory base‑clone:start VM:start >> > ... >> > >> > for maintenance i would like to standby 1 or 2 nodes from "sitea" so > that >> > every Resources move off from these 2 images. >> > everything works fine until dlm stops as last resource on these nodes, > >> > then dlm_controld send fence_request ‑ sometimes to the remaining > online >> > nodes, so there is online 1 node left in cluster.... >> > >> > messages: >> > >> > .... >> > 2018‑07‑14T14:38:56.441157+02:00 siteb‑1 dlm_controld[39725]: 678 > fence >> > request 3 pid 54428 startup time 1531571371 fence_all dlm_stonith >> > 2018‑07‑14T14:38:56.445284+02:00 siteb‑1 dlm_stonith: > stonith_api_time: >> > Found 0 entries for 3/(null): 0 in progress, 0 completed >> > 2018‑07‑14T14:38:56.446033+02:00 siteb‑1 stonith‑ng[8085]: notice: > >> > Client stonith‑api.54428.ee6a7e02 wants to fence (reboot) '3' with > device >> > '(any)' >> > 2018‑07‑14T14:38:56.446294+02:00 siteb‑1 stonith‑ng[8085]: notice: > >> > Requesting peer fencing (reboot) of sitea‑2 >> > ... >> > >> > # dlm_tool dump_config >> > daemon_debug=0 >> > foreground=0 >> > log_debug=0 >> > timewarn=0 >> > protocol=detect >> > debug_logfile=0 >> > enable_fscontrol=0 >> > enable_plock=1 >> > plock_debug=0 >> > plock_rate_limit=0 >> > plock_ownership=0 >> > drop_resources_time=10000 >> > drop_resources_count=10 >> > drop_resources_age=10000 >> > post_join_delay=30 >> > enable_fencing=1 >> > enable_concurrent_fencing=0 >> > enable_startup_fencing=0 >> > repeat_failed_fencing=1 >> > enable_quorum_fencing=1 >> > enable_quorum_lockspace=1 >> > help=‑1 >> > version=‑1 >> > >> > how to find out what is happening here? >> >> >> >> _______________________________________________ >> Users mailing list: Users@clusterlabs.org >> https://lists.clusterlabs.org/mailman/listinfo/users >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org _______________________________________________ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org