[ClusterLabs] FYI: regression using 2.0.0 / 1.1.19 Pacemaker Remote node with older cluster nodes

2018-07-16 Thread Ken Gaillot
Hi all,

The just-released Pacemaker 2.0.0 and 1.1.19 releases have an issue
when a Pacemaker Remote node is upgraded before the cluster nodes.

Pacemaker 2.0.0 contains a fix (also backported to 1.1.19) for the
longstanding issue of "crm_node -n" getting the wrong name when run on
the command line of a Pacemaker Remote node whose node name is
different from its local hostname.

However, with the fix in place, resource agents running on a Pacemaker
Remote node can hang if the cluster nodes are still running a version
older than 2.0.0 / 1.1.19.
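
For illustration only, the affected call is the usual node-name lookup that
agents and scripts perform; a minimal sketch (the wrapper script itself is
hypothetical):

  #!/bin/sh
  # Hypothetical snippet: ask Pacemaker for this node's name instead of relying
  # on `uname -n`, since a Pacemaker Remote node's name may differ from its
  # hostname. Against cluster nodes older than 1.1.19 / 2.0.0, a query like
  # this is what can hang on an upgraded remote node.
  NODE_NAME="$(crm_node -n)"
  echo "Pacemaker node name: ${NODE_NAME}"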

The only workaround is to upgrade all cluster nodes before upgrading
any Pacemaker Remote nodes (which is the recommended practice anyway).
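
A quick way to confirm the cluster nodes are already on the new version before
touching any remote node might look like this (node names are placeholders):

  # Check the Pacemaker version on every cluster node first:
  for n in node1 node2 node3; do
      printf '%s: ' "$n"
      ssh -n "$n" pacemakerd --version
  done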
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Antwort: Antw: Antw: Antwort: Antw: corosync/dlm fencing?

2018-07-16 Thread Ulrich Windl
>>> Philipp Achmüller wrote on 16.07.2018 at 14:09 in message :
> Hi,
> 
>> From: "Ulrich Windl" 
>> To: 
>> Date: 16.07.2018 13:46
>> Subject: [ClusterLabs] Antw: Antw: Antwort: Antw: corosync/dlm fencing?
>> Sent by: "Users" 
>> 
>> Hi again!
>> 
>> Oh, I missed "...maintenance i would like to standby 1 or 2 nodes from
>> "sitea""...
>> 
> 
> Yes - sorry for the tons of logs - but I think they capture the whole
> situation... 
> 
>> I think some time ago I had asked about the same thing for SLES11, and the
>> answer was that with some configurations a standby is not possible (I thought
>> it was related to OCFS2, but maybe it was cLVM or DLM). Quorum issues aside,
>> why not shut down the node completely?
> 
> I also tried "systemctl stop pacemaker" and a "normal" system shutdown ->
> same outcome: dlm always ends up fencing some other node...

Hmm... I looked up my last node shutdown on the three-node cluster. The
essential messages look like this (many messages left out, sometimes indicated
by "(more)"):

attrd[13169]:   notice: attrd_trigger_update: Sending flush op to all hosts
for: shutdown (0)
pengine[13170]:   notice: stage6: Scheduling Node h10 for shutdown
pengine[13170]:   notice: LogActions: Stop    prm_DLM:1         (h10)
pengine[13170]:   notice: LogActions: Stop    prm_cLVMd:1       (h10)
pengine[13170]:   notice: LogActions: Stop    prm_LVM_CFS_VMs:1 (h10)
(more)
crmd[13171]:   notice: te_rsc_command: Initiating action 97: stop
prm_CFS_VMs_fs_stop_0 on h10
cluster-dlm[13901]: log_config: dlm:ls:490B9FCAFA3D4B2F9A586A5893E00730 conf 2
0 1 memb 739512321 739512325 join left 739512330
cluster-dlm[13901]: add_change: 490B9FCAFA3D4B2F9A586A5893E00730 add_change cg
8 remove nodeid 739512330 reason 2
(more)
cluster-dlm[13901]: receive_plocks_stored: 490B9FCAFA3D4B2F9A586A5893E00730
receive_plocks_stored 739512321:8 flags a sig 0 need_plocks 0
ocfs2_controld[14389]: confchg called
ocfs2_controld[14389]: node daemon left 739512330
cluster-dlm[13901]: log_config: dlm:ls:clvmd conf 2 0 1 memb 739512321
739512325 join left 739512330
cluster-dlm[13901]: stop_kernel: clvmd stop_kernel cg 8
cluster-dlm[13901]: log_config: dlm:controld conf 2 0 1 memb 739512321
739512325 join left 739512330
 crmd[13171]:   notice: run_graph: Transition 10 (Complete=41, Pending=0,
Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-26.bz2): Complete
cib[13166]:   notice: crm_update_peer_state: cib_peer_update_callback: Node
h10[739512330] - state is now lost (was member)
corosync[13162]:  [CLM   ] Members Left:
corosync[13162]:  [CLM   ] r(0) ip(172.20.16.10) r(1) ip(10.2.2.10)
stonith-ng[13167]:   notice: crm_update_peer_state: st_peer_update_callback:
Node h10[739512330] - state is now lost (was member)
cluster-dlm[13901]: dlm_process_node: Removed inactive node 739512330:
born-on=3804, last-seen=3804, this-event=3808, last-event=3804
kernel: [349683.507286] dlm: closing connection to node 739512330
crmd[13171]:   notice: peer_update_callback: do_shutdown of h10 (op 190) is
complete
corosync[13162]:  [MAIN  ] Completed service synchronization, ready to provide
service.
crmd[13171]:   notice: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
origin=notify_crmd ]

Another random thought: how do you collect your logs? The node being fenced
may not get its last few log messages written to disk (at least with SLES11
syslog). It is sometimes a good idea to "tail -f" the syslogs on every node, so
that when a node goes down you still see what would have ended up in its
syslog. Chances are even better with a remote syslog server reached via DGRAM
(UDP) sockets...
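
Roughly like this, for example - the extra node names and the log host below
are only placeholders:

  # Follow the syslog of every node live from one place (h10 is from the logs
  # above; h05 and h06 stand in for the other two nodes):
  for n in h05 h06 h10; do
      ssh -n "root@$n" tail -F /var/log/messages | sed "s/^/$n: /" &
  done
  wait

  # Or forward everything to a remote syslog host over UDP datagram sockets,
  # e.g. with a line like this in /etc/rsyslog.d/remote.conf on every node
  # (loghost.example.com is an assumption):
  #   *.* @loghost.example.com:514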

Regards,
Ulrich

> 
>> 
>> Regards,
>> Ulrich
>> 
>> >>> "Ulrich Windl"  schrieb am 
> 16.07.2018
>> um
>> 13:35 in Nachricht <5b4c82f002a10002c...@gwsmtp1.uni-regensburg.de>:
>> > >>> Philipp Achmüller wrote on 16.07.2018 at 11:44 in message :
>> >> hi!
>> >> 
>> >> Thank you for the comment.
>> >> Unfortunately it is not obvious to me - the "grep fence" output is attached
>> >> in my original message.
>> > 
>> > Hi!
>> > 
>> > OK, seems I missed finding the needle in all the hay...
>> > Anyway I think the problem is "Cluster node siteb-1 will be fenced: peer is
>> > no longer part of the cluster". Looks as if the cluster noticed an unclean
>> > shutdown of node siteb-1. The message "Stonith/shutdown of siteb-1 not
>> > matched" seems to confirm that.
>> > When shutting down two nodes, did you wait until shutdown of the first node
>> > succeeded before shutting down the second?
>> > 
>> > Regards,
>> > Ulrich
>> > 
>> >> 
>> >> I searched older logs with debug information activated for dlm - this is
>> >> the sequence from syslog (from another timeframe):
>> >> 
>> >> ---
>> >> Node: siteb-2 (DC):
>> >> 
>> >> 2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]:   notice: State 
>> >> transition S

[ClusterLabs] Antwort: Antw: Antw: Antwort: Antw: corosync/dlm fencing?

2018-07-16 Thread Philipp Achmüller
Hi,

> From: "Ulrich Windl" 
> To: 
> Date: 16.07.2018 13:46
> Subject: [ClusterLabs] Antw: Antw: Antwort: Antw: corosync/dlm fencing?
> Sent by: "Users" 
> 
> Hi again!
> 
> Oh, I missed "...maintenance i would like to standby 1 or 2 nodes from
> "sitea""...
> 

Yes - sorry for the tons of logs - but I think they capture the whole
situation... 

> I think some time ago I had asked about the same thing for SLES11, and the
> answer was that with some configurations a standby is not possible (I thought
> it was related to OCFS2, but maybe it was cLVM or DLM). Quorum issues aside,
> why not shut down the node completely?

I also tried "systemctl stop pacemaker" and a "normal" system shutdown ->
same outcome: dlm always ends up fencing some other node...
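
For completeness, these are roughly the commands in question (the exact
invocations are reconstructed here, and the node name is just one example):

  # Maintenance via standby (what I want for 1 or 2 nodes from "sitea"):
  crm node standby sitea-1

  # What I also tried, with the same result - dlm fences another node:
  systemctl stop pacemaker    # stop only the cluster stack
  systemctl poweroff          # or a "normal" full shutdown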

> 
> Regards,
> Ulrich
> 
> >>> "Ulrich Windl"  schrieb am 
16.07.2018
> um
> 13:35 in Nachricht <5b4c82f002a10002c...@gwsmtp1.uni-regensburg.de>:
> > >>> Philipp Achmüller wrote on 16.07.2018 at 11:44 in message :
> >> hi!
> >> 
> >> Thank you for the comment.
> >> Unfortunately it is not obvious to me - the "grep fence" output is attached
> >> in my original message.
> > 
> > Hi!
> > 
> > OK, seems I missed finding the needle in all the hay...
> > Anyway I think the problem is "Cluster node siteb-1 will be fenced: peer is
> > no longer part of the cluster". Looks as if the cluster noticed an unclean
> > shutdown of node siteb-1. The message "Stonith/shutdown of siteb-1 not
> > matched" seems to confirm that.
> > When shutting down two nodes, did you wait until shutdown of the first node
> > succeeded before shutting down the second?
> > 
> > Regards,
> > Ulrich
> > 
> >> 
> >> I searched older logs with debug information activated for dlm - this is
> >> the sequence from syslog (from another timeframe):
> >> 
> >> ---
> >> Node: siteb-2 (DC):
> >> 
> >> 2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]:   notice: State 
> >> transition S_IDLE -> S_POLICY_ENGINE
> >> 2018-06-28T09:02:23.279028+02:00 siteb-2 pengine[189259]:   notice: 
> >> Watchdog will be used via SBD if fencing is required
> >> 2018-06-28T09:02:23.279214+02:00 siteb-2 pengine[189259]:   notice: On 
> >> loss of CCM Quorum: Ignore
> >> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move 
> >> stonith-sbd#011(Started sitea-1 -> siteb-1)
> >> 2018-06-28T09:02:23.282249+02:00 siteb-2 pengine[189259]:   notice: Move 
> >> cl-info#011(Started sitea-1 -> siteb-2)
> >> 2018-06-28T09:02:23.282338+02:00 siteb-2 pengine[189259]:   notice: Move 
> >> k45RG#011(Started sitea-2 -> siteb-1)
> >> 2018-06-28T09:02:23.282422+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> dlm:0#011(sitea-1)
> >> 2018-06-28T09:02:23.282505+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> clvm:0#011(sitea-1)
> >> 2018-06-28T09:02:23.282588+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> vg1:0#011(sitea-1)
> >> 2018-06-28T09:02:23.282670+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> dlm:3#011(sitea-2)
> >> 2018-06-28T09:02:23.282752+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> clvm:3#011(sitea-2)
> >> 2018-06-28T09:02:23.282833+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> vg1:3#011(sitea-2)
> >> 2018-06-28T09:02:23.282916+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> sysinfo:0#011(sitea-1)
> >> 2018-06-28T09:02:23.283001+02:00 siteb-2 pengine[189259]:   notice: Stop 
> >> sysinfo:3#011(sitea-2)
> >> 2018-06-28T09:02:23.283978+02:00 siteb-2 pengine[189259]:   notice: 
> >> Calculated transition 1056, saving inputs in 
> >> /var/lib/pacemaker/pengine/pe-input-2321.bz2
> >> 2018-06-28T09:02:23.284428+02:00 siteb-2 crmd[189260]:   notice: 
> >> Processing graph 1056 (ref=pe_calc-dc-1530169343-1339) derived from 
> >> /var/lib/pacemaker/pengine/pe-input-2321.bz2
> >> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation stonith-sbd_stop_0 on sitea-1
> >> 2018-06-28T09:02:23.284659+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation cl-info_stop_0 on sitea-1
> >> 2018-06-28T09:02:23.284742+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation k45RG_stop_0 on sitea-2
> >> 2018-06-28T09:02:23.284824+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation vg1_stop_0 on sitea-1
> >> 2018-06-28T09:02:23.284908+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation vg1_stop_0 on sitea-2
> >> 2018-06-28T09:02:23.284990+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation sysinfo_stop_0 on sitea-1
> >> 2018-06-28T09:02:23.285072+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating stop operation sysinfo_stop_0 on sitea-2
> >> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating start operation stonith-sbd_start_0 on siteb-1
> >> 2018-06-28T09:02:23.298867+02:00 siteb-2 crmd[189260]:   notice: 
> >> Initiating start operation cl-info_start_0 locally

[ClusterLabs] Antw: Antw: Antwort: Antw: corosync/dlm fencing?

2018-07-16 Thread Ulrich Windl
Hi again!

Oh, I missed "...maintenance i would like to standby 1 or 2 nodes from
"sitea""...

I think some time ago I had asked about the same thing for SLES11, and the
answer was that with some configurations a standby is not possible (I thought
it was related to OCFS2, but maybe it was cLVM or DLM). Quorum issues aside,
why not shut down the node completely?

Regards,
Ulrich

>>> "Ulrich Windl"  schrieb am 16.07.2018
um
13:35 in Nachricht <5b4c82f002a10002c...@gwsmtp1.uni-regensburg.de>:
 Philipp Achmüller  schrieb am 16.07.2018 um
> 11:44 in
> Nachricht
:
>> hi!
>> 
>> Thank you for the comment.
>> Unfortunately it is not obvious to me - the "grep fence" output is attached
>> in my original message.
> 
> Hi!
> 
> OK, seems I missed finding the needle in all the hay...
> Anyway I think the problem is "Cluster node siteb-1 will be fenced: peer is
> no longer part of the cluster". Looks as if the cluster noticed an unclean
> shutdown of node siteb-1. The message "Stonith/shutdown of siteb-1 not
> matched" seems to confirm that.
> When shutting down two nodes, did you wait until shutdown of the first node
> succeeded before shutting down the second?
> 
> Regards,
> Ulrich
> 
>> 
>> I searched older logs with debug information activated for dlm - this is
>> the sequence from syslog (from another timeframe):
>> 
>> ---
>> Node: siteb-2 (DC):
>> 
>> 2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]:   notice: State 
>> transition S_IDLE -> S_POLICY_ENGINE
>> 2018-06-28T09:02:23.279028+02:00 siteb-2 pengine[189259]:   notice: 
>> Watchdog will be used via SBD if fencing is required
>> 2018-06-28T09:02:23.279214+02:00 siteb-2 pengine[189259]:   notice: On 
>> loss of CCM Quorum: Ignore
>> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move  
>> stonith-sbd#011(Started sitea-1 -> siteb-1)
>> 2018-06-28T09:02:23.282249+02:00 siteb-2 pengine[189259]:   notice: Move  
>> cl-info#011(Started sitea-1 -> siteb-2)
>> 2018-06-28T09:02:23.282338+02:00 siteb-2 pengine[189259]:   notice: Move  
>> k45RG#011(Started sitea-2 -> siteb-1)
>> 2018-06-28T09:02:23.282422+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> dlm:0#011(sitea-1)
>> 2018-06-28T09:02:23.282505+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> clvm:0#011(sitea-1)
>> 2018-06-28T09:02:23.282588+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> vg1:0#011(sitea-1)
>> 2018-06-28T09:02:23.282670+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> dlm:3#011(sitea-2)
>> 2018-06-28T09:02:23.282752+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> clvm:3#011(sitea-2)
>> 2018-06-28T09:02:23.282833+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> vg1:3#011(sitea-2)
>> 2018-06-28T09:02:23.282916+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> sysinfo:0#011(sitea-1)
>> 2018-06-28T09:02:23.283001+02:00 siteb-2 pengine[189259]:   notice: Stop  
>> sysinfo:3#011(sitea-2)
>> 2018-06-28T09:02:23.283978+02:00 siteb-2 pengine[189259]:   notice: 
>> Calculated transition 1056, saving inputs in 
>> /var/lib/pacemaker/pengine/pe-input-2321.bz2
>> 2018-06-28T09:02:23.284428+02:00 siteb-2 crmd[189260]:   notice: 
>> Processing graph 1056 (ref=pe_calc-dc-1530169343-1339) derived from 
>> /var/lib/pacemaker/pengine/pe-input-2321.bz2
>> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation stonith-sbd_stop_0 on sitea-1
>> 2018-06-28T09:02:23.284659+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation cl-info_stop_0 on sitea-1
>> 2018-06-28T09:02:23.284742+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation k45RG_stop_0 on sitea-2
>> 2018-06-28T09:02:23.284824+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation vg1_stop_0 on sitea-1
>> 2018-06-28T09:02:23.284908+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation vg1_stop_0 on sitea-2
>> 2018-06-28T09:02:23.284990+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation sysinfo_stop_0 on sitea-1
>> 2018-06-28T09:02:23.285072+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating stop operation sysinfo_stop_0 on sitea-2
>> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating start operation stonith-sbd_start_0 on siteb-1
>> 2018-06-28T09:02:23.298867+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating start operation cl-info_start_0 locally on siteb-2
>> 2018-06-28T09:02:23.309272+02:00 siteb-2 lrmd[189257]:   notice: executing
>> - rsc:cl-info action:start call_id:105
>> 2018-06-28T09:02:23.384074+02:00 siteb-2 lrmd[189257]:   notice: finished 
>> - rsc:cl-info action:start call_id:105 pid:253747 exit-code:0 
>> exec-time:75ms queue-time:0ms
>> 2018-06-28T09:02:23.393759+02:00 siteb-2 crmd[189260]:   notice: Result of
>> start operation for cl-info on siteb-2: 0 (ok)
>> 2018-06-28T09:02:23.395594+02:00 siteb-2 crmd[189260]:   notice: 
>> Initiating monitor operation cl-info_monitor_6 locally on siteb-2
>> 2018-06-28T09:02:

[ClusterLabs] Antw: Antwort: Antw: corosync/dlm fencing?

2018-07-16 Thread Ulrich Windl
>>> Philipp Achmüller wrote on 16.07.2018 at 11:44 in message :
> hi!
> 
> Thank you for the comment.
> Unfortunately it is not obvious to me - the "grep fence" output is attached
> in my original message.

Hi!

OK, seems I missed finding the needle in all the hay...
Anyway I think the problem is "Cluster node siteb-1 will be fenced: peer is no
longer part of the cluster". Looks as if the cluster noticed an unclean
shutdown of node siteb-1. The message "Stonith/shutdown of siteb-1 not matched"
seems to confirm that.
When shutting down two nodes, did you wait until shutdown of the first node
succeeded before shutting down the second?
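
If not, something along these lines gives the cluster time to settle in
between (just a sketch, not tested here):

  # Take the nodes out one at a time, waiting for each transition to finish:
  crm node standby sitea-1
  crm_resource --wait      # returns once no cluster actions are pending
  crm node standby sitea-2
  crm_resource --wait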

Regards,
Ulrich

> 
> I searched older logs with debug information activated for dlm - this is
> the sequence from syslog (from another timeframe):
> 
> ---
> Node: siteb-2 (DC):
> 
> 2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]:   notice: State 
> transition S_IDLE -> S_POLICY_ENGINE
> 2018-06-28T09:02:23.279028+02:00 siteb-2 pengine[189259]:   notice: 
> Watchdog will be used via SBD if fencing is required
> 2018-06-28T09:02:23.279214+02:00 siteb-2 pengine[189259]:   notice: On 
> loss of CCM Quorum: Ignore
> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move  
> stonith-sbd#011(Started sitea-1 -> siteb-1)
> 2018-06-28T09:02:23.282249+02:00 siteb-2 pengine[189259]:   notice: Move  
> cl-info#011(Started sitea-1 -> siteb-2)
> 2018-06-28T09:02:23.282338+02:00 siteb-2 pengine[189259]:   notice: Move  
> k45RG#011(Started sitea-2 -> siteb-1)
> 2018-06-28T09:02:23.282422+02:00 siteb-2 pengine[189259]:   notice: Stop  
> dlm:0#011(sitea-1)
> 2018-06-28T09:02:23.282505+02:00 siteb-2 pengine[189259]:   notice: Stop  
> clvm:0#011(sitea-1)
> 2018-06-28T09:02:23.282588+02:00 siteb-2 pengine[189259]:   notice: Stop  
> vg1:0#011(sitea-1)
> 2018-06-28T09:02:23.282670+02:00 siteb-2 pengine[189259]:   notice: Stop  
> dlm:3#011(sitea-2)
> 2018-06-28T09:02:23.282752+02:00 siteb-2 pengine[189259]:   notice: Stop  
> clvm:3#011(sitea-2)
> 2018-06-28T09:02:23.282833+02:00 siteb-2 pengine[189259]:   notice: Stop  
> vg1:3#011(sitea-2)
> 2018-06-28T09:02:23.282916+02:00 siteb-2 pengine[189259]:   notice: Stop  
> sysinfo:0#011(sitea-1)
> 2018-06-28T09:02:23.283001+02:00 siteb-2 pengine[189259]:   notice: Stop  
> sysinfo:3#011(sitea-2)
> 2018-06-28T09:02:23.283978+02:00 siteb-2 pengine[189259]:   notice: 
> Calculated transition 1056, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-2321.bz2
> 2018-06-28T09:02:23.284428+02:00 siteb-2 crmd[189260]:   notice: 
> Processing graph 1056 (ref=pe_calc-dc-1530169343-1339) derived from 
> /var/lib/pacemaker/pengine/pe-input-2321.bz2
> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation stonith-sbd_stop_0 on sitea-1
> 2018-06-28T09:02:23.284659+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation cl-info_stop_0 on sitea-1
> 2018-06-28T09:02:23.284742+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation k45RG_stop_0 on sitea-2
> 2018-06-28T09:02:23.284824+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation vg1_stop_0 on sitea-1
> 2018-06-28T09:02:23.284908+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation vg1_stop_0 on sitea-2
> 2018-06-28T09:02:23.284990+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation sysinfo_stop_0 on sitea-1
> 2018-06-28T09:02:23.285072+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation sysinfo_stop_0 on sitea-2
> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating start operation stonith-sbd_start_0 on siteb-1
> 2018-06-28T09:02:23.298867+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating start operation cl-info_start_0 locally on siteb-2
> 2018-06-28T09:02:23.309272+02:00 siteb-2 lrmd[189257]:   notice: executing 
> - rsc:cl-info action:start call_id:105
> 2018-06-28T09:02:23.384074+02:00 siteb-2 lrmd[189257]:   notice: finished 
> - rsc:cl-info action:start call_id:105 pid:253747 exit-code:0 
> exec-time:75ms queue-time:0ms
> 2018-06-28T09:02:23.393759+02:00 siteb-2 crmd[189260]:   notice: Result of 
> start operation for cl-info on siteb-2: 0 (ok)
> 2018-06-28T09:02:23.395594+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating monitor operation cl-info_monitor_6 locally on siteb-2
> 2018-06-28T09:02:24.159586+02:00 siteb-2 crmd[189260]:   notice: 
> Initiating stop operation clvm_stop_0 on sitea-2
> 2018-06-28T09:02:24.193317+02:00 siteb-2 kernel: [80844.122213] dlm: 
> clvmd: dlm_recover 5
> 2018-06-28T09:02:24.193349+02:00 siteb-2 kernel: [80844.122240] dlm: 
> clvmd: dlm_clear_toss 1 done
> 2018-06-28T09:02:24.193351+02:00 siteb-2 kernel: [80844.122251] dlm: 
> clvmd: remove member 3
> 2018-06-28T09:02:24.193352+02:00 siteb-2 kernel: [80844.122579] dlm: 
> clvmd: dlm_recover_members 3 nodes
> 2018-06-28T09:02:24.993269+02:00 siteb-2 kernel: [80844.920751] dlm: 
> clvmd: generation 7 slots 3 1:2 2:4 3:

[ClusterLabs] Antwort: Antw: corosync/dlm fencing?

2018-07-16 Thread Philipp Achmüller
hi!

Thank you for the comment.
Unfortunately it is not obvious to me - the "grep fence" output is attached in
my original message.
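
Besides the syslogs, the live dlm state can also be dumped on each node if
that helps - these are just the standard dlm_tool calls, not output from this
cluster:

  dlm_tool ls        # list lockspaces and their current member/change state
  dlm_tool status    # dlm_controld status, including fencing-related state
  dlm_tool dump      # dump the dlm_controld debug buffer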

I searched older logs with debug information activated for dlm - this is
the sequence from syslog (from another timeframe):

---
Node: siteb-2 (DC):

2018-06-28T09:02:23.272415+02:00 siteb-2 crmd[189260]:   notice: State 
transition S_IDLE -> S_POLICY_ENGINE
2018-06-28T09:02:23.279028+02:00 siteb-2 pengine[189259]:   notice: 
Watchdog will be used via SBD if fencing is required
2018-06-28T09:02:23.279214+02:00 siteb-2 pengine[189259]:   notice: On 
loss of CCM Quorum: Ignore
2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]:   notice: Move  
stonith-sbd#011(Started sitea-1 -> siteb-1)
2018-06-28T09:02:23.282249+02:00 siteb-2 pengine[189259]:   notice: Move  
cl-info#011(Started sitea-1 -> siteb-2)
2018-06-28T09:02:23.282338+02:00 siteb-2 pengine[189259]:   notice: Move  
k45RG#011(Started sitea-2 -> siteb-1)
2018-06-28T09:02:23.282422+02:00 siteb-2 pengine[189259]:   notice: Stop  
dlm:0#011(sitea-1)
2018-06-28T09:02:23.282505+02:00 siteb-2 pengine[189259]:   notice: Stop  
clvm:0#011(sitea-1)
2018-06-28T09:02:23.282588+02:00 siteb-2 pengine[189259]:   notice: Stop  
vg1:0#011(sitea-1)
2018-06-28T09:02:23.282670+02:00 siteb-2 pengine[189259]:   notice: Stop  
dlm:3#011(sitea-2)
2018-06-28T09:02:23.282752+02:00 siteb-2 pengine[189259]:   notice: Stop  
clvm:3#011(sitea-2)
2018-06-28T09:02:23.282833+02:00 siteb-2 pengine[189259]:   notice: Stop  
vg1:3#011(sitea-2)
2018-06-28T09:02:23.282916+02:00 siteb-2 pengine[189259]:   notice: Stop  
sysinfo:0#011(sitea-1)
2018-06-28T09:02:23.283001+02:00 siteb-2 pengine[189259]:   notice: Stop  
sysinfo:3#011(sitea-2)
2018-06-28T09:02:23.283978+02:00 siteb-2 pengine[189259]:   notice: 
Calculated transition 1056, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-2321.bz2
2018-06-28T09:02:23.284428+02:00 siteb-2 crmd[189260]:   notice: 
Processing graph 1056 (ref=pe_calc-dc-1530169343-1339) derived from 
/var/lib/pacemaker/pengine/pe-input-2321.bz2
2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation stonith-sbd_stop_0 on sitea-1
2018-06-28T09:02:23.284659+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation cl-info_stop_0 on sitea-1
2018-06-28T09:02:23.284742+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation k45RG_stop_0 on sitea-2
2018-06-28T09:02:23.284824+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation vg1_stop_0 on sitea-1
2018-06-28T09:02:23.284908+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation vg1_stop_0 on sitea-2
2018-06-28T09:02:23.284990+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation sysinfo_stop_0 on sitea-1
2018-06-28T09:02:23.285072+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation sysinfo_stop_0 on sitea-2
2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]:   notice: 
Initiating start operation stonith-sbd_start_0 on siteb-1
2018-06-28T09:02:23.298867+02:00 siteb-2 crmd[189260]:   notice: 
Initiating start operation cl-info_start_0 locally on siteb-2
2018-06-28T09:02:23.309272+02:00 siteb-2 lrmd[189257]:   notice: executing 
- rsc:cl-info action:start call_id:105
2018-06-28T09:02:23.384074+02:00 siteb-2 lrmd[189257]:   notice: finished 
- rsc:cl-info action:start call_id:105 pid:253747 exit-code:0 
exec-time:75ms queue-time:0ms
2018-06-28T09:02:23.393759+02:00 siteb-2 crmd[189260]:   notice: Result of 
start operation for cl-info on siteb-2: 0 (ok)
2018-06-28T09:02:23.395594+02:00 siteb-2 crmd[189260]:   notice: 
Initiating monitor operation cl-info_monitor_6 locally on siteb-2
2018-06-28T09:02:24.159586+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation clvm_stop_0 on sitea-2
2018-06-28T09:02:24.193317+02:00 siteb-2 kernel: [80844.122213] dlm: 
clvmd: dlm_recover 5
2018-06-28T09:02:24.193349+02:00 siteb-2 kernel: [80844.122240] dlm: 
clvmd: dlm_clear_toss 1 done
2018-06-28T09:02:24.193351+02:00 siteb-2 kernel: [80844.122251] dlm: 
clvmd: remove member 3
2018-06-28T09:02:24.193352+02:00 siteb-2 kernel: [80844.122579] dlm: 
clvmd: dlm_recover_members 3 nodes
2018-06-28T09:02:24.993269+02:00 siteb-2 kernel: [80844.920751] dlm: 
clvmd: generation 7 slots 3 1:2 2:4 3:1
2018-06-28T09:02:24.993283+02:00 siteb-2 kernel: [80844.920755] dlm: 
clvmd: dlm_recover_directory
2018-06-28T09:02:25.189265+02:00 siteb-2 kernel: [80845.118173] dlm: 
clvmd: dlm_recover_directory 1 in 0 new
2018-06-28T09:02:25.604661+02:00 siteb-2 crmd[189260]:   notice: 
Initiating stop operation clvm_stop_0 on sitea-1
2018-06-28T09:02:26.189277+02:00 siteb-2 kernel: [80846.116630] dlm: 
clvmd: dlm_recover_directory 0 out 2 messages
2018-06-28T09:02:26.189291+02:00 siteb-2 kernel: [80846.116635] dlm: 
clvmd: dlm_recover_masters
2018-06-28T09:02:26.189292+02:00 siteb-2 kernel: [80846.116648] dlm: 
clvmd: dlm_recover_masters 0 of 2
2018-06-28T09:02:26.189293+02:00 siteb-2 kernel: