Hi,

I have the following situation: a 2-node cluster with only one node running
(ha-idg-1); the second node (ha-idg-2) is in standby. The DLM monitor on
ha-idg-1 times out, so the cluster tries to restart all services that depend
on DLM:
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Recover    dlm:0                       ( ha-idg-1 )
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    clvmd:0                     ( ha-idg-1 )   due to required dlm:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    gfs2_share:0                ( ha-idg-1 )   due to required clvmd:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    gfs2_snap:0                 ( ha-idg-1 )   due to required gfs2_share:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    fs_ocfs2:0                  ( ha-idg-1 )   due to required gfs2_snap:0 start
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   dlm:1   (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   clvmd:1 (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   gfs2_share:1    (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   gfs2_snap:1     (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   fs_ocfs2:1      (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   ClusterMon-SMTP:0       (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:     info: LogActions:      Leave   ClusterMon-SMTP:1       (Stopped)
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    vm-mausdb                   ( ha-idg-1 )   due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    vm-sim                      ( ha-idg-1 )   due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    vm-geneious                 ( ha-idg-1 )   due to required cl_share running
Aug 03 01:07:11 [19367] ha-idg-1    pengine:   notice: LogAction:        * Restart    vm-idcc-devel               ( ha-idg-1 )   due to required cl_share running
 ...
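
A configuration that produces such a chain looks roughly like this (a minimal
sketch in crmsh syntax; only the member resource names and the clone name
cl_share are taken from the log above, the group/clone layout and the
constraint IDs are assumptions):

  # storage stack cloned on both nodes; the group order yields the
  # dlm -> clvmd -> gfs2_share -> gfs2_snap -> fs_ocfs2 dependency chain
  group g_share dlm clvmd gfs2_share gfs2_snap fs_ocfs2
  clone cl_share g_share meta interleave=true
  # each VM requires the storage clone to be running on its node
  order o_share_before_vm inf: cl_share vm-mausdb
  colocation col_vm_with_share inf: vm-mausdb cl_share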

The restart of vm-mausdb failed; the stop operation timed out:
VirtualDomain(vm-mausdb)[32415]:        2022/08/03_01:19:06 INFO: Issuing forced shutdown (destroy) request for domain vm-mausdb.
Aug 03 01:19:11 [19365] ha-idg-1       lrmd:  warning: child_timeout_callback:  vm-mausdb_stop_0 process (PID 32415) timed out
Aug 03 01:19:11 [19365] ha-idg-1       lrmd:  warning: operation_finished:      vm-mausdb_stop_0:32415 - timed out after 720000ms
 ...
Aug 03 01:19:14 [19367] ha-idg-1    pengine:  warning: pe_fence_node:   Cluster node ha-idg-1 will be fenced: vm-mausdb failed there
Aug 03 01:19:15 [19368] ha-idg-1       crmd:   notice: te_fence_node:   Requesting fencing (Off) of node ha-idg-1 | action=8 timeout=60000
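
(The 720000 ms are the stop timeout configured on the resource; in crmsh
syntax that would look roughly like this, with the config path as a
placeholder:

  primitive vm-mausdb ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/vm-mausdb.xml" \
      op stop timeout=720s interval=0

A stop operation that exceeds its timeout counts as a failed stop, which
Pacemaker escalates to fencing the node, as the pe_fence_node message above
shows.)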

I have two fencing resources defined, one for ha-idg-1 and one for ha-idg-2;
both are HP iLO network adapters. I have two location constraints which
ensure that the resource for fencing node ha-idg-1 runs on ha-idg-2 and vice
versa.
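
In crmsh syntax the constraints would look roughly like this (the device
name fence_ilo_ha-idg-2 appears in the log below; its counterpart and the
constraint IDs are assumed):

  location loc_fence_ha-idg-1 fence_ilo_ha-idg-1 inf: ha-idg-2
  location loc_fence_ha-idg-2 fence_ilo_ha-idg-2 inf: ha-idg-1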
I never thought it would be necessary for a node to be able to fence itself.
So now, with ha-idg-2 in standby, there is no fence device left to stonith
ha-idg-1:
Aug 03 01:19:58 [19364] ha-idg-1 stonith-ng:   notice: log_operation:   Operation 'Off' [20705] (call 2 from crmd.19368) for host 'ha-idg-1' with device 'fence_ilo_ha-idg-2' returned: 0 (OK)
So the cluster started the fence resource available on ha-idg-1
(fence_ilo_ha-idg-2) and cut off ha-idg-2, which wasn't necessary.

Finally, the cluster seems to realize that something went wrong:
Aug 03 01:19:58 [19368] ha-idg-1       crmd:     crit: tengine_stonith_notify:  We were allegedly just fenced by ha-idg-1 for ha-idg-1!

So my question now is: is it necessary to have a fencing device with which a
node can fence itself (commit suicide)?
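
Put differently: would it be enough to make each device's placement a
preference rather than a hard pin, so it could fall back to the node it
targets? A sketch in crmsh syntax (untested; constraint ID assumed):

  # prefer the partner node, but allow fallback to ha-idg-1 itself
  location loc_fence_ha-idg-1 fence_ilo_ha-idg-1 100: ha-idg-2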

Bernd

-- 
Bernd Lentes 
System Administrator 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 89 3187 1241
       +49 89 3187 49123 
fax:   +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 

