Hello again Ken et al.

I learned a lot while investigating this issue, but I feel I need a bit more help from you guys.
It's clear that the monitoring process is reporting a timeout. I've increased this timeout to 30s using pcmk_monitoring_timeout, and the process has not failed during the last 2 hours. Still, I'd like to understand in more detail how this process works: if I'm getting a timeout after 20 seconds, it looks to me like something else could be happening in my systems.

I tried enabling debug again and, as before, the 'debug' option creates the file but does not write anything to it unless I also enable 'verbose'. The funny thing is that when I enable 'verbose', I hit a bug and the fencing does not start: https://bugzilla.redhat.com/show_bug.cgi?id=1549366

I enabled debug at the corosync layer and got some more information, which helped me understand this issue better, but it is still not enough to narrow down where the issue comes from.

That said, I'd like to know whether there is a way to review in more detail what the monitoring process is doing (ping, status, etc.) and whether all those seconds are spent on the same action.

Any ideas are more than welcome. As always, I appreciate your help.

Regards
Javier

Francisco Javier Lopez
IT System Engineer | Global IT
O: +34 619 728 249 | M: +34 619 728 249 | franciscojavier.lo...@solera.com | Solera.com <https://www.solera.com/>
Audatex Datos, S.A. | Avda. de Bruselas, 36, Salida 16, A‑1 (Diversia), Alcobendas, Madrid, 28108, Spain

On 5/21/2019 6:19 PM, Ken Gaillot wrote:
> On Tue, 2019-05-21 at 11:10 +0000, Lopez, Francisco Javier [Global IT] wrote:
>> Hello guys!
>>
>> Need your help to try to understand and debug what I'm facing in one of my clusters.
>> I set up fencing with this detail:
>>
>> # pcs -f stonith_cfg stonith create fence_ao_pg01 fence_vmware_soap ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" pcmk_reboot_action=reboot pcmk_host_list="ao-pg01-p.axadmin.net" power_wait=3 op monitor interval=60s
>> # pcs -f stonith_cfg stonith create fence_ao_pg02 fence_vmware_soap ipaddr=<IP> ssl_insecure=1 login="<User>" passwd="<Passwd>" pcmk_reboot_action=reboot pcmk_host_list="ao-pg02-p.axadmin.net" power_wait=3 op monitor interval=60s
>> # pcs -f stonith_cfg constraint location fence_ao_pg01 avoids ao-pg01-p.axadmin.net=INFINITY
>> # pcs -f stonith_cfg constraint location fence_ao_pg02 avoids ao-pg02-p.axadmin.net=INFINITY
>> # pcs cluster cib-push stonith_cfg
>>
>> The pcs status shows all ok during some time and then it turns to:
>>
>> [root@ao-pg01-p ~]# pcs status --full
>> Cluster name: ao_cl_p_01
>> Stack: corosync
>> Current DC: ao-pg01-p.axadmin.net (1) (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
>> Last updated: Tue May 21 12:18:46 2019
>> Last change: Fri May 17 18:54:32 2019 by hacluster via crmd on ao-pg01-p.axadmin.net
>>
>> 2 nodes configured
>> 3 resources configured
>>
>> Online: [ ao-pg01-p.axadmin.net (1) ao-pg02-p.axadmin.net (2) ]
>>
>> Full list of resources:
>>
>>  ao-cl-p-01-vip01 (ocf::heartbeat:IPaddr2): Started ao-pg01-p.axadmin.net
>>  fence_ao_pg01 (stonith:fence_vmware_soap): Stopped
>>  fence_ao_pg02 (stonith:fence_vmware_soap): Stopped
>>
>> Node Attributes:
>> * Node ao-pg01-p.axadmin.net (1):
>> * Node ao-pg02-p.axadmin.net (2):
>>
>> Migration Summary:
>> * Node ao-pg02-p.axadmin.net (2):
>>    fence_ao_pg01: migration-threshold=1000000 fail-count=1000000 last-failure='Sat May 18 00:22:22 2019'
>> * Node ao-pg01-p.axadmin.net (1):
>>    fence_ao_pg02: migration-threshold=1000000 fail-count=1000000 last-failure='Fri May 17 20:52:53 2019'
>>
>> Failed Actions:
>> * fence_ao_pg01_start_0 on ao-pg02-p.axadmin.net 'unknown error' (1): call=22, status=Timed Out, exitreason='',
>>     last-rc-change='Sat May 18 00:19:49 2019', queued=0ms, exec=20022ms
>> * fence_ao_pg02_start_0 on ao-pg01-p.axadmin.net 'unknown error' (1): call=84, status=Timed Out, exitreason='',
>>     last-rc-change='Fri May 17 20:52:33 2019', queued=0ms, exec=20032ms
>>
>> PCSD Status:
>>   ao-pg02-p.axadmin.net: Online
>>   ao-pg01-p.axadmin.net: Online
>>
>> Daemon Status:
>>   corosync: active/disabled
>>   pacemaker: active/disabled
>>   pcsd: active/enabled
>>
>> From the output I see there seems to be a 'Timed Out' but I'd like to understand if this is a configuration issue or something else I'm not aware of.
>
> When pacemaker starts a fence device, it issues a monitor command to the fence agent. That command is what's timing out here.
>
> The first thing I'd try is running the monitor command manually using the parameters in the device configuration. The fence agent likely has a debug option you could turn on to get more details.
>
>> I'm attaching part of the log that shows the problem related to 17-May.
>>
>> Regards
>>
>> Francisco Javier Lopez
>> IT System Engineer | Global IT
>> O: +34 619 728 249 | M: +34 619 728 249 | franciscojavier.lo...@solera.com | Solera.com
>> Audatex Datos, S.A. | Avda. de Bruselas, 36, Salida 16, A‑1 (Diversia), Alcobendas, Madrid, 28108, Spain
>>
>> This e-mail and any attached files are confidential and intended for the named addressee(s) only. If you have received this message in error, please notify the sender and delete the email immediately. The company does not conclude contracts by email and all negotiations are subject to written contract.
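The manual check suggested in the quoted reply (running the agent's monitor action by hand with the device parameters) could be sketched roughly as below, wrapped in a small timing loop so that each run's duration can be compared with the roughly 20-second exec= values in the failed actions. This is only a sketch under assumptions: the helper name time_runs is hypothetical, the long options in the commented example are assumed to be the command-line equivalents of the configured ipaddr/login/passwd/ssl_insecure parameters (check fence_vmware_soap --help on your version), and the nanosecond timestamps rely on GNU date.

```shell
#!/usr/bin/env bash
# Sketch: time repeated runs of a command, e.g. a fence agent's monitor action.

# time_runs COUNT CMD [ARGS...]
# Runs CMD COUNT times and prints one line per run:
#   run=<n> rc=<exit code> elapsed_ms=<wall-clock milliseconds>
time_runs() {
    local count=$1; shift
    local i start end rc
    for ((i = 1; i <= count; i++)); do
        start=$(date +%s%N)            # nanoseconds since epoch (GNU date)
        rc=0
        "$@" >/dev/null 2>&1 || rc=$?  # keep the exit code without aborting
        end=$(date +%s%N)
        echo "run=$i rc=$rc elapsed_ms=$(( (end - start) / 1000000 ))"
    done
}

# Example invocation (assumed option names; placeholders as in the
# original device configuration, to be filled in by hand):
#
#   time_runs 5 fence_vmware_soap --ip "<IP>" --username "<User>" \
#       --password "<Passwd>" --ssl-insecure --action monitor
```

If individual runs started by hand also come close to 20 seconds, the delay is more likely in the connection to the VMware endpoint than in Pacemaker itself; if they are consistently fast, the occasional slow run is what needs to be caught, and a longer loop may help.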
_______________________________________________
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/