Re: [ClusterLabs] why is node fenced ?
----- On Aug 9, 2020, at 10:17 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:

>> So this appears to be the problem. From these logs I would guess the
>> successful stop on ha-idg-1 did not get written to the CIB for some
>> reason. I'd look at the pe input from this transition on ha-idg-2 to
>> confirm that.
>>
>> Without the DC knowing about the stop, it tries to schedule a new one,
>> but the node is shutting down so it can't do it, which means it has to
>> be fenced.

I checked all relevant pe-files in this time period. This is what I found out (I only quote the important entries):

ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3116 -G transition-3116.xml -D transition-3116.dot

Current cluster status:
...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Started ha-idg-1

Transition Summary:
...
 * Migrate    vm_nextcloud   ( ha-idg-1 -> ha-idg-2 )

Executing cluster transition:
 * Resource action: vm_nextcloud   migrate_from on ha-idg-2   <=== migrate vm_nextcloud
 * Resource action: vm_nextcloud   stop on ha-idg-1
 * Pseudo action:   vm_nextcloud_start_0

Revised cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Started ha-idg-2


ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-48 -G transition-4514.xml -D transition-4514.dot

Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
...
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  FAILED [ ha-idg-2 ha-idg-1 ]   <=== migration failed

Transition Summary:
...
 * Recover    vm_nextcloud   ( ha-idg-2 )

Executing cluster transition:
 * Resource action: vm_nextcloud   stop on ha-idg-2
 * Resource action: vm_nextcloud   stop on ha-idg-1
 * Resource action: vm_nextcloud   start on ha-idg-2
 * Resource action: vm_nextcloud   monitor=3 on ha-idg-2

Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Started ha-idg-2


ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3117 -G transition-3117.xml -D transition-3117.dot

Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  FAILED ha-idg-2   <=== start on ha-idg-2 failed

Transition Summary:
 * Stop       vm_nextcloud   ( ha-idg-2 )   due to node availability   <=== stop vm_nextcloud (what does "due to node availability" mean?)

Executing cluster transition:
 * Resource action: vm_nextcloud   stop on ha-idg-2

Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped


ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot

Current cluster status:
Node ha-idg-1 (1084777482): standby
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped   <=== vm_nextcloud is stopped

Transition Summary:
 * Shutdown ha-idg-1

Executing cluster transition:
 * Resource action: vm_nextcloud   stop on ha-idg-1   <=== why stop? It is already stopped.

Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped


ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-3545 -G transition-0.xml -D transition-0.dot

Current cluster status:
Node ha-idg-1 (1084777482): pending
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped   <=== vm_nextcloud is stopped

Transition Summary:

Executing cluster transition:
Using the original execution date of: 2020-07-20 15:05:33Z

Revised cluster status:
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped


ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-749 -G transition-1.xml -D transition-1.dot

Current cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped   <=== vm_nextcloud is stopped

Transition Summary:
 * Fence (Off) ha-idg-1 'resource actions are unrunnable'

Executing cluster transition:
 * Fencing ha-idg-1 (Off)
 * Pseudo action:   vm_nextcloud_stop_0   <=== why stop? It is already stopped.

Revised cluster status:
Node ha-idg-1 (1084777482): OFFLINE (standby)
Online: [ ha-idg-2 ]
 vm_nextcloud   (ocf::heartbeat:VirtualDomain):  Stopped

I don't understand why the cluster tries to stop a resource which is already stopped.

Bernd
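One way to follow up on Ken's suggestion would be to look at the operation history recorded for vm_nextcloud inside the pe file that ha-idg-2 scheduled from. A minimal sketch, assuming the pe files are the usual bzip2-compressed copies from /var/lib/pacemaker/pengine and that xmllint is available:

    # show the operation history the DC had for vm_nextcloud when this transition was computed
    bzcat pe-input-3545.bz2 | xmllint --xpath '//lrm_resource[@id="vm_nextcloud"]' -

If the successful stop on ha-idg-1 does not show up among the lrm_rsc_op entries there, that would confirm that the stop result never made it into the CIB copy the DC used for scheduling.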
Re: [ClusterLabs] Antw: [EXT] Stonith failing
Thanks to all your suggestions, I now have the systems with stonith configured on IPMI. Two questions:

- How can I simulate a stonith situation, to check that everything is OK?
- Considering that I have both nodes with stonith against the other node, once the two nodes can communicate, how can I be sure the two nodes will not try to stonith each other? :)

Thanks!
Gabriele

From: Gabriele Bulfon
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 July 2020 14.22.42 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing

It is a ZFS based illumos system. I don't think SBD is an option. Is there a reliable ZFS based stonith?

Gabriele

From: Andrei Borzenkov
To: Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 July 2020 9.46.09 CEST
Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing

On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon gbul...@sonicle.com wrote:
> That one was taken from a specific implementation on Solaris 11.
> The situation is a dual node server with a shared storage controller: both nodes see the same disks concurrently.
> Here we must be sure that the two nodes are not going to import/mount the same zpool at the same time, or we will encounter data corruption: ssh based "stonith" cannot guarantee it.
> Node 1 will be preferred for pool 1 and node 2 for pool 2; only in case one of the nodes goes down or is taken offline should the resources first be freed by the leaving node and then taken over by the other node.
> Would you suggest one of the available stonith agents in this case?

IPMI, managed PDU, SBD ... In practice, the only stonith method that works in case of complete node outage, including loss of any power supply, is SBD.
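On the two questions above, a common approach is to trigger a fence manually to test it, and to use a random fencing delay so that both nodes cannot shoot at exactly the same moment. A sketch only: the node names, IPMI address and credentials are placeholders, and external/ipmi may need to be swapped for whatever IPMI fence agent is actually available on the platform:

    # 1) Test fencing: ask the cluster to fence one node and watch it power-cycle.
    #    Run this from the node that should survive; crm_mon shows the fencing action.
    stonith_admin --reboot node1
    # with crmsh this can also be done as:
    crm node fence node1

    # 2) Limit the risk of mutual fencing: give the stonith resources a random
    #    delay via pcmk_delay_max, and keep each stonith resource off the node
    #    it is meant to kill.
    crm configure primitive fence-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.0.101 userid=admin passwd=secret interface=lanplus \
        pcmk_delay_max=15s
    crm configure location l-fence-node1 fence-node1 -inf: node1

Another simple test is to make one node look dead (for example by killing corosync or pulling its cluster link) and then checking that the peer fences it and takes over the resources.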
Re: [ClusterLabs] why is node fenced ?
----- On Aug 10, 2020, at 11:59 PM, kgaillot kgail...@redhat.com wrote:

> The most recent transition is aborted, but since all its actions are
> complete, the only effect is to trigger a new transition.
>
> We should probably rephrase the log message. In fact, the whole
> "transition" terminology is kind of obscure. It's hard to come up with
> something better though.

Hi Ken,

I don't get it. How can something be aborted that is already complete?

Bernd
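A "transition" here is just the graph of actions the scheduler computed from one CIB snapshot. The .dot files saved by the crm_simulate runs in the earlier mail can be rendered to see exactly what such a graph contains; a minimal sketch, assuming graphviz is installed and using the file names from those runs:

    # render the transition graph that crm_simulate saved via -D
    dot -Tsvg transition-3116.dot -o transition-3116.svg

Each box in the rendered graph is one action (start, stop, monitor, fence) and the arrows are ordering dependencies. Once every action in the graph has completed, "aborting" the transition has no effect other than triggering a new calculation, which is what Ken describes above.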