Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
On Fri, 2019-07-05 at 13:07 +0200, Lentes, Bernd wrote:
> - On Jul 4, 2019, at 1:25 AM, kgaillot kgail...@redhat.com wrote:
>
> > On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> > > - On Jun 15, 2019, at 4:30 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:
> > > > - On 14 Jun 2019 at 21:20, kgaillot kgail...@redhat.com wrote:
> > > > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> > > > > > I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> > > > > > And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> > > > > > I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?
>
> > It did indeed turn out to be.
> >
> > The resource history on ha-idg-1 shows the last failed action as a migrate_to from ha-idg-1 to ha-idg-2, and the last successful action as a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker as to the current status of the migration.
> >
> > A full migration is migrate_to on the source node, migrate_from on the target node, and stop on the source node. When the resource history has a failed migrate_to on the source, and a stop but no migrate_from on the target, the migration is considered "dangling" and forces a stop of the resource on the source, because it's possible the migrate_from never got a chance to be scheduled.
> >
> > That is wrong in this situation. The resource is happily running on the node with the failed migrate_to because it was later moved back successfully, and the failed migrate_to is no longer relevant.
> >
> > My current plan for a fix is that if a node with a failed migrate_to has a successful migrate_from or start that's newer, and the target node of the failed migrate_to has a successful stop, then the migration should not be considered dangling.
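For reference, the per-resource operation history that the scheduler bases this on is recorded in the status section of the CIB. A rough way to inspect it (option letters as in pacemaker 1.1-era tools, so check the local man pages; the resource name is just one of the domains from this thread):

    # One-shot cluster view including recorded operation history and fail counts
    crm_mon -1 -o -f

    # Dump the raw status section and pick out the recorded migrate/start/stop
    # operations for one domain; each lrm_rsc_op entry carries an rc-code (0 = ok)
    cibadmin -Q -o status | grep -E 'vm_severin_(migrate_to|migrate_from|start|stop)_0'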
>
> > A couple of side notes on your configuration:
> >
> > Instead of putting action=off in fence device configurations, you should use pcmk_reboot_action=off. Pacemaker adds action when sending the fence command.
>
> I did that already.
>
> > When keeping a fence device off its target node, use a finite negative score rather than -INFINITY. This ensures the node can fence itself as a last resort.
>
> I will do that.
>
> Thanks for clarifying this, it happened quite often.
> I conclude that it's very important to clean up a resource failure quickly after finding the cause and solving the problem, and not leave any errors pending.

This is the first bug I can recall that was triggered by an old failure, so I don't think it's important as a general policy outside of live migrations. I've got a fix I'll merge soon.

> Bernd

--
Ken Gaillot
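Put together, the two side notes correspond to a fence device configuration roughly like the following. This is only a sketch in crm shell syntax; the fence agent, address, credentials, constraint id and the exact -5000 score are placeholders, not the actual values from this cluster.

    # Fence device for ha-idg-2: use pcmk_reboot_action=off instead of action=off
    crm configure primitive fence_ilo_ha-idg-2 stonith:fence_ilo4 \
        params ipaddr=ilo-ha-idg-2.example.com login=fenceadmin passwd=secret \
               pcmk_reboot_action=off
    # Keep the device off the node it fences with a finite negative score,
    # not -INFINITY, so that node can still fence itself as a last resort
    crm configure location l-fence-ha-idg-2 fence_ilo_ha-idg-2 -5000: ha-idg-2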
Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
- On Jul 4, 2019, at 1:25 AM, kgaillot kgail...@redhat.com wrote:

> On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> > - On Jun 15, 2019, at 4:30 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:
> > > - On 14 Jun 2019 at 21:20, kgaillot kgail...@redhat.com wrote:
> > > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > > Hi,
> > > > >
> > > > > I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> > > > > I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> > > > > And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> > > > > I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?
>
> It did indeed turn out to be.
>
> The resource history on ha-idg-1 shows the last failed action as a migrate_to from ha-idg-1 to ha-idg-2, and the last successful action as a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker as to the current status of the migration.
>
> A full migration is migrate_to on the source node, migrate_from on the target node, and stop on the source node. When the resource history has a failed migrate_to on the source, and a stop but no migrate_from on the target, the migration is considered "dangling" and forces a stop of the resource on the source, because it's possible the migrate_from never got a chance to be scheduled.
>
> That is wrong in this situation. The resource is happily running on the node with the failed migrate_to because it was later moved back successfully, and the failed migrate_to is no longer relevant.
>
> My current plan for a fix is that if a node with a failed migrate_to has a successful migrate_from or start that's newer, and the target node of the failed migrate_to has a successful stop, then the migration should not be considered dangling.
>
> A couple of side notes on your configuration:
>
> Instead of putting action=off in fence device configurations, you should use pcmk_reboot_action=off. Pacemaker adds action when sending the fence command.

I did that already.

> When keeping a fence device off its target node, use a finite negative score rather than -INFINITY. This ensures the node can fence itself as a last resort.

I will do that.

Thanks for clarifying this, it happened quite often.
I conclude that it's very important to clean up a resource failure quickly after finding the cause and solving the problem, and not leave any errors pending.

Bernd
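Once the cause of an old failure is understood and fixed, the stale entry can be cleared so it no longer sits in the resource history the scheduler reads. A minimal sketch with crmsh, using the resource and node names from this thread:

    # Clear the recorded failures (e.g. the old failed migrate_to) on ha-idg-1,
    # then confirm the fail counts are gone
    crm resource cleanup vm_idcc_devel ha-idg-1
    crm resource cleanup vm_severin ha-idg-1
    crm_mon -1 -f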
Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> - On Jun 15, 2019, at 4:30 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:
> > - On 14 Jun 2019 at 21:20, kgaillot kgail...@redhat.com wrote:
> > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > Hi,
> > > >
> > > > I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> > > > I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> > > > And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> > > > I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?

It did indeed turn out to be.

The resource history on ha-idg-1 shows the last failed action as a migrate_to from ha-idg-1 to ha-idg-2, and the last successful action as a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker as to the current status of the migration.

A full migration is migrate_to on the source node, migrate_from on the target node, and stop on the source node. When the resource history has a failed migrate_to on the source, and a stop but no migrate_from on the target, the migration is considered "dangling" and forces a stop of the resource on the source, because it's possible the migrate_from never got a chance to be scheduled.

That is wrong in this situation. The resource is happily running on the node with the failed migrate_to because it was later moved back successfully, and the failed migrate_to is no longer relevant.

My current plan for a fix is that if a node with a failed migrate_to has a successful migrate_from or start that's newer, and the target node of the failed migrate_to has a successful stop, then the migration should not be considered dangling.

A couple of side notes on your configuration:

Instead of putting action=off in fence device configurations, you should use pcmk_reboot_action=off. Pacemaker adds action when sending the fence command.

When keeping a fence device off its target node, use a finite negative score rather than -INFINITY. This ensures the node can fence itself as a last resort.

> > > > I still have all the logs from that time, if you need information just let me know.
> > >
> > > Yes, the logs and pe-input files would be helpful. It sounds like a bug in the scheduler. What version of pacemaker are you running?
> >
> > Hi,
> >
> > here are the logs and some pe-input files: https://hmgubox.helmholtz-muenchen.de/d/f28f6961722f472eb649/
> > On 6 June at 15:41:28 I issued "crm node standby ha-idg-2", then the trouble began. I'm running pacemaker-1.1.19+20181105.ccd6b5b10-3.10.1.x86_64 on SLES 12 SP4 and kernel 4.12.14-95.13.
>
> Hi,
>
> the problem arose again. And what caught my attention: when I make a change in the configuration, e.g. some slight change to a resource, the domains are immediately started or stopped, depending on their previous state. The fence resource is not affected by this start/stop.
>
> Example (some changes to a stonith agent):
>
> Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_process_request: Forwarding cib_replace operation for section configuration to all (origin=local/crm_shadow/2)
> Jun 18 18:07:09 [9577] ha-idg-1 cib: info: __xml_diff_object: Moved nvpair@id (0 -> 2)
> Jun 18 18:07:09 [9577] ha-idg-1 cib: info: __xml_diff_object: Moved nvpair@name (1 -> 0)
> Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: Diff: --- 2.6990.1043 2
> Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 2.6991.0 6a5f09a19ae7d0a7bae55bddb9d1564f <= new epoch
> Jun 18 18:07:09 [9577] ha-idg-1 cib:
Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
- On Jun 15, 2019, at 4:30 PM, Bernd Lentes bernd.len...@helmholtz-muenchen.de wrote:
> - On 14 Jun 2019 at 21:20, kgaillot kgail...@redhat.com wrote:
>
> > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > Hi,
> > >
> > > I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> > > I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> > > And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> > > I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?
> > >
> > > I still have all the logs from that time, if you need information just let me know.
> >
> > Yes, the logs and pe-input files would be helpful. It sounds like a bug in the scheduler. What version of pacemaker are you running?
>
> Hi,
>
> here are the logs and some pe-input files: https://hmgubox.helmholtz-muenchen.de/d/f28f6961722f472eb649/
> On 6 June at 15:41:28 I issued "crm node standby ha-idg-2", then the trouble began. I'm running pacemaker-1.1.19+20181105.ccd6b5b10-3.10.1.x86_64 on SLES 12 SP4 and kernel 4.12.14-95.13.

Hi,

the problem arose again. And what caught my attention: when I make a change in the configuration, e.g. some slight change to a resource, the domains are immediately started or stopped, depending on their previous state. The fence resource is not affected by this start/stop.
Example (some changes to a stonith agent):

Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_process_request: Forwarding cib_replace operation for section configuration to all (origin=local/crm_shadow/2)
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: __xml_diff_object: Moved nvpair@id (0 -> 2)
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: __xml_diff_object: Moved nvpair@name (1 -> 0)
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: Diff: --- 2.6990.1043 2
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: Diff: +++ 2.6991.0 6a5f09a19ae7d0a7bae55bddb9d1564f <= new epoch
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes']/nvpair[@id='fence_ha-idg-2-instance_attributes-action']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes-0']/nvpair[@id='fence_ha-idg-2-instance_attributes-0-ipaddr']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes-1']/nvpair[@id='fence_ha-idg-2-instance_attributes-1-login']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes-2']/nvpair[@id='fence_ha-idg-2-instance_attributes-2-passwd']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes-3']/nvpair[@id='fence_ha-idg-2-instance_attributes-3-ssl_insecure']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: -- /cib/configuration/resources/primitive[@id='fence_ilo_ha-idg-2']/instance_attributes[@id='fence_ha-idg-2-instance_attributes-4']/nvpair[@id='fence_ha-idg-2-instance_attributes-4-delay']
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: + /cib: @epoch=6991, @num_updates=0
Jun 18 18:07:09 [9577] ha-idg-1 cib: info: cib_perform_op: ++
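Since even a small configuration edit was enough to trigger an immediate start/stop of the domains here, it can help to see what the scheduler would do with a change before committing it. A rough sketch, assuming crmsh and pacemaker 1.1-era tool options:

    # Dry-run the scheduler against the live CIB and show the actions it
    # would schedule right now, without executing anything
    crm_simulate -L -S

    # Changes can also be staged in a shadow CIB first (crmsh "cib" level,
    # e.g. inside an interactive crm session) and only committed once the
    # effect looks right:
    #   cib new test-change
    #   configure edit fence_ilo_ha-idg-2
    #   cib commit test-change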
Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
- On 14 Jun 2019 at 21:20, kgaillot kgail...@redhat.com wrote:

> On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > Hi,
> >
> > I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> > I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> > And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> > I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?
> >
> > I still have all the logs from that time, if you need information just let me know.
>
> Yes, the logs and pe-input files would be helpful. It sounds like a bug in the scheduler. What version of pacemaker are you running?

Hi,

here are the logs and some pe-input files: https://hmgubox.helmholtz-muenchen.de/d/f28f6961722f472eb649/
On 6 June at 15:41:28 I issued "crm node standby ha-idg-2", then the trouble began. I'm running pacemaker-1.1.19+20181105.ccd6b5b10-3.10.1.x86_64 on SLES 12 SP4 and kernel 4.12.14-95.13.

Bernd
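The pe-input files referenced above can be replayed offline to see exactly which actions the scheduler decided on, and with which scores. A rough sketch (crm_simulate options as in pacemaker 1.1; the file name is a placeholder):

    # Replay a saved scheduler input and show the resulting actions and
    # allocation scores; nothing touches the live cluster
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-1234.bz2

    # crm_simulate can also write a Graphviz .dot file of the transition
    # (see crm_simulate --help for the exact option on this build), which
    # is where the graphs viewed with dotty come from.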
Re: [ClusterLabs] two virtual domains start and stop every 15 minutes
On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> Hi,
>
> I had that problem once already, but it's still not clear to me what really happens. I had this problem some days ago: I have a 2-node cluster with several virtual domains as resources. I put one node (ha-idg-2) into standby, and two running virtual domains were migrated to the other node (ha-idg-1). The other virtual domains were already running on ha-idg-1. Since then, the two virtual domains which migrated (vm_idcc_devel and vm_severin) start or stop every 15 minutes on ha-idg-1. ha-idg-2 stays in standby.
> I know that the 15-minute interval is related to the "cluster-recheck-interval". But why are these two domains started and stopped? I looked around a lot in the logs, checked the pe-input files, watched some graphs created by crm_simulate with dotty ... I always see that the domains are started and 15 minutes later stopped and 15 minutes later started ... but I don't see WHY. I would really like to know that.
> And why are the domains not started by the monitor operation? It should recognize that the domain is stopped and start it again. My monitor interval is 30 seconds.
> I had two errors pending concerning these domains, a failed migrate from ha-idg-1 to ha-idg-2, from some time before. Could that be the culprit?
>
> I still have all the logs from that time, if you need information just let me know.

Yes, the logs and pe-input files would be helpful. It sounds like a bug in the scheduler. What version of pacemaker are you running?

> Thanks.
>
> Bernd

--
Ken Gaillot
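For orientation, the pieces involved in the 15-minute pattern look roughly like this in crm shell syntax: cluster-recheck-interval (15 minutes by default) is the timer that re-runs the scheduler, and each domain is a VirtualDomain resource with a 30-second monitor. The domain config path, timeouts and exact attributes below are illustrative assumptions, not the actual CIB from this cluster.

    # The periodic re-evaluation that kept re-scheduling the stop/start;
    # 15min is the default value
    crm configure property cluster-recheck-interval=15min

    # A live-migratable virtual domain with a 30s monitor, roughly as described
    crm configure primitive vm_severin ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/vm_severin.xml \
        meta allow-migrate=true \
        op monitor interval=30s timeout=60s \
        op migrate_to interval=0 timeout=300s \
        op migrate_from interval=0 timeout=300s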