Re: [ClusterLabs] two virtual domains start and stop every 15 minutes

2019-07-05 Thread Ken Gaillot
On Fri, 2019-07-05 at 13:07 +0200, Lentes, Bernd wrote:
> 
> - On Jul 4, 2019, at 1:25 AM, kgaillot kgail...@redhat.com wrote:
> 
> > On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> > > - On Jun 15, 2019, at 4:30 PM, Bernd Lentes
> > > bernd.len...@helmholtz-muenchen.de wrote:
> > > 
> > > > - On Jun 14, 2019, at 9:20 PM, kgaillot kgail...@redhat.com wrote:
> > > > 
> > > > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I had that problem once already, but it's still not clear
> > > > > > to me what really happens.
> > > > > > I had this problem some days ago:
> > > > > > I have a 2-node cluster with several virtual domains as
> > > > > > resources. I
> > > > > > put one node (ha-idg-2) into standby, and two running
> > > > > > virtual
> > > > > > domains
> > > > > > were migrated to the other node (ha-idg-1). The other
> > > > > > virtual
> > > > > > domains
> > > > > > were already running on ha-idg-1.
> > > > > > Since then, the two virtual domains which were migrated
> > > > > > (vm_idcc_devel and vm_severin) start and stop every 15
> > > > > > minutes on ha-idg-1. ha-idg-2 remains in standby.
> > > > > > I know that the 15-minute interval is related to the
> > > > > > "cluster-recheck-interval".
> > > > > > But why are these two domains started and stopped?
> > > > > > I looked around a lot in the logs, checked the pe-input
> > > > > > files, watched some graphs created by crm_simulate with
> > > > > > dotty ...
> > > > > > I always see that the domains are started, 15 minutes later
> > > > > > stopped, and 15 minutes later started ...
> > > > > > but I don't see WHY. I would really like to know that.
> > > > > > And why are the domains not started by the monitor resource
> > > > > > operation? It should recognize that the domain is stopped
> > > > > > and start it again. My monitor interval is 30 seconds.
> > > > > > I had two pending errors concerning these domains, a failed
> > > > > > migrate from ha-idg-1 to ha-idg-2, from some time before.
> > > > > > Could that be the culprit?
> > 
> > It did indeed turn out to be.
> > 
> > The resource history on ha-idg-1 shows the last failed action as a
> > migrate_to from ha-idg-1 to ha-idg-2, and the last successful
> > action as
> > a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker
> > as to
> > the current status of the migration.
> > 
> > A full migration is migrate_to on the source node, migrate_from on
> > the
> > target node, and stop on the source node. When the resource history
> > has
> > a failed migrate_to on the source, and a stop but no migrate_from
> > on
> > the target, the migration is considered "dangling" and forces a
> > stop of
> > the resource on the source, because it's possible the migrate_from
> > never got a chance to be scheduled.
> > 
> > That is wrong in this situation. The resource is happily running on
> > the
> > node with the failed migrate_to because it was later moved back
> > successfully, and the failed migrate_to is no longer relevant.
> > 
> > My current plan for a fix is that if a node with a failed
> > migrate_to
> > has a successful migrate_from or start that's newer, and the target
> > node of the failed migrate_to has a successful stop, then the
> > migration
> > should not be considered dangling.
> > 
> > A couple of side notes on your configuration:
> > 
> > Instead of putting action=off in fence device configurations, you
> > should use pcmk_reboot_action=off. Pacemaker adds action when
> > sending
> > the fence command.
> 
> I did that already.
>  
> > When keeping a fence device off its target node, use a finite
> > negative
> > score rather than -INFINITY. This ensures the node can fence itself
> > as
> > a last resort.
> 
> I will do that.
> 
> Thanks for clarifying this; it has happened quite often.
> I conclude that it's very important to clean up a resource failure
> quickly after finding the cause and solving the problem, so that no
> errors remain pending.

This is the first bug I can recall that was triggered by an old
failure, so I don't think it's important as a general policy outside of
live migrations.

I've got a fix I'll merge soon.
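
Roughly, the intended check looks like this (a simplified Python sketch of
the scheduler logic described above; the real code is in pacemaker's C
scheduler, and the helper names here are only illustrative):

  # Hypothetical sketch of the "dangling migration" handling described
  # above; not pacemaker's actual code or API.
  def migration_is_dangling(history, source, target):
      failed_migrate_to = history.last_failed(source, "migrate_to")
      if failed_migrate_to is None:
          return False

      # Proposed fix: if the source has a newer successful migrate_from
      # or start, and the target has a successful stop, the old failure
      # is no longer relevant, so don't treat the migration as dangling.
      newer_ok = history.last_successful(source, ["migrate_from", "start"])
      target_stop = history.last_successful(target, ["stop"])
      if (newer_ok is not None
              and newer_ok.time > failed_migrate_to.time
              and target_stop is not None):
          return False

      # Current behaviour: a failed migrate_to on the source plus a stop
      # but no migrate_from on the target means the migrate_from may never
      # have been scheduled, so force a stop of the resource on the source.
      no_migrate_from = history.last_successful(target, ["migrate_from"]) is None
      return no_migrate_from and target_stop is not None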

> 
> Bernd
>  
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

2019-07-05 Thread Jehan-Guillaume de Rorthais
On Thu, 4 Jul 2019 11:38:05 +0200
Tiemen Ruiten  wrote:

> Hello,
> 
> Yesterday, my three-node cluster (CentOS 7, PostgreSQL with the PAF
> resource agent) went down. For an as yet unknown reason, the master
> (ph-sql-04) did not report to the rest of the cluster and was fenced. (I'll
> take the advice given earlier and set up an rsyslog server now...)
> Unfortunately, the cluster failed to promote one of the slaves (ph-sql-03),
> so that node was fenced too. Then quorum was lost and the stop action for
> the pgsqld resource on the last node (ph-sql-05) was executed; although
> it timed out (see my earlier post on this list), the PostgreSQL daemon was
> eventually stopped, leaving all nodes down.
> 
> The error message on ph-sql-03 was:
> 
> pgsqlms(pgsqld)[5006]: Jul 03 19:32:38  ERROR: Can not get current node LSN
> location
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com   lrmd:
> notice: operation_finished: pgsqld_promote_0:5006:stderr [
> ocf-exit-reason:Can not get current node LSN location ]
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com   lrmd:
> info: log_finished: finished - rsc:pgsqld action:promote call_id:87
> pid:5006 exit-code:1 exec-time:237ms queue-time:0ms
> Jul 03 19:32:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com   crmd:
> notice: process_lrm_event: Result of promote operation for pgsqld on
> ph-sql-03: 1 (unknown error) | call=87 key=pgsqld_promote_0 confirmed=true
> cib-update=8309
> Jul 03 19:32:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com   crmd:
> notice: process_lrm_event: ph-sql-03-pgsqld_promote_0:87 [
> ocf-exit-reason:Can not get current node LSN location\n ]
> 
> I've seen some PAF GitHub issues that mention this error, but I'm not sure
> they apply to my situation. Is this a bug, or is there something wrong with
> my setup?

It seems to me the problem comes from here:

  Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com   crmd:   notice:
te_rsc_command: Initiating notify operation
pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
  Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com   lrmd:  warning:
operation_finished: pgsqld_notify_0:30939 - timed out after 6ms

and here:

  Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com   lrmd: info:
log_execute:executing - rsc:pgsqld action:notify call_id:38
pgsqlms(pgsqld)[20881]: 
  Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com   lrmd:
warning: operation_finished:pgsqld_notify_0:20881 - timed out after
6ms

The pgsql election occurs during the pre-promote action, where all remaining
nodes set their LSN location. During the promote action, the designated
primary checks that its LSN location is the highest one. If it is not, it
simply cancels the promotion so that the next round will elect the best
candidate.
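
For reference, a rough Python sketch of that election (pgsqlms itself is
written in Perl; every name below is illustrative, not the agent's real API):

  # Hypothetical sketch of the PAF election described above; the real
  # agent stores LSN locations as transient node attributes via attrd.
  def pre_promote(node, cluster):
      # Each remaining node records its current LSN location so the
      # designated primary can compare against it during promote.
      cluster.set_attribute(node.name, "lsn_location", node.current_lsn())

  def promote(node, cluster):
      lsns = cluster.get_all_attributes("lsn_location")
      my_lsn = lsns.get(node.name)
      if my_lsn is None:
          # No LSN was recorded (for example because pre-promote timed
          # out), so the promote fails, as seen in the logs above.
          raise RuntimeError("Can not get current node LSN location")
      if my_lsn < max(lsns.values()):
          # Another standby is further ahead: cancel this promotion so
          # the next scheduling round elects the best candidate.
          return False
      return True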

Back to your issue. According to the logs, both standby nodes timed out during
the pre-promote action, so no LSN location was set anywhere in the cluster.
I couldn't see any messages from the attrd daemon related to the lsn_location
attribute or to other cleanup actions.

I couldn't even find the INFO message from pgsqlms giving its current status
before actually setting it ("Current node TL#LSN: %s"). But this message
appears soon after a...checkpoint. See:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017

Could very long checkpoints on both nodes have caused the pre-promote action
to time out? Do you have PostgreSQL logs with any useful information from
around that time?


Re: [ClusterLabs] two virtual domains start and stop every 15 minutes

2019-07-05 Thread Lentes, Bernd



- On Jul 4, 2019, at 1:25 AM, kgaillot kgail...@redhat.com wrote:

> On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
>> - On Jun 15, 2019, at 4:30 PM, Bernd Lentes
>> bernd.len...@helmholtz-muenchen.de wrote:
>> 
>> > - On Jun 14, 2019, at 9:20 PM, kgaillot kgail...@redhat.com wrote:
>> > 
>> > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
>> > > > Hi,
>> > > > 
>> > > > I had that problem once already, but it's still not clear to me
>> > > > what really happens.
>> > > > I had this problem some days ago:
>> > > > I have a 2-node cluster with several virtual domains as
>> > > > resources. I
>> > > > put one node (ha-idg-2) into standby, and two running virtual
>> > > > domains
>> > > > were migrated to the other node (ha-idg-1). The other virtual
>> > > > domains
>> > > > were already running on ha-idg-1.
>> > > > Since then, the two virtual domains which were migrated
>> > > > (vm_idcc_devel and vm_severin) start and stop every 15 minutes
>> > > > on ha-idg-1. ha-idg-2 remains in standby.
>> > > > I know that the 15-minute interval is related to the
>> > > > "cluster-recheck-interval".
>> > > > But why are these two domains started and stopped?
>> > > > I looked around a lot in the logs, checked the pe-input files,
>> > > > watched some graphs created by crm_simulate with dotty ...
>> > > > I always see that the domains are started, 15 minutes later
>> > > > stopped, and 15 minutes later started ...
>> > > > but I don't see WHY. I would really like to know that.
>> > > > And why are the domains not started by the monitor resource
>> > > > operation? It should recognize that the domain is stopped and
>> > > > start it again. My monitor interval is 30 seconds.
>> > > > I had two pending errors concerning these domains, a failed
>> > > > migrate from ha-idg-1 to ha-idg-2, from some time before.
>> > > > Could that be the culprit?
> 
> It did indeed turn out to be.
> 
> The resource history on ha-idg-1 shows the last failed action as a
> migrate_to from ha-idg-1 to ha-idg-2, and the last successful action as
> a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker as to
> the current status of the migration.
> 
> A full migration is migrate_to on the source node, migrate_from on the
> target node, and stop on the source node. When the resource history has
> a failed migrate_to on the source, and a stop but no migrate_from on
> the target, the migration is considered "dangling" and forces a stop of
> the resource on the source, because it's possible the migrate_from
> never got a chance to be scheduled.
> 
> That is wrong in this situation. The resource is happily running on the
> node with the failed migrate_to because it was later moved back
> successfully, and the failed migrate_to is no longer relevant.
> 
> My current plan for a fix is that if a node with a failed migrate_to
> has a successful migrate_from or start that's newer, and the target
> node of the failed migrate_to has a successful stop, then the migration
> should not be considered dangling.
> 
> A couple of side notes on your configuration:
> 
> Instead of putting action=off in fence device configurations, you
> should use pcmk_reboot_action=off. Pacemaker adds action when sending
> the fence command.

I did that already.
 
> When keeping a fence device off its target node, use a finite negative
> score rather than -INFINITY. This ensures the node can fence itself as
> a last resort.
I will do that.

Thanks for clarifying this; it has happened quite often.
I conclude that it's very important to clean up a resource failure quickly
after finding the cause and solving the problem, so that no errors remain
pending.

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, 
Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/