Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-11 Thread Danka Ivanovic
We tried to fix the LDAP issue with the nss_initgroups_ignoreusers option in
nslcd.conf for the postgres and hacluster users, so the cluster shouldn't contact
the LDAP server every 15 seconds when it checks the database with the postgres user:
/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432
We have two LDAP servers, and when one was unavailable, the cluster failed
immediately due to a timeout, even though it could reach the other LDAP server.
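A quick way to check whether the change is effective (a sketch; it assumes nslcd
is the LDAP client daemon in use, as above) is to run the same lookups the monitor
triggers and see whether they still reach LDAP:

  # initgroups/group lookups for these local users should now be answered locally
  getent initgroups postgres
  getent initgroups hacluster

  # run nslcd in debug mode (nslcd -d) in another terminal while testing;
  # no LDAP requests should show up for these users any more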
I know starting the master database with systemctl should be avoided, but I
didn't find a way to start it with Pacemaker. I will test again, but I am
out of ideas, because I have already tried different pgsqlms options and
different versions of PostgreSQL.
But now it looks like something else happened.

On Wed, Jul 10, 2019 at 4:57 PM Jehan-Guillaume de Rorthais 
wrote:

> On Wed, 10 Jul 2019 16:34:17 +0200
> Danka Ivanovic  wrote:
>
> > Hi, Thank you all for responding so quickly. Part of corosync.log file is
> > attached. Cluster failure occurred at 09:16 AM yesterday.
> > Debug mode is turned on in corosync configuration, but I didn't turn it
> on
> > in pacemaker config. I will test that.
>
> There's really nothing interesting in there sadly. It could even be like
> pgsqlms hadn't been called at all and the action timed out...
>
> > Postgres log is also attached.
>
> Nothing really relevant there either.
>
> > Several times cluster failed because of ldap time out, even if I tried to
> > disable ldap searching for local postgres user,
>
> This is really annoying. IIRC, this was already happening last time. Fix
> this
> first if you haven't yet?
>
> ...
> > From syslog it looks like postgres systemd process was
> > stopped,
>
> Again, systemd shouldn't take part in anything in your cluster regarding
> PostgreSQL.
> If Pacemaker manages PostgreSQL, systemd should have nothing to do with it.
>
> If you really need to start/stop it by hand (I really discourage you to
> do so), do it using pg_ctl. And make sure to unmanage the Pacemaker
> resource
> before.
>
> > On Tue, 9 Jul 2019 19:57:06 +0300
> > > Andrei Borzenkov  wrote:
> > >
> > > > 09.07.2019 13:08, Danka Ivanović пишет:
> > > > > Hi I didn't manage to start master with postgres, even if I
> increased
> > > start
> > > > > timeout. I checked executable paths and start options.
> > >
> > > We would require much more logs from this failure...
> > >
> > > > > When cluster is running with manually started master and slave
> started
> > > over
> > > > > pacemaker, everything works ok.
> > >
> > > Logs from this scenario might be interesting as well to check and
> compare.
> > >
> > > > > Today we had failover again.
> > > > > I cannot find reason from the logs, can you help me with
> debugging?
> > > Thanks.
> > >
> > > logs logs logs please.
> > >
> > > > > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > > > > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > > > > postgres1   lrmd:  warning: child_timeout_callback:
> > > > > PGSQL_monitor_15000 process (PID 12735) timed out
> > > >
> > > > You probably want to enable debug output in resource agent. As far
> as I
> > > > can tell, this requires HA_debug=1 in environment of resource agent,
> but
> > > > for the life of me I cannot find where it is possible to set it.
> > > >
> > > > Probably setting it directly in resource agent for debugging is the
> most
> > > > simple way.
> > >
> > > I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> > > to pgsqlms, interesting.
> > >
> > > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> > > > result of original resource probing which makes it confusing. At
> least
> > > > it explains where these logs entries come from.
> > >
> > > Not sure I understand what you mean :/
> > >
>
>
>
> --
> Jehan-Guillaume de Rorthais
> Dalibo
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 17:25:57 +0200
Danka Ivanovic  wrote:
...
> I know starting the master database with systemctl should be avoided, but I
> didn't find a way to start it with Pacemaker. I will test again, but I am
> out of ideas.

Put the cluster in debug mode and provide the full logs + pacemaker conf +
pgsql confs.

It will certainly help us understand.
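
For what it is worth, a minimal sketch of what "debug mode" could look like here,
assuming the Debian/Ubuntu layout used in this thread (the Pacemaker environment
file is /etc/default/pacemaker there, /etc/sysconfig/pacemaker on RHEL-like
systems):

  # /etc/default/pacemaker
  PCMK_debug=yes                       # debug logging from the Pacemaker daemons
  PCMK_logfile=/var/log/pacemaker.log

  # corosync.conf, logging section (corosync debug is reportedly already on)
  logging {
      to_logfile: yes
      logfile: /var/log/corosync/corosync.log
      debug: on
  }

  # restart the stack on each node afterwards, e.g.
  # systemctl restart corosync pacemaker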

> On Wed, Jul 10, 2019 at 4:57 PM Jehan-Guillaume de Rorthais 
> wrote:
> 
> > On Wed, 10 Jul 2019 16:34:17 +0200
> > Danka Ivanovic  wrote:
> >  
> > > Hi, Thank you all for responding so quickly. Part of corosync.log file is
> > > attached. Cluster failure occurred at 09:16 AM yesterday.
> > > Debug mode is turned on in corosync configuration, but I didn't turn it  
> > on  
> > > in pacemaker config. I will test that.  
> >
> > There's really nothing interesting in there sadly. It could even be like
> > pgsqlms hadn't been called at all and the action timed out...
> >  
> > > Postgres log is also attached.  
> >
> > Nothing really relevant there either.
> >  
> > > Several times cluster failed because of ldap time out, even if I tried to
> > > disable ldap searching for local postgres user,  
> >
> > This is really annoying. IIRC, this was already happening last time. Fix
> > this
> > first if you haven't yet?
> >
> > ...  
> > > From syslog it looks like postgres systemd process was
> > > stopped,
> >
> > Again, systemd shouldn't take part in anything in your cluster regarding
> > PostgreSQL.
> > If Pacemaker manages PostgreSQL, systemd should have nothing to do with it.
> >
> > If you really need to start/stop it by hand (I really discourage you to
> > do so), do it using pg_ctl. And make sure to unmanage the Pacemaker
> > resource
> > before.
> >  
> > > On Tue, 9 Jul 2019 19:57:06 +0300  
> > > > Andrei Borzenkov  wrote:
> > > >  
> > > > > 09.07.2019 13:08, Danka Ivanović пишет:  
> > > > > > Hi I didn't manage to start master with postgres, even if I  
> > increased  
> > > > start  
> > > > > > timeout. I checked executable paths and start options.  
> > > >
> > > > We would require much more logs from this failure...
> > > >  
> > > > > > When cluster is running with manually started master and slave  
> > started  
> > > > over  
> > > > > > pacemaker, everything works ok.  
> > > >
> > > > Logs from this scenario might be interesting as well to check and  
> > compare.  
> > > >  
> > > > > > Today we had failover again.
> > > > > > I cannot find reason from the logs, can you help me with  
> > debugging?  
> > > > Thanks.
> > > >
> > > > logs logs logs please.
> > > >  
> > > > > > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > > > > > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > > > > > postgres1   lrmd:  warning: child_timeout_callback:
> > > > > > PGSQL_monitor_15000 process (PID 12735) timed out  
> > > > >
> > > > > You probably want to enable debug output in resource agent. As far  
> > as I  
> > > > > can tell, this requires HA_debug=1 in environment of resource agent,  
> > but  
> > > > > for the life of me I cannot find where it is possible to set it.
> > > > >
> > > > > Probably setting it directly in resource agent for debugging is the  
> > most  
> > > > > simple way.  
> > > >
> > > > I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> > > > to pgsqlms, interesting.
> > > >  
> > > > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> > > > > result of original resource probing which makes it confusing. At  
> > least  
> > > > > it explains where these logs entries come from.  
> > > >
> > > > Not sure I understand what you mean :/
> > > >  
> >
> >
> >
> > --
> > Jehan-Guillaume de Rorthais
> > Dalibo  



-- 
Jehan-Guillaume de Rorthais
Dalibo
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 16:34:17 +0200
Danka Ivanovic  wrote:

> Hi, Thank you all for responding so quickly. Part of corosync.log file is
> attached. Cluster failure occurred at 09:16 AM yesterday.
> Debug mode is turned on in corosync configuration, but I didn't turn it on
> in pacemaker config. I will test that.

There's really nothing interesting in there, sadly. It could even be that
pgsqlms wasn't called at all and the action timed out...

> Postgres log is also attached.

Nothing really relevant there either.

> Several times cluster failed because of ldap time out, even if I tried to
> disable ldap searching for local postgres user,

This is really annoying. IIRC, this was already happening last time. Fix this
first if you haven't yet?

...
> From syslog it looks like postgres systemd process was
> stopped,

Again, systemd shouldn't take part in anything in your cluster regarding PostgreSQL.
If Pacemaker manages PostgreSQL, systemd should have nothing to do with it.

If you really need to start/stop it by hand (I really discourage you to
do so), do it using pg_ctl. And make sure to unmanage the Pacemaker resource
beforehand.
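
A minimal sketch of that sequence, using the resource name and paths from the
configuration posted elsewhere in this thread (crmsh syntax):

  # let Pacemaker stop managing the resource first
  crm resource unmanage PGSQL-HA

  # start or stop PostgreSQL by hand with pg_ctl, as the postgres user
  sudo -iu postgres /usr/lib/postgresql/9.5/bin/pg_ctl \
      --pgdata /var/lib/postgresql/9.5/main \
      -o '-c config_file=/etc/postgresql/9.5/main/postgresql.conf' start

  # when done, hand control back to the cluster and clear any stale failures
  crm resource manage PGSQL-HA
  crm resource cleanup PGSQL-HA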

> On Tue, 9 Jul 2019 19:57:06 +0300
> > Andrei Borzenkov  wrote:
> >  
> > > 09.07.2019 13:08, Danka Ivanović пишет:  
> > > > Hi I didn't manage to start master with postgres, even if I increased  
> > start  
> > > > timeout. I checked executable paths and start options.  
> >
> > We would require much more logs from this failure...
> >  
> > > > When cluster is running with manually started master and slave started  
> > over  
> > > > pacemaker, everything works ok.  
> >
> > Logs from this scenario might be interesting as well to check and compare.
> >  
> > > > Today we had failover again.
> > > > I cannot find reason from the logs, can you help me with debugging?  
> > Thanks.
> >
> > logs logs logs please.
> >  
> > > > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > > > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > > > postgres1   lrmd:  warning: child_timeout_callback:
> > > > PGSQL_monitor_15000 process (PID 12735) timed out  
> > >
> > > You probably want to enable debug output in resource agent. As far as I
> > > can tell, this requires HA_debug=1 in environment of resource agent, but
> > > for the life of me I cannot find where it is possible to set it.
> > >
> > > Probably setting it directly in resource agent for debugging is the most
> > > simple way.  
> >
> > I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> > to pgsqlms, interesting.
> >  
> > > P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> > > result of original resource probing which makes it confusing. At least
> > > it explains where these logs entries come from.  
> >
> > Not sure I understand what you mean :/
> >  



-- 
Jehan-Guillaume de Rorthais
Dalibo
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Wed, 10 Jul 2019 12:53:59 +0300
Andrei Borzenkov  wrote:

> On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais
>  wrote:
> 
> >  
> > > > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > > > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > > > postgres1   lrmd:  warning: child_timeout_callback:
> > > > PGSQL_monitor_15000 process (PID 12735) timed out  
> > >
> > > You probably want to enable debug output in resource agent. As far as I
> > > can tell, this requires HA_debug=1 in environment of resource agent, but
> > > for the life of me I cannot find where it is possible to set it.
> > >
> > > Probably setting it directly in resource agent for debugging is the most
> > > simple way.  
> >
> > I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> > to pgsqlms, interesting.  
> 
> As far as I understand it will set it for every process spawned by
> pacemaker which may be too much (it would enable debug output in every
> resource agent for every resource).

Indeed, it does.

> Some generic mean to set it for
> specific resource only may be useful for targeted troubleshooting.

This would be useful.

I had a quick look in resource-agents and saw no RA setting HA_debug
itself based on some reloadable parameter. Is it possible?
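
For what it is worth, a minimal sketch of how a shell-based agent could do it,
using a hypothetical "debug" resource parameter (no shipped agent does this, as
noted above):

  # near the top of the agent, after sourcing $OCF_FUNCTIONS_DIR/ocf-shellfuncs
  : ${OCF_RESKEY_debug:=false}
  if ocf_is_true "$OCF_RESKEY_debug"; then
      # the OCF shell helpers only emit debug messages when HA_debug is set
      HA_debug=1
      export HA_debug
  fi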
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais
 wrote:

>
> > > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > > postgres1   lrmd:  warning: child_timeout_callback:
> > > PGSQL_monitor_15000 process (PID 12735) timed out
> >
> > You probably want to enable debug output in resource agent. As far as I
> > can tell, this requires HA_debug=1 in environment of resource agent, but
> > for the life of me I cannot find where it is possible to set it.
> >
> > Probably setting it directly in resource agent for debugging is the most
> > simple way.
>
> I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> to pgsqlms, interesting.

As far as I understand, it will set it for every process spawned by
Pacemaker, which may be too much (it would enable debug output in every
resource agent for every resource). Some generic means to set it for a
specific resource only may be useful for targeted troubleshooting.

Today one could also simply set the environment variable in the systemd unit
definition, but that will have the same global effect.
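
For completeness, a minimal sketch of that systemd approach, assuming the unit is
named pacemaker.service (and keeping in mind the global effect described above):

  # systemctl edit pacemaker   (creates a drop-in override)
  [Service]
  Environment=HA_debug=1

  # then reload and restart:
  # systemctl daemon-reload && systemctl restart pacemaker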
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Andrei Borzenkov
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais
 wrote:
>
> > P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> > result of original resource probing which makes it confusing. At least
> > it explains where these logs entries come from.
>
> Not sure tu understand what you mean :/

I probably mixed up with another thread where it was unclear where
crm_resource debug output originated from. Sorry.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-10 Thread Jehan-Guillaume de Rorthais
On Tue, 9 Jul 2019 19:57:06 +0300
Andrei Borzenkov  wrote:

> 09.07.2019 13:08, Danka Ivanović пишет:
> > Hi I didn't manage to start master with postgres, even if I increased start
> > timeout. I checked executable paths and start options.

We would require much more logs from this failure...

> > When cluster is running with manually started master and slave started over
> > pacemaker, everything works ok.

Logs from this scenario might be interesting as well to check and compare.

> > Today we had failover again.
> > I cannot find reason from the logs, can you help me with debugging? Thanks.

logs logs logs please.

> > Jul 09 09:16:32 [2679] postgres1   lrmd:debug:
> > child_kill_helper:  Kill pid 12735's group Jul 09 09:16:34 [2679]
> > postgres1   lrmd:  warning: child_timeout_callback:
> > PGSQL_monitor_15000 process (PID 12735) timed out  
> 
> You probably want to enable debug output in resource agent. As far as I
> can tell, this requires HA_debug=1 in environment of resource agent, but
> for the life of me I cannot find where it is possible to set it.
> 
> Probably setting it directly in resource agent for debugging is the most
> simple way.

I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
to pgsqlms, interesting.
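
For reference, a minimal sketch of that file (the sysconfig path is the RHEL-like
location; on Debian/Ubuntu the equivalent file is /etc/default/pacemaker):

  # /etc/sysconfig/pacemaker
  # debug output from resource agents that rely on the OCF logging helpers
  HA_debug=1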

> P.S. crm_resource is called by resource agent (pgsqlms). And it shows
> result of original resource probing which makes it confusing. At least
> it explains where these logs entries come from.

Not sure I understand what you mean :/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-07-09 Thread Andrei Borzenkov
09.07.2019 13:08, Danka Ivanović пишет:
> Hi I didn't manage to start master with postgres, even if I increased start
> timeout. I checked executable paths and start options.
> When cluster is running with manually started master and slave started over
> pacemaker, everything works ok. Today we had failover again.
> I cannot find reason from the logs, can you help me with debugging? Thanks.
> 

> Jul 09 09:16:32 [2679] postgres1   lrmd:debug: child_kill_helper: 
> Kill pid 12735's group
> Jul 09 09:16:34 [2679] postgres1   lrmd:  warning: 
> child_timeout_callback:PGSQL_monitor_15000 process (PID 12735) timed 
> out

You probably want to enable debug output in the resource agent. As far as I
can tell, this requires HA_debug=1 in the environment of the resource agent, but
for the life of me I cannot find where it is possible to set it.

Probably setting it directly in the resource agent for debugging is the
simplest way.

P.S. crm_resource is called by the resource agent (pgsqlms), and it shows the
result of the original resource probing, which makes it confusing. At least
it explains where these log entries come from.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-16 Thread Ken Gaillot
On Thu, 2019-05-16 at 10:20 +0200, Jehan-Guillaume de Rorthais wrote:
> On Wed, 15 May 2019 16:53:48 -0500
> Ken Gaillot  wrote:
> 
> > On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Mon, 29 Apr 2019 19:59:49 +0300
> > > Andrei Borzenkov  wrote:
> > >   
> > > > 29.04.2019 18:05, Ken Gaillot пишет:  
> > > > > >
> > > > > > > Why does not it check OCF_RESKEY_CRM_meta_notify?
> > > > > > 
> > > > > > I was just not aware of this env variable. Sadly, it is not
> > > > > > documented
> > > > > > anywhere :(
> > > > > 
> > > > > It's not a Pacemaker-created value like the other notify
> > > > > variables --
> > > > > all user-specified meta-attributes are passed that way. We do
> > > > > need to
> > > > > document that.
> > > > 
> > > > OCF_RESKEY_CRM_meta_notify is passed also when "notify" meta-
> > > > attribute
> > > > is *not* specified, as well as a couple of others. But not
> > > > all   
> > 
> > Hopefully in that case it's passed as false? I vaguely remember
> > some
> > case where clone attributes were mistakenly passed to non-clone
> > resources, but I think notify is always accurate for clone
> > resources.
> 
> [1]
> 
> > > > possible
> > > > attributes. And some OCF_RESKEY_CRM_meta_* variables that are
> > > > passed do
> > > > not correspond to any user settable and documented meta-
> > > > attribute,
> > > > like
> > > > OCF_RESKEY_CRM_meta_clone.  
> > > 
> > > Sorry guys, now I am confused.  
> > 
> > A well-known side effect of pacemaker ;)
> > 
> > > Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You both don't
> > > seem to
> > > agree where it comes from. Is it only an unexpected side effect
> > > or
> > > is it a safe
> > > and stable code path in Pacemaker we can rely on?
> > 
> > It's reliable. All user-specified meta-attributes end up as
> > environment
> > variables 
> 
> OK...
> 
> > -- it's just meta-attributes that *aren't* specified by the
> > user that may or may not show up
> 
> OK...
> 
> > (but hopefully with the correct value).
> 
> And that's where I am now losing some confidence in these
> environment variables :)
> "Hopefully" and "I think is accurate" ([1]) are quite scary to me :/

It looks perfectly reliable to me :) but Andrei's comments make me want
more information.

If I understand correctly, he's saying that the presence of the notify
variable is unreliable. That's fine if the option is not specified by
the user, and the variable is either not present or present as false.
But it would indicate a bug if the variable is not present when the
option *is* specified by the user, or if the variable is present as
true when the option is not specified by the user.

Personally I'd rely on it.

The controller gets the environment variable values from the
 entries in the scheduler's result. We have numerous
examples in the scheduler regression test data, typically installed
under /usr/share/pacemaker/tests in scheduler/*.exp (for 2.0) or
pengine/test10/*.exp (for 1.1).
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-16 Thread Jehan-Guillaume de Rorthais
On Wed, 15 May 2019 16:53:48 -0500
Ken Gaillot  wrote:

> On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Mon, 29 Apr 2019 19:59:49 +0300
> > Andrei Borzenkov  wrote:
> >   
> > > 29.04.2019 18:05, Ken Gaillot пишет:  
> > > > >
> > > > > > Why does not it check OCF_RESKEY_CRM_meta_notify?
> > > > > 
> > > > > I was just not aware of this env variable. Sadly, it is not
> > > > > documented
> > > > > anywhere :(
> > > > 
> > > > It's not a Pacemaker-created value like the other notify
> > > > variables --
> > > > all user-specified meta-attributes are passed that way. We do
> > > > need to
> > > > document that.
> > > 
> > > OCF_RESKEY_CRM_meta_notify is passed also when "notify" meta-
> > > attribute
> > > is *not* specified, as well as a couple of others. But not all   
> 
> Hopefully in that case it's passed as false? I vaguely remember some
> case where clone attributes were mistakenly passed to non-clone
> resources, but I think notify is always accurate for clone resources.

[1]

> > > possible
> > > attributes. And some OCF_RESKEY_CRM_meta_* variables that are
> > > passed do
> > > not correspond to any user settable and documented meta-attribute,
> > > like
> > > OCF_RESKEY_CRM_meta_clone.  
> > 
> > Sorry guys, now I am confused.  
> 
> A well-known side effect of pacemaker ;)
> 
> > Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You both don't
> > seem to
> > agree where it comes from. Is it only an unexpected side effect or
> > is it a safe
> > and stable code path in Pacemaker we can rely on?
> 
> It's reliable. All user-specified meta-attributes end up as environment
> variables 

OK...

>-- it's just meta-attributes that *aren't* specified by the
> user that may or may not show up

OK...

> (but hopefully with the correct value).

And that's where I am now losing some confidence in these environment variables :)
"Hopefully" and "I think is accurate" ([1]) are quite scary to me :/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Ken Gaillot
On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Apr 2019 19:59:49 +0300
> Andrei Borzenkov  wrote:
> 
> > 29.04.2019 18:05, Ken Gaillot пишет:
> > > >  
> > > > > Why does not it check OCF_RESKEY_CRM_meta_notify?  
> > > > 
> > > > I was just not aware of this env variable. Sadly, it is not
> > > > documented
> > > > anywhere :(  
> > > 
> > > It's not a Pacemaker-created value like the other notify
> > > variables --
> > > all user-specified meta-attributes are passed that way. We do
> > > need to
> > > document that.  
> > 
> > OCF_RESKEY_CRM_meta_notify is passed also when "notify" meta-
> > attribute
> > is *not* specified, as well as a couple of others. But not all 

Hopefully in that case it's passed as false? I vaguely remember some
case where clone attributes were mistakenly passed to non-clone
resources, but I think notify is always accurate for clone resources.

> > possible
> > attributes. And some OCF_RESKEY_CRM_meta_* variables that are
> > passed do
> > not correspond to any user settable and documented meta-attribute,
> > like
> > OCF_RESKEY_CRM_meta_clone.
> 
> Sorry guys, now I am confused.

A well-known side effect of pacemaker ;)

> Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You both don't
> seem to
> agree where it comes from. Is it only an unexpected side effect or
> is it a safe
> and stable code path in Pacemaker we can rely on?

It's reliable. All user-specified meta-attributes end up as environment
variables -- it's just meta-attributes that *aren't* specified by the
user that may or may not show up (but hopefully with the correct
value).

> 
> Is it worth a patch in the pgsqlms RA?
> 
> Thanks,
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Jehan-Guillaume de Rorthais
On Mon, 29 Apr 2019 19:59:49 +0300
Andrei Borzenkov  wrote:

> 29.04.2019 18:05, Ken Gaillot пишет:
> >>  
> >>> Why does not it check OCF_RESKEY_CRM_meta_notify?  
> >>
> >> I was just not aware of this env variable. Sadly, it is not
> >> documented
> >> anywhere :(  
> > 
> > It's not a Pacemaker-created value like the other notify variables --
> > all user-specified meta-attributes are passed that way. We do need to
> > document that.  
> 
> OCF_RESKEY_CRM_meta_notify is passed also when "notify" meta-attribute
> is *not* specified, as well as a couple of others. But not all possible
> attributes. And some OCF_RESKEY_CRM_meta_* variables that are passed do
> not correspond to any user settable and documented meta-attribute, like
> OCF_RESKEY_CRM_meta_clone.

Sorry guys, now I am confused.

Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You both don't seem to
agree where it comes from. Is it only an unexpected side effect, or is it a safe
and stable code path in Pacemaker we can rely on?

Is it worth a patch in the pgsqlms RA?

Thanks,
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-05-15 Thread Jehan-Guillaume de Rorthais
On Tue, 30 Apr 2019 17:28:44 +0200
Danka Ivanović  wrote:

> Hi, I tried new clean config with upgraded postgres and corosync and
> pacemaker packages.


In this attempt, your PostgreSQL resource timed out while starting up:

  Apr 30 15:09:43 [13342] master   lrmd:debug: operation_finished:
PGSQL_start_0:13864:stdout [FATAL:  the database system is starting
up ]
  Apr 30 15:09:43 [13342] master   lrmd: info: log_finished:
finished - rsc:PGSQL action:start call_id:21 pid:13864 exit-code:1
exec-time:60003ms queue-time:0ms
  Apr 30 15:09:43 [13345] master   crmd:debug:
create_operation_update:do_update_resource: Updating resource PGSQL
after start op Timed Out (interval=0)

I suppose your local instance had many WAL segments to replay before becoming
consistent and accepting connections, and the 60s start timeout wasn't enough.
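If that is the cause, one option is to raise the start timeout on the primitive,
for example (a sketch in the crmsh syntax used in this thread; the right value
depends on how much WAL a restarted instance may have to replay):

  # crm configure edit PGSQL, then change the start operation from
  #   op start timeout=60s interval=0 \
  # to something larger, e.g.
  #   op start timeout=300s interval=0 \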

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-29 Thread Andrei Borzenkov
29.04.2019 18:05, Ken Gaillot пишет:
>>
>>> Why does not it check OCF_RESKEY_CRM_meta_notify?
>>
>> I was just not aware of this env variable. Sadly, it is not
>> documented
>> anywhere :(
> 
> It's not a Pacemaker-created value like the other notify variables --
> all user-specified meta-attributes are passed that way. We do need to
> document that.

OCF_RESKEY_CRM_meta_notify is also passed when the "notify" meta-attribute
is *not* specified, as are a couple of others, but not all possible
attributes. And some OCF_RESKEY_CRM_meta_* variables that are passed do
not correspond to any user-settable and documented meta-attribute, like
OCF_RESKEY_CRM_meta_clone.

Yes, this needs documentation indeed ...
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-29 Thread Ken Gaillot
On Sun, 2019-04-28 at 00:27 +0200, Jehan-Guillaume de Rorthais wrote:
> On Sat, 27 Apr 2019 09:15:29 +0300
> Andrei Borzenkov  wrote:
> 
> > 27.04.2019 1:04, Danka Ivanović пишет:
> > > Hi, here is a complete cluster configuration:
> > > 
> > > node 1: master
> > > node 2: secondary
> > > primitive AWSVIP awsvip \
> > > params secondary_private_ip=10.x.x.x api_delay=5
> > > primitive PGSQL pgsqlms \
> > > params pgdata="/var/lib/postgresql/9.5/main"
> > > bindir="/usr/lib/postgresql/9.5/bin"
> > > pghost="/var/run/postgresql/"
> > > recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
> > > start_opts="-c
> > > config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> > > op start timeout=60s interval=0 \
> > > op stop timeout=60s interval=0 \
> > > op promote timeout=15s interval=0 \
> > > op demote timeout=120s interval=0 \
> > > op monitor interval=15s timeout=10s role=Master \
> > > op monitor interval=16s timeout=10s role=Slave \
> > > op notify timeout=60 interval=0
> > > primitive fencing-postgres-ha-2 stonith:external/ec2 \
> > > params port=master \
> > > op start interval=0s timeout=60s \
> > > op monitor interval=360s timeout=60s \
> > > op stop interval=0s timeout=60s
> > > primitive fencing-test-rsyslog stonith:external/ec2 \
> > > params port=secondary \
> > > op start interval=0s timeout=60s \
> > > op monitor interval=360s timeout=60s \
> > > op stop interval=0s timeout=60s
> > > ms PGSQL-HA PGSQL \
> > > meta notify=true
> > > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> > > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote
> > > AWSVIP:stop
> > > symmetrical=false
> > > location loc-fence-master fencing-postgres-ha-2 -inf: master
> > > location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> > > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote
> > > AWSVIP:start
> > > symmetrical=false
> > > property cib-bootstrap-options: \
> > > have-watchdog=false \
> > > dc-version=1.1.14-70404b0 \
> > > cluster-infrastructure=corosync \
> > > cluster-name=psql-ha \
> > > stonith-enabled=true \
> > > no-quorum-policy=ignore \
> > > last-lrm-refresh=1556315444 \
> > > maintenance-mode=false
> > > rsc_defaults rsc-options: \
> > > resource-stickiness=10 \
> > > migration-threshold=2
> > > 
> > > I tried to start manually postgres to be sure it is ok. There are
> > > no error
> > > in postgres log. I also tried with different meta parameters, but
> > > always
> > > with notify=true.
> > > I also tried this:
> > > ms PGSQL-HA PGSQL \
> > > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> > > notify=true interleave=true
> > > I have followed this link:
> > > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> > > When stonith is enabled and working I imported all other
> > > resources and
> > > constraints all together in the same time.
> > > 
> > > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais <
> > > j...@dalibo.com>
> > > wrote:
> > >   
> > > > Hi,
> > > > 
> > > > On Thu, 25 Apr 2019 18:57:55 +0200
> > > > Danka Ivanović  wrote:
> > > >  
> > > > > Apr 25 16:39:50 [4213] master   lrmd:   notice:
> > > > > operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-
> > > > > reason:You
> > > > > must set meta parameter notify=true for your master resource
> > > > > ]  
> > > > 
> > > > Resource agent pgsqlms refuse to start PgSQL because your
> > > > configuration
> > > > lacks
> > > > the "notify=true" attribute in your master definition.
> > > >  
> > 
> > PAF pgsqlms contains:
> > 
> > # check notify=true
> > $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
> >  --meta --get-parameter notify 2>/dev/null };
> > chomp $ans;
> > unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
> > ocf_exit_reason(
> > 'You must set meta parameter notify=true for your
> > master
> > resource'
> > );
> > exit $OCF_ERR_INSTALLED;
> > }
> > 
> > but that is wrong - "notify" is set on ms definition, while
> > $OCF_RESOURCE_INSTANCE refers to individual clone member. There is
> > no
> > notify option on PGSQL primitive.
> 
> Interesting...and disturbing. I wonder why I never faced a bug
> related to this
> after so many tests in various OS and a bunch of running clusters in
> various
> environments. Plus, it hasn't been reported sooner by anyone.
> 
> Is it possible the clone members inherit this from the master
> definition or
> "crm_resource" to look at this higher level?

That's correct. For clone/master/group/bundle resources, setting meta-
attributes on the collective resource makes them effective for the
inner resources as well. So I don't think that's causing any issues
here.

> If I set a meta attr

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-27 Thread Jehan-Guillaume de Rorthais
On Sat, 27 Apr 2019 09:15:29 +0300
Andrei Borzenkov  wrote:

> 27.04.2019 1:04, Danka Ivanović пишет:
> > Hi, here is a complete cluster configuration:
> > 
> > node 1: master
> > node 2: secondary
> > primitive AWSVIP awsvip \
> > params secondary_private_ip=10.x.x.x api_delay=5
> > primitive PGSQL pgsqlms \
> > params pgdata="/var/lib/postgresql/9.5/main"
> > bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
> > recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
> > start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> > op start timeout=60s interval=0 \
> > op stop timeout=60s interval=0 \
> > op promote timeout=15s interval=0 \
> > op demote timeout=120s interval=0 \
> > op monitor interval=15s timeout=10s role=Master \
> > op monitor interval=16s timeout=10s role=Slave \
> > op notify timeout=60 interval=0
> > primitive fencing-postgres-ha-2 stonith:external/ec2 \
> > params port=master \
> > op start interval=0s timeout=60s \
> > op monitor interval=360s timeout=60s \
> > op stop interval=0s timeout=60s
> > primitive fencing-test-rsyslog stonith:external/ec2 \
> > params port=secondary \
> > op start interval=0s timeout=60s \
> > op monitor interval=360s timeout=60s \
> > op stop interval=0s timeout=60s
> > ms PGSQL-HA PGSQL \
> > meta notify=true
> > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
> > symmetrical=false
> > location loc-fence-master fencing-postgres-ha-2 -inf: master
> > location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
> > symmetrical=false
> > property cib-bootstrap-options: \
> > have-watchdog=false \
> > dc-version=1.1.14-70404b0 \
> > cluster-infrastructure=corosync \
> > cluster-name=psql-ha \
> > stonith-enabled=true \
> > no-quorum-policy=ignore \
> > last-lrm-refresh=1556315444 \
> > maintenance-mode=false
> > rsc_defaults rsc-options: \
> > resource-stickiness=10 \
> > migration-threshold=2
> > 
> > I tried to start manually postgres to be sure it is ok. There are no error
> > in postgres log. I also tried with different meta parameters, but always
> > with notify=true.
> > I also tried this:
> > ms PGSQL-HA PGSQL \
> > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> > notify=true interleave=true
> > I have followed this link:
> > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> > When stonith is enabled and working I imported all other resources and
> > constraints all together in the same time.
> > 
> > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais 
> > wrote:
> >   
> >> Hi,
> >>
> >> On Thu, 25 Apr 2019 18:57:55 +0200
> >> Danka Ivanović  wrote:
> >>  
> >>> Apr 25 16:39:50 [4213] master   lrmd:   notice:
> >>> operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> >>> must set meta parameter notify=true for your master resource ]  
> >>
> >> Resource agent pgsqlms refuse to start PgSQL because your configuration
> >> lacks
> >> the "notify=true" attribute in your master definition.
> >>  
> 
> PAF pgsqlms contains:
> 
> # check notify=true
> $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
>  --meta --get-parameter notify 2>/dev/null };
> chomp $ans;
> unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
> ocf_exit_reason(
> 'You must set meta parameter notify=true for your master
> resource'
> );
> exit $OCF_ERR_INSTALLED;
> }
> 
> but that is wrong - "notify" is set on ms definition, while
> $OCF_RESOURCE_INSTANCE refers to individual clone member. There is no
> notify option on PGSQL primitive.

Interesting...and disturbing. I wonder why I never faced a bug related to this
after so many tests on various OSes and a bunch of running clusters in various
environments. Plus, it hasn't been reported sooner by anyone.

Is it possible that the clone members inherit this from the master definition,
or that "crm_resource" looks at this higher level?

If I set a meta attribute at master level, it appears on clones as well:

  > crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
  pgsql-ha is active on more than one node, returning the default value for
  clone-max 
  Attribute 'clone-max' not found for 'pgsql-ha' 
  Error performing operation: No such device or address

  > crm_resource --resource pgsqld --meta --get-parameter=clone-max
  Attribute 'clone-max' not found for 'pgsqld:0'
  Error performing operation: No such device or address

  > crm_resource --resource=pgsql-ha --meta --set-parameter=clone-max \
--parameter-value=3

  Set 'pgsql-ha' option: id=pgsql-h

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-26 Thread Andrei Borzenkov
27.04.2019 1:04, Danka Ivanović пишет:
> Hi, here is a complete cluster configuration:
> 
> node 1: master
> node 2: secondary
> primitive AWSVIP awsvip \
> params secondary_private_ip=10.x.x.x api_delay=5
> primitive PGSQL pgsqlms \
> params pgdata="/var/lib/postgresql/9.5/main"
> bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
> recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
> start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> op start timeout=60s interval=0 \
> op stop timeout=60s interval=0 \
> op promote timeout=15s interval=0 \
> op demote timeout=120s interval=0 \
> op monitor interval=15s timeout=10s role=Master \
> op monitor interval=16s timeout=10s role=Slave \
> op notify timeout=60 interval=0
> primitive fencing-postgres-ha-2 stonith:external/ec2 \
> params port=master \
> op start interval=0s timeout=60s \
> op monitor interval=360s timeout=60s \
> op stop interval=0s timeout=60s
> primitive fencing-test-rsyslog stonith:external/ec2 \
> params port=secondary \
> op start interval=0s timeout=60s \
> op monitor interval=360s timeout=60s \
> op stop interval=0s timeout=60s
> ms PGSQL-HA PGSQL \
> meta notify=true
> colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
> symmetrical=false
> location loc-fence-master fencing-postgres-ha-2 -inf: master
> location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
> symmetrical=false
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> cluster-name=psql-ha \
> stonith-enabled=true \
> no-quorum-policy=ignore \
> last-lrm-refresh=1556315444 \
> maintenance-mode=false
> rsc_defaults rsc-options: \
> resource-stickiness=10 \
> migration-threshold=2
> 
> I tried to start manually postgres to be sure it is ok. There are no error
> in postgres log. I also tried with different meta parameters, but always
> with notify=true.
> I also tried this:
> ms PGSQL-HA PGSQL \
> meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> notify=true interleave=true
> I have followed this link:
> https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> When stonith is enabled and working I imported all other resources and
> constraints all together in the same time.
> 
> On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais 
> wrote:
> 
>> Hi,
>>
>> On Thu, 25 Apr 2019 18:57:55 +0200
>> Danka Ivanović  wrote:
>>
>>> Apr 25 16:39:50 [4213] master   lrmd:   notice:
>>> operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
>>> must set meta parameter notify=true for your master resource ]
>>
>> Resource agent pgsqlms refuse to start PgSQL because your configuration
>> lacks
>> the "notify=true" attribute in your master definition.
>>

PAF pgsqlms contains:

# check notify=true
$ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
    --meta --get-parameter notify 2>/dev/null };
chomp $ans;
unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
    ocf_exit_reason(
        'You must set meta parameter notify=true for your master resource' );
    exit $OCF_ERR_INSTALLED;
}

but that is wrong - "notify" is set on the ms definition, while
$OCF_RESOURCE_INSTANCE refers to an individual clone member. There is no
notify option on the PGSQL primitive. Why doesn't it check
OCF_RESKEY_CRM_meta_notify?
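
For illustration, a minimal sketch of what such a check could look like in
pgsqlms, reading the meta-attribute Pacemaker exports to the agent instead of
calling crm_resource (this is only a sketch of the idea discussed here, not the
shipped PAF code):

    # check notify=true from the environment set by Pacemaker
    my $notify = $ENV{'OCF_RESKEY_CRM_meta_notify'} || '';
    unless ( lc($notify) =~ /^(true|on|yes|y|1)$/ ) {
        ocf_exit_reason(
            'You must set meta parameter notify=true for your master resource' );
        exit $OCF_ERR_INSTALLED;
    }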
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-26 Thread Danka Ivanović
Hi, here is a complete cluster configuration:

node 1: master
node 2: secondary
primitive AWSVIP awsvip \
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
params pgdata="/var/lib/postgresql/9.5/main"
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
op start timeout=60s interval=0 \
op stop timeout=60s interval=0 \
op promote timeout=15s interval=0 \
op demote timeout=120s interval=0 \
op monitor interval=15s timeout=10s role=Master \
op monitor interval=16s timeout=10s role=Slave \
op notify timeout=60 interval=0
primitive fencing-postgres-ha-2 stonith:external/ec2 \
params port=master \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
primitive fencing-test-rsyslog stonith:external/ec2 \
params port=secondary \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
meta notify=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
symmetrical=false
location loc-fence-master fencing-postgres-ha-2 -inf: master
location loc-fence-secondary fencing-test-rsyslog -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
symmetrical=false
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=psql-ha \
stonith-enabled=true \
no-quorum-policy=ignore \
last-lrm-refresh=1556315444 \
maintenance-mode=false
rsc_defaults rsc-options: \
resource-stickiness=10 \
migration-threshold=2

I tried to start postgres manually to be sure it is OK. There are no errors
in the postgres log. I also tried different meta parameters, but always
with notify=true.
I also tried this:
ms PGSQL-HA PGSQL \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true interleave=true
I have followed this link:
https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
Once stonith was enabled and working, I imported all the other resources and
constraints together at the same time.

On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais 
wrote:

> Hi,
>
> On Thu, 25 Apr 2019 18:57:55 +0200
> Danka Ivanović  wrote:
>
> > Apr 25 16:39:50 [4213] master   lrmd:   notice:
> > operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> > must set meta parameter notify=true for your master resource ]
>
> Resource agent pgsqlms refuse to start PgSQL because your configuration
> lacks
> the "notify=true" attribute in your master definition.
>
> Could you please share your full Pacemaker configuration?
>
> Regards,
>


-- 
Pozdrav
Danka Ivanovic
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-26 Thread Jehan-Guillaume de Rorthais
Hi,

On Thu, 25 Apr 2019 18:57:55 +0200
Danka Ivanović  wrote:

> Apr 25 16:39:50 [4213] master   lrmd:   notice:
> operation_finished:   PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> must set meta parameter notify=true for your master resource ]

The resource agent pgsqlms refuses to start PostgreSQL because your configuration
lacks the "notify=true" attribute in your master definition.

Could you please share your full Pacemaker configuration?

Regards,
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-25 Thread Danka Ivanović
Hi,
Here are the logs from when Pacemaker fails to start the postgres service on the
master. It manages to start only the postgres slave.
I tried different configurations with the pgsqlms and pgsql resource agents.
These errors appear when I use the pgsqlms agent, whose configuration I sent
in the first mail:

Apr 25 16:40:23 [4213] master   lrmd: info: log_execute:  executing
- rsc:PGSQL action:start call_id:51
launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl
--pgdata /var/lib/postgresql/9.5/main -w --timeout 120 start -o -c
config_file=/etc/postgresql/9.5/main/postgresql.conf"
Apr 25 16:40:24 [4211] mastercib: info: cib_perform_op: +
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='PGSQL']/lrm_rsc_op[@id='PGSQL_last_0']:
@operation_key=PGSQL_start_0, @operation=start,
@transition-key=12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@transition-magic=0:0;12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@call-id=176, @rc-code=0, @exec-time=1146, @queue-time=0
Apr 25 16:40:53 [4216] master   crmd:debug: crm_timer_start: Started
Shutdown Escalation (I_STOP:120ms), src=53
Apr 25 16:41:23 [4213] master   lrmd:  warning:
child_timeout_callback: PGSQL_start_0
process (PID 5986) timed out

Part of the log is attached.

On Tue, 23 Apr 2019 at 17:28, Danka Ivanović 
wrote:

> Hi,
> It seems that ldap timeout caused cluster failure. Cluster is checking
> status every 15s on master and 16s on slave. Cluster needs postgres user
> for authentication, but ldap first query user on ldap server and then
> localy on host. When connection to ldap server was interrupted, cluster
> couldn't find postgres user and authenticate on db to check state. Problem
> is solved with reconfiguring /etc/ldap.conf and /etc/nslcd.conf. Following
> variable is added: nss_initgroups_ignoreusers with specified local users
> which should be ignored when querying ldap server. Thanks for your help. :)
> Another problem is that I cannot start postgres master with pacemaker.
> When I start postgres manually (with systemd) and then start pacemaker on
> slave, pacemaker is able to recognize master and start slave and failover
> works.
> That is another problem which I didn't manage to solve. Should I send a
> new mail for that issue or we can continue in this thread?
>
> On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais 
> wrote:
>
>> On Fri, 19 Apr 2019 17:26:14 +0200
>> Danka Ivanović  wrote:
>> ...
>> > Should I change any of those timeout parameters in order to avoid
>> timeout?
>>
>> You can try to raise the timeout, indeed. But as far as we don't know
>> **why**
>> your VMs froze for some time, it is difficult to guess how high should be
>> these timeouts.
>>
>> Not to mention that it will raise your RTO.
>>
>
>
> --
> Pozdrav
> Danka Ivanovic
>


-- 
Pozdrav
Danka Ivanovic
Apr 25 16:39:50 [4211] mastercib:debug: crm_client_new: 
Connecting 0x55d8444e8e80 for uid=0 gid=0 pid=5791 
id=c93d535d-77d8-4556-9a63-d9a1c2b45de9
Apr 25 16:39:50 [4211] mastercib:debug: handle_new_connection:  
IPC credentials authenticated (4211-5791-13)
Apr 25 16:39:50 [4211] mastercib:debug: qb_ipcs_shm_connect:
connecting to client [5791]
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2:   shm 
size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2:   shm 
size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2:   shm 
size:524301; real_size:528384; rb->word_size:132096
Apr 25 16:39:50 [4211] mastercib:debug: cib_acl_enabled:CIB ACL 
is disabled
Apr 25 16:39:50 [4211] mastercib:debug: 
qb_ipcs_dispatch_connection_request:HUP conn (4211-5791-13)
Apr 25 16:39:50 [4211] mastercib:debug: qb_ipcs_disconnect: 
qb_ipcs_disconnect(4211-5791-13) state:2
Apr 25 16:39:50 [4211] mastercib:debug: crm_client_destroy: 
Destroying 0 events
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_close:
Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-4211-5791-13-header
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_close:
Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-4211-5791-13-header
Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_close:
Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-4211-5791-13-header
Apr 25 16:39:50 [15544] master corosync debug   [QB] IPC credentials 
authenticated (15544-5837-24)
Apr 25 16:39:50 [15544] master corosync debug   [QB] connecting to client 
[5837]
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; 
real_size:1052672; rb->word_size:263168
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; 
real_size:1052672; rb->word_size:263168
Apr 25 16:39:50 [15544] master corosync debug   [QB] shm size:1048589; 
real_size:1052672; rb->word_si

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-23 Thread Danka Ivanović
Hi,
It seems that an LDAP timeout caused the cluster failure. The cluster checks the
status every 15s on the master and 16s on the slave. The cluster needs the
postgres user for authentication, but the lookup first queries the LDAP server
and only then the local host. When the connection to the LDAP server was
interrupted, the cluster couldn't resolve the postgres user and authenticate to
the database to check its state. The problem is solved by reconfiguring
/etc/ldap.conf and /etc/nslcd.conf: the nss_initgroups_ignoreusers variable is
added with the local users which should be ignored when querying the LDAP
server. Thanks for your help. :)
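For reference, a minimal sketch of the relevant nslcd.conf change (the exact user
list depends on the local accounts; nslcd needs a restart afterwards):

  # /etc/nslcd.conf
  # do not ask the LDAP server for the supplementary groups of these local users
  nss_initgroups_ignoreusers postgres,hacluster

  # systemctl restart nslcd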
Another problem is that I cannot start the postgres master with Pacemaker. When
I start postgres manually (with systemd) and then start Pacemaker on the slave,
Pacemaker is able to recognize the master and start the slave, and failover works.
That is another problem which I didn't manage to solve. Should I send a new
mail for that issue, or can we continue in this thread?

On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais 
wrote:

> On Fri, 19 Apr 2019 17:26:14 +0200
> Danka Ivanović  wrote:
> ...
> > Should I change any of those timeout parameters in order to avoid
> timeout?
>
> You can try to raise the timeout, indeed. But as far as we don't know
> **why**
> your VMs froze for some time, it is difficult to guess how high should be
> these timeouts.
>
> Not to mention that it will raise your RTO.
>


-- 
Pozdrav
Danka Ivanovic
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Jehan-Guillaume de Rorthais
On Fri, 19 Apr 2019 17:26:14 +0200
Danka Ivanović  wrote:
...
> Should I change any of those timeout parameters in order to avoid timeout?

You can try to raise the timeouts, indeed. But as long as we don't know **why**
your VMs froze for some time, it is difficult to guess how high these timeouts
should be.

Not to mention that it will raise your RTO.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Here is the command output from crm configure show:

node 1: master \
attributes master-PGSQL=1001
node 2: secondary \
attributes master-PGSQL=1000
primitive AWSVIP awsvip \
params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
params pgdata="/var/lib/postgresql/9.5/main"
bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
op start timeout=60s interval=0 \
op stop timeout=60s interval=0 \
op promote timeout=15s interval=0 \
op demote timeout=120s interval=0 \
op monitor interval=15s timeout=10s role=Master \
op monitor interval=16s timeout=10s role=Slave \
op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
params port=master \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
params port=secondary \
op start interval=0s timeout=60s \
op monitor interval=360s timeout=60s \
op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
symmetrical=false
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=pgc-psql-ha \
stonith-enabled=true \
no-quorum-policy=ignore \
maintenance-mode=false \
last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
resource-stickiness=10 \
migration-threshold=1

Should I change any of those timeout parameters in order to avoid timeout?

On Fri, 19 Apr 2019 at 12:23, Danka Ivanović 
wrote:

> Thanks for the clarification about failure-timeout, migration threshold
> and pacemaker.
> Instances are hosted on AWS cloud, and they are in the same security
> groups and availability zones.
> I don't have information about hardware which hosts those VMs since they
> are non dedicated. UTC timezone is configured on both machines and default
> ntp configuration.
>      remote           refid            st t when poll reach   delay   offset  jitter
> ==============================================================================
>  0.ubuntu.pool.n .POOL.           16 p    -   64    0    0.000    0.000   0.000
>  1.ubuntu.pool.n .POOL.           16 p    -   64    0    0.000    0.000   0.000
>  2.ubuntu.pool.n .POOL.           16 p    -   64    0    0.000    0.000   0.000
>  3.ubuntu.pool.n .POOL.           16 p    -   64    0    0.000    0.000   0.000
>  ntp.ubuntu.com  .POOL.           16 p    -   64    0    0.000    0.000   0.000
> +198.46.223.227  204.9.54.119      2 u   65  512  377   22.318    0.096   1.111
> -time1.plumdev.n .GPS.             1 u  116  512  377   72.487    1.386   0.544
> -199.180.133.100 140.142.2.8       3 u  839 1024  377   65.574   -1.199   1.167
> +helium.constant 128.59.0.245      2 u  217  512  377    7.368    0.952   0.090
> *i.will.not.be.e 213.251.128.249   2 u  207  512  377   14.733    1.185   0.305
>
>
> On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais 
> wrote:
>
>> On Fri, 19 Apr 2019 11:08:33 +0200
>> Danka Ivanović  wrote:
>>
>> > Hi,
>> > Thank you for your response.
>> >
>> > Ok, It seems that fencing resources and secondary timed out at the same
>> > time, together with ldap.
>> > I understand that because of "migration-threshold=1", standby tried to
>> > recover just once and then was stopped. Is this ok, or the threshold
>> should
>> > be increased?
>>
>> It depend on your usecase really.
>>
>> Note that as soon as a resource hits migration threshold, there's an
>> implicit
>> constraint forbidding it to come back on this node until you reset the
>> failcount. That's why your pgsql master resource never came back anywhere.
>>
>> You can as well set failure-timeout if you are brave enough to automate
>> the
>> failure reset. See:
>>
>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>>
>> > Master server is started with systemctl, then pacemaker is started on
>> > master, which detects master and then when starting pacemaker on
>> secondary
>> > it brings up postgres service in slave mode.
>>
>> You should not. Systemd should not mess with resources handled by
>> Pacemaker.
>>
>> > I didn't manage to start the postgres master over pacemaker. I tested
>> > failover with this setup and it works. I will try to set up postgres to
>> > be run with pacemaker,
>>
>> Pacemaker is supposed to start the resource itself if it is enabled in its
>> setup. Look at this whole chapter (its end is important):
>> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Thanks for the clarification about failure-timeout, migration threshold and
pacemaker.
Instances are hosted on AWS, and they are in the same security groups and
availability zones.
I don't have information about the hardware which hosts those VMs since they
are not dedicated. The UTC timezone is configured on both machines, together
with the default ntp configuration.
     remote           refid      st t  when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305


On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais 
wrote:

> On Fri, 19 Apr 2019 11:08:33 +0200
> Danka Ivanović  wrote:
>
> > Hi,
> > Thank you for your response.
> >
> > Ok, it seems that the fencing resources and the secondary timed out at
> > the same time, together with ldap.
> > I understand that because of "migration-threshold=1", the standby tried
> > to recover just once and then was stopped. Is this ok, or should the
> > threshold be increased?
>
> It depends on your use case, really.
>
> Note that as soon as a resource hits the migration threshold, there's an
> implicit constraint forbidding it from coming back on this node until you
> reset the failcount. That's why your pgsql master resource never came back
> anywhere.
>
> You can also set failure-timeout if you are brave enough to automate the
> failure reset. See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
>
> > The master server is started with systemctl, then pacemaker is started
> > on the master, which detects the master, and then starting pacemaker on
> > the secondary brings up the postgres service in slave mode.
>
> You should not. Systemd should not mess with resources handled by
> Pacemaker.
>
> > I didn't manage to start the postgres master over pacemaker. I tested
> > failover with this setup and it works. I will try to set up postgres to
> > be run with pacemaker,
>
> Pacemaker is supposed to start the resource itself if it is enabled in its
> setup. Look at this whole chapter (its end is important):
>
> https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
>
> > but I am concerned about those timeouts which caused the cluster to
> > crash. Can you help me investigate why this happened, or what I should
> > change in order to avoid it? For the virtual IP, an AWS secondary IP is
> > used.
>
> Really, I can't help on this. It looks like suddenly both VMs froze most
> of their processes, or maybe there was some kind of clock jump, exhausting
> the timeouts... I really don't know.
>
> It sounds more related to your virtualization stack, I suppose. Maybe some
> kind of "hot" backup? Maybe the hypervisor didn't schedule enough CPU for
> your VMs for too long?
>
> It is surprising that both VMs had timeouts at almost the same time. Do
> you know if they are on the same hypervisor host? If they are, this is a
> SPoF: you should move one of them to another host.
>
> ++
>
> > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <
> j...@dalibo.com>
> > wrote:
> >
> > > On Thu, 18 Apr 2019 14:19:44 +0200
> > > Danka Ivanović  wrote:
> > >
> > >
> > >
> > > It seems you had a timeout for both fencing resources and your standby
> > > at the same time here:
> > >
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-secondary on master: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for fencing-master on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing
> fencing-secondary
> > > >   away from master after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing
> fencing-master
> > > away
> > > >   from secondary after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA
> away
> > > from
> > > >   secondary after 1 failures (max=1)
> > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA
> away
> > > from
> > >   secondary after 1 failures (max=1)

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Jehan-Guillaume de Rorthais
On Fri, 19 Apr 2019 11:08:33 +0200
Danka Ivanović  wrote:

> Hi,
> Thank you for your response.
> 
> Ok, it seems that the fencing resources and the secondary timed out at the
> same time, together with ldap.
> I understand that because of "migration-threshold=1", the standby tried to
> recover just once and then was stopped. Is this ok, or should the threshold
> be increased?

It depends on your use case, really.

Note that as soon as a resource hits the migration threshold, there's an
implicit constraint forbidding it from coming back on this node until you
reset the failcount. That's why your pgsql master resource never came back
anywhere.

You can also set failure-timeout if you are brave enough to automate the
failure reset. See:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
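
As a rough sketch (crmsh syntax; the 10min value is purely illustrative),
that would look something like:

  # illustrative only: expire recorded failures automatically after 10 min
  crm configure rsc_defaults failure-timeout=10min
  # or, after investigating, clear the failcount by hand
  crm resource cleanup PGSQL-HA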

> The master server is started with systemctl, then pacemaker is started on
> the master, which detects the master, and then starting pacemaker on the
> secondary brings up the postgres service in slave mode.

You should not. Systemd should not mess with resources handled by Pacemaker.

> I didn't manage to start the postgres master over pacemaker. I tested
> failover with this setup and it works. I will try to set up postgres to
> be run with pacemaker,

Pacemaker is supposed to start the resource itself if it is enabled in its
setup. Look at this whole chapter (its end is important):
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster
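
Concretely, a hedged sketch (crmsh syntax, resource name taken from your
config dump) would be:

  # let Pacemaker manage and start the resource itself, not systemd
  crm resource manage PGSQL-HA
  crm resource start PGSQL-HA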

> but I am concerned about those timeouts which caused the cluster to crash.
> Can you help me investigate why this happened, or what I should change in
> order to avoid it? For the virtual IP, an AWS secondary IP is used.

Really, I can't help on this. It looks like suddenly both VMs froze most of
their processes, or maybe there was some kind of clock jump, exhausting the
timeouts... I really don't know.

It sounds more related to your virtualization stack, I suppose. Maybe some
kind of "hot" backup? Maybe the hypervisor didn't schedule enough CPU for
your VMs for too long?

It is surprising that both VMs had timeouts at almost the same time. Do you
know if they are on the same hypervisor host? If they are, this is a SPoF:
you should move one of them to another host.
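
On AWS, one hedged option to avoid sharing a host (group name and instance
id below are placeholders) is a spread placement group, e.g.:

  # illustrative only: create a spread placement group and move one instance
  # into it (the instance must be stopped before changing its placement)
  aws ec2 create-placement-group --group-name pgsql-ha-spread --strategy spread
  aws ec2 modify-instance-placement --instance-id i-0123456789abcdef0 \
      --group-name pgsql-ha-spread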

++

> On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais 
> wrote:
> 
> > On Thu, 18 Apr 2019 14:19:44 +0200
> > Danka Ivanović  wrote:
> >
> >
> >
> > It seems you had a timeout for both fencing resources and your standby
> > at the same time here:
> >  
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:1 on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > >   away from master after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master  
> > away  
> > >   from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away  
> > from  
> > >   secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away  
> > from  
> > >   secondary after 1 failures (max=1)  
> >
> > Because you have "migration-threshold=1", the standby will be shut down:
> >  
> > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)  
> >
> > The transition is stopped because the pgsql master timed out in the
> > meantime:
> >  
> > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped  
> >
> > and as you mentioned, your ldap as well:
> >  
> > > Apr 17 10:03:40 master nslcd[1518]: [d7e446]  ldap_result()
> > > timed out  
> >
> > Here are the four timeout errors (2 fencings and 2 pgsql instances):
> >  
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:0 on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > >   monitor for PGSQL:1 on secondary: unknown error (1)  
> >
> > As a reaction, Pacemaker decides to stop everything because it cannot
> > move resources anywhere:
> >  
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away  
> > from  
> > > master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-19 Thread Danka Ivanović
Hi,
Thank you for your response.

Ok, it seems that the fencing resources and the secondary timed out at the
same time, together with ldap.
I understand that because of "migration-threshold=1", the standby tried to
recover just once and then was stopped. Is this ok, or should the threshold
be increased?

The master server is started with systemctl, then pacemaker is started on
the master, which detects the master, and then starting pacemaker on the
secondary brings up the postgres service in slave mode.
I didn't manage to start the postgres master over pacemaker. I tested
failover with this setup and it works. I will try to set up postgres to be
run with pacemaker, but I am concerned about those timeouts which caused the
cluster to crash. Can you help me investigate why this happened, or what I
should change in order to avoid it? For the virtual IP, an AWS secondary IP
is used.
Link to the awsvip resource:

https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/awsvip
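
For reference, a minimal awsvip primitive would look roughly like this (the
IP is a placeholder and the parameter name is as documented in the linked
agent, not necessarily my exact definition):

  primitive AWSVIP ocf:heartbeat:awsvip \
      params secondary_private_ip=10.0.0.50 \
      op monitor interval=30s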


Link to the ec2 stonith resource agent:


https://raw.githubusercontent.com/ClusterLabs/cluster-glue/master/lib/plugins/stonith/external/ec2


Command output when cluster works:

crm status

Output:

Stack: corosync

Current DC: postgres-ha-1 (version 1.1.14-70404b0) - partition with quorum

2 nodes and 5 resources configured

Online: [ postgres-ha-1 postgres-ha-2 ]

Full list of resources:

AWSVIP (ocf::heartbeat:awsvip): Started postgres-ha-1

Master/Slave Set: PGSQL-HA [PGSQL]

Masters: [ postgres-ha-1 ]

Slaves: [ postgres-ha-2 ]

fencing-postgres-ha-1 (stonith:external/ec2): Started postgres-ha-2

fencing-postgres-ha-2 (stonith:external/ec2): Started postgres-ha-1


On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais 
wrote:

> On Thu, 18 Apr 2019 14:19:44 +0200
> Danka Ivanović  wrote:
>
>
>
> It seems you had a timeout for both fencing resources and your standby at
> the same time here:
>
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:1 on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> >   away from master after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> away
> >   from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> >   secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> >   secondary after 1 failures (max=1)
>
> Because you have "migration-threshold=1", the standby will be shut down:
>
> > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>
> The transition is stopped because the pgsql master timed out in the
> meantime:
>
> > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>
> and as you mentioned, your ldap as well:
>
> > Apr 17 10:03:40 master nslcd[1518]: [d7e446]  ldap_result()
> > timed out
>
> Here are the four timeout errors (2 fencings and 2 pgsql instances):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:0 on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> >   monitor for PGSQL:1 on secondary: unknown error (1)
>
> As a reaction, Pacemaker decides to stop everything because it cannot move
> resources anywhere:
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> from
> > secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> > Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-18 Thread Jehan-Guillaume de Rorthais
On Thu, 18 Apr 2019 14:19:44 +0200
Danka Ivanović  wrote:



It seems you had a timeout for both fencing resources and your standby at
the same time here:

> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:1 on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
>   away from master after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
>   from secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
>   secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
>   secondary after 1 failures (max=1)

Because you have "migration-threshold=1", the standby will be shut down:

> Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

The transition is stopped because the pgsql master timed out in the
meantime:

> Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> Pending=0, Fired=0, Skipped=1, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped

and as you mentioned, your ldap as well:

> Apr 17 10:03:40 master nslcd[1518]: [d7e446]  ldap_result()
> timed out

Here are the four timeout errors (2 fencings and 2 pgsql instances):

> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:0 on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
>   monitor for PGSQL:1 on secondary: unknown error (1)

As a reaction, Pacemaker decides to stop everything because it cannot move
resources anywhere:

> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

Now, the following lines are really not expected. Why does systemd report
PostgreSQL as stopped?

> Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> process exited, code=exited status=2
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> entered failed state.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with
> result 'exit-code'.

I suspect the service is still enabled or has been started by hand.

As soon as you set up a resource in Pacemaker, admins should **always** ask
Pacemaker to start/stop it. Never use systemctl to handle the resource
yourself.

You must disable this service in systemd.
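
For example (unit names taken from your logs; adapt to your setup):

  # check whether systemd still thinks it owns the instance
  systemctl is-enabled postgresql@9.5-main
  # disable it so only Pacemaker starts and stops PostgreSQL
  systemctl disable postgresql@9.5-main
  # on Debian/Ubuntu there may also be a wrapper unit
  systemctl disable postgresql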

++

[ClusterLabs] Fwd: Postgres pacemaker cluster failure

2019-04-18 Thread Danka Ivanović
Hi,

Can you help me with troubleshooting a postgres pacemaker cluster failure?
Today the cluster failed without promoting the secondary to master. At the
same time, an ldap timeout appeared.
Here are the logs; the master was stopped by pacemaker at 10:03:40 AM UTC.
Thank you in advance.
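
(If a fuller capture would help, I assume I can also generate one with
crm_report around that window, e.g.
  crm_report -f "2019-04-17 09:50:00" -t "2019-04-17 10:15:00" /tmp/pgc-failure
and attach the result.)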

corosync.log

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave
secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461:
/var/lib/pacemaker/pengine/pe-input-58.bz2
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
monitor for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462:
/var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process
(PID 32372) timed out
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 -
timed out after 1ms
Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by
PGSQL_monitor_15000 'modify' on master: Old event
(magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14,
source=process_graph_event:605, 0)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32400-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify:
environment variables: {
Apr 17