Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
We tried to fix the LDAP issue with the nss_initgroups_ignoreusers option in
nslcd.conf for the postgres and hacluster users, so the cluster shouldn't
contact the LDAP server every 15 seconds when it checks psql with the
postgres user:

    /usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432

We have two LDAP servers, and when one was unavailable the cluster failed
immediately due to a timeout, even though it could reach the other LDAP
server.

I know starting the master database with systemctl should be avoided, but I
didn't find a way to start it with pacemaker. I will test again, but I am out
of ideas, because I have already tried different pgsqlms options and
different versions of postgres. But now it looks like something else
happened...

On Wed, Jul 10, 2019 at 4:57 PM Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jul 2019 16:34:17 +0200
> Danka Ivanovic wrote:
>
> > Hi, Thank you all for responding so quickly. Part of the corosync.log
> > file is attached. The cluster failure occurred at 09:16 AM yesterday.
> > Debug mode is turned on in the corosync configuration, but I didn't
> > turn it on in the pacemaker config. I will test that.
>
> There's really nothing interesting in there, sadly. It could even be that
> pgsqlms wasn't called at all and the action timed out...
>
> > The Postgres log is also attached.
>
> Nothing really relevant there either.
>
> > Several times the cluster failed because of an LDAP timeout, even
> > though I tried to disable LDAP lookups for the local postgres user,
>
> This is really annoying. IIRC, this was already happening last time. Fix
> this first if you haven't yet?
>
> ...
>
> > From syslog it looks like the postgres systemd process was stopped,
>
> Again, systemd shouldn't take part in anything in your cluster with
> regard to PostgreSQL. If Pacemaker manages PostgreSQL, systemd should
> have nothing to do with it.
>
> If you really need to start/stop it by hand (I strongly discourage it),
> do it using pg_ctl, and make sure to unmanage the Pacemaker resource
> first.
>
> ...
>
> --
> Jehan-Guillaume de Rorthais
> Dalibo

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
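The nslcd change described above can be sketched as the following
/etc/nslcd.conf fragment. This is a hedged example: the option names come
from nslcd.conf(5), but the exact user list and timeout values are
assumptions to adapt to your setup.

```
# /etc/nslcd.conf (fragment, example only)
# Skip initgroups (supplementary group) lookups via LDAP for local
# service accounts, so monitor operations don't block on an LDAP timeout.
nss_initgroups_ignoreusers postgres,hacluster,root

# Keep LDAP connect/search timeouts short so an unreachable server
# is abandoned quickly in favor of the second one.
bind_timelimit 2
timelimit 2
```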
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, 10 Jul 2019 17:25:57 +0200 Danka Ivanovic wrote:

...

> I know starting the master database with systemctl should be avoided, but
> I didn't find a way to start it with pacemaker. I will test again, but I
> am out of ideas.

Put the cluster in debug mode and provide the full logs + the pacemaker
configuration + the pgsql configurations. It will certainly help us
understand.

> ...

--
Jehan-Guillaume de Rorthais
Dalibo
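Putting the cluster in debug mode, as requested above, can be sketched with
the two fragments below. The option names (PCMK_debug, corosync's logging
section) are standard, but the file paths vary by distribution (e.g.
/etc/default/pacemaker on Debian) and the values shown are examples.

```
# /etc/sysconfig/pacemaker (Debian: /etc/default/pacemaker)
PCMK_debug=yes          # debug logging for all pacemaker daemons
# PCMK_debug=crmd,lrmd  # or restrict it to selected daemons

# /etc/corosync/corosync.conf, logging section
logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: on
    timestamp: on
}
```

Restart the cluster services (or the node) after changing these for the
settings to take effect.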
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, 10 Jul 2019 16:34:17 +0200 Danka Ivanovic wrote:

> Hi, Thank you all for responding so quickly. Part of the corosync.log
> file is attached. The cluster failure occurred at 09:16 AM yesterday.
> Debug mode is turned on in the corosync configuration, but I didn't turn
> it on in the pacemaker config. I will test that.

There's really nothing interesting in there, sadly. It could even be that
pgsqlms wasn't called at all and the action timed out...

> The Postgres log is also attached.

Nothing really relevant there either.

> Several times the cluster failed because of an LDAP timeout, even though
> I tried to disable LDAP lookups for the local postgres user,

This is really annoying. IIRC, this was already happening last time. Fix
this first if you haven't yet?

...

> From syslog it looks like the postgres systemd process was stopped,

Again, systemd shouldn't take part in anything in your cluster with regard
to PostgreSQL. If Pacemaker manages PostgreSQL, systemd should have nothing
to do with it.

If you really need to start/stop it by hand (I strongly discourage it), do
it using pg_ctl, and make sure to unmanage the Pacemaker resource first.

> ...

--
Jehan-Guillaume de Rorthais
Dalibo
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, 10 Jul 2019 12:53:59 +0300 Andrei Borzenkov wrote:

> On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote:
> >
> > > > Jul 09 09:16:32 [2679] postgres1 lrmd: debug: child_kill_helper:
> > > > Kill pid 12735's group
> > > > Jul 09 09:16:34 [2679] postgres1 lrmd: warning:
> > > > child_timeout_callback: PGSQL_monitor_15000 process (PID 12735)
> > > > timed out
> > >
> > > You probably want to enable debug output in the resource agent. As
> > > far as I can tell, this requires HA_debug=1 in the environment of the
> > > resource agent, but for the life of me I cannot find where it is
> > > possible to set it.
> > >
> > > Probably setting it directly in the resource agent is the simplest
> > > way for debugging.
> >
> > I usually set this in "/etc/sysconfig/pacemaker". I never tried to add
> > it to pgsqlms, interesting.
>
> As far as I understand, that will set it for every process spawned by
> pacemaker, which may be too much (it would enable debug output in every
> resource agent for every resource).

Indeed, it does.

> Some generic means to set it for a specific resource only may be useful
> for targeted troubleshooting.

This would be useful. I had a quick look in resource-agents and saw no RA
setting HA_debug itself based on some reloadable parameter. Is it possible?
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote:
>
> > > Jul 09 09:16:32 [2679] postgres1 lrmd: debug: child_kill_helper:
> > > Kill pid 12735's group
> > > Jul 09 09:16:34 [2679] postgres1 lrmd: warning:
> > > child_timeout_callback: PGSQL_monitor_15000 process (PID 12735)
> > > timed out
> >
> > You probably want to enable debug output in the resource agent. As far
> > as I can tell, this requires HA_debug=1 in the environment of the
> > resource agent, but for the life of me I cannot find where it is
> > possible to set it.
> >
> > Probably setting it directly in the resource agent is the simplest way
> > for debugging.
>
> I usually set this in "/etc/sysconfig/pacemaker". Never tried to add it
> to pgsqlms, interesting.

As far as I understand, that will set it for every process spawned by
pacemaker, which may be too much (it would enable debug output in every
resource agent for every resource). Some generic means to set it for a
specific resource only may be useful for targeted troubleshooting.

Today one could also simply set the environment variable in the systemd
unit definition, but that would have the same global effect.
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, Jul 10, 2019 at 12:42 PM Jehan-Guillaume de Rorthais wrote:
>
> > P.S. crm_resource is called by the resource agent (pgsqlms). And it
> > shows the result of the original resource probing, which makes it
> > confusing. At least it explains where these log entries come from.
>
> Not sure I understand what you mean :/

I probably mixed it up with another thread where it was unclear where the
crm_resource debug output originated from. Sorry.
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Tue, 9 Jul 2019 19:57:06 +0300 Andrei Borzenkov wrote:

> On 09.07.2019 13:08, Danka Ivanović wrote:
> > Hi, I didn't manage to start the master with postgres, even when I
> > increased the start timeout. I checked the executable paths and start
> > options.

We would need many more logs from this failure...

> > When the cluster is running with a manually started master and the
> > slave started by pacemaker, everything works OK.

Logs from this scenario might be interesting as well, to check and compare.

> > Today we had a failover again.
> > I cannot find the reason in the logs; can you help me with debugging?
> > Thanks.

Logs, logs, logs, please.

> > Jul 09 09:16:32 [2679] postgres1 lrmd: debug: child_kill_helper:
> > Kill pid 12735's group
> > Jul 09 09:16:34 [2679] postgres1 lrmd: warning: child_timeout_callback:
> > PGSQL_monitor_15000 process (PID 12735) timed out
>
> You probably want to enable debug output in the resource agent. As far as
> I can tell, this requires HA_debug=1 in the environment of the resource
> agent, but for the life of me I cannot find where it is possible to set
> it.
>
> Probably setting it directly in the resource agent is the simplest way
> for debugging.

I usually set this in "/etc/sysconfig/pacemaker". I never tried to add it
to pgsqlms, interesting.

> P.S. crm_resource is called by the resource agent (pgsqlms). And it shows
> the result of the original resource probing, which makes it confusing. At
> least it explains where these log entries come from.

Not sure I understand what you mean :/
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On 09.07.2019 13:08, Danka Ivanović wrote:
> Hi, I didn't manage to start the master with postgres, even when I
> increased the start timeout. I checked the executable paths and start
> options.
> When the cluster is running with a manually started master and the slave
> started by pacemaker, everything works OK. Today we had a failover again.
> I cannot find the reason in the logs; can you help me with debugging?
> Thanks.
>
> Jul 09 09:16:32 [2679] postgres1 lrmd: debug: child_kill_helper:
> Kill pid 12735's group
> Jul 09 09:16:34 [2679] postgres1 lrmd: warning: child_timeout_callback:
> PGSQL_monitor_15000 process (PID 12735) timed out

You probably want to enable debug output in the resource agent. As far as I
can tell, this requires HA_debug=1 in the environment of the resource
agent, but for the life of me I cannot find where it is possible to set it.

Probably setting it directly in the resource agent is the simplest way for
debugging.

P.S. crm_resource is called by the resource agent (pgsqlms). And it shows
the result of the original resource probing, which makes it confusing. At
least it explains where these log entries come from.
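Setting it directly in the agent, as suggested, can be sketched like this
for a shell-based OCF agent (pgsqlms itself is Perl, where the rough
equivalent would be setting `$ENV{'HA_debug'}` near the top; the sourcing
line is shown commented out because the helper path varies by system):

```shell
# Sketch: force RA debug output by exporting HA_debug before the OCF
# shell helpers are loaded. The helper path below is an assumption.
HA_debug=1
export HA_debug
# . "${OCF_FUNCTIONS_DIR:-/usr/lib/ocf/lib/heartbeat}/ocf-shellfuncs"
echo "HA_debug=$HA_debug"   # prints: HA_debug=1
```

Remember to remove the line again after debugging, since it affects every
invocation of that agent on the node.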
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Thu, 2019-05-16 at 10:20 +0200, Jehan-Guillaume de Rorthais wrote:

...

> > (but hopefully with the correct value).
>
> And that's where I am now losing some confidence in these environment
> variables :)
> "Hopefully" and "I think it is accurate" ([1]) are quite scary to me :/

It looks perfectly reliable to me :) but Andrei's comments make me want
more information.

If I understand correctly, he's saying that the presence of the notify
variable is unreliable. That's fine if the option is not specified by the
user and the variable is either not present or present as false. But it
would indicate a bug if the variable is not present when the option *is*
specified by the user, or if the variable is present as true when the
option is not specified by the user.

Personally, I'd rely on it. The controller gets the environment variable
values from the entries in the scheduler's result. We have numerous
examples in the scheduler regression test data, typically installed under
/usr/share/pacemaker/tests, in scheduler/*.exp (for 2.0) or
pengine/test10/*.exp (for 1.1).

--
Ken Gaillot
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, 15 May 2019 16:53:48 -0500 Ken Gaillot wrote:

> On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Mon, 29 Apr 2019 19:59:49 +0300 Andrei Borzenkov wrote:
> > > OCF_RESKEY_CRM_meta_notify is passed also when the "notify"
> > > meta-attribute is *not* specified, as well as a couple of others.
> > > But not all
>
> Hopefully in that case it's passed as false? I vaguely remember some case
> where clone attributes were mistakenly passed to non-clone resources, but
> I think notify is always accurate for clone resources.

[1]

> > > possible attributes. And some OCF_RESKEY_CRM_meta_* variables that
> > > are passed do not correspond to any user-settable and documented
> > > meta-attribute, like OCF_RESKEY_CRM_meta_clone.
> >
> > Sorry guys, now I am confused.
>
> A well-known side effect of pacemaker ;)
>
> > Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You two don't
> > seem to agree on where it comes from. Is it only an unexpected side
> > effect, or is it a safe and stable code path in Pacemaker we can rely
> > on?
>
> It's reliable. All user-specified meta-attributes end up as environment
> variables

OK...

> -- it's just meta-attributes that *aren't* specified by the user that
> may or may not show up

OK...

> (but hopefully with the correct value).

And that's where I am now losing some confidence in these environment
variables :)
"Hopefully" and "I think it is accurate" ([1]) are quite scary to me :/
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Wed, 2019-05-15 at 11:50 +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Apr 2019 19:59:49 +0300 Andrei Borzenkov wrote:
> > On 29.04.2019 18:05, Ken Gaillot wrote:
> > > > > Why does it not check OCF_RESKEY_CRM_meta_notify?
> > > >
> > > > I was just not aware of this env variable. Sadly, it is not
> > > > documented anywhere :(
> > >
> > > It's not a Pacemaker-created value like the other notify variables
> > > -- all user-specified meta-attributes are passed that way. We do
> > > need to document that.
> >
> > OCF_RESKEY_CRM_meta_notify is passed also when the "notify"
> > meta-attribute is *not* specified, as well as a couple of others. But
> > not all

Hopefully in that case it's passed as false? I vaguely remember some case
where clone attributes were mistakenly passed to non-clone resources, but I
think notify is always accurate for clone resources.

> > possible attributes. And some OCF_RESKEY_CRM_meta_* variables that are
> > passed do not correspond to any user-settable and documented
> > meta-attribute, like OCF_RESKEY_CRM_meta_clone.
>
> Sorry guys, now I am confused.

A well-known side effect of pacemaker ;)

> Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You two don't seem
> to agree on where it comes from. Is it only an unexpected side effect, or
> is it a safe and stable code path in Pacemaker we can rely on?

It's reliable. All user-specified meta-attributes end up as environment
variables -- it's just meta-attributes that *aren't* specified by the user
that may or may not show up (but hopefully with the correct value).

> Is it worth a patch to the pgsqlms RA?
>
> Thanks,

--
Ken Gaillot
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Mon, 29 Apr 2019 19:59:49 +0300 Andrei Borzenkov wrote:

> On 29.04.2019 18:05, Ken Gaillot wrote:
> > > > Why does it not check OCF_RESKEY_CRM_meta_notify?
> > >
> > > I was just not aware of this env variable. Sadly, it is not
> > > documented anywhere :(
> >
> > It's not a Pacemaker-created value like the other notify variables --
> > all user-specified meta-attributes are passed that way. We do need to
> > document that.
>
> OCF_RESKEY_CRM_meta_notify is passed also when the "notify"
> meta-attribute is *not* specified, as well as a couple of others. But not
> all possible attributes. And some OCF_RESKEY_CRM_meta_* variables that
> are passed do not correspond to any user-settable and documented
> meta-attribute, like OCF_RESKEY_CRM_meta_clone.

Sorry guys, now I am confused.

Is it safe or not to use OCF_RESKEY_CRM_meta_notify? You two don't seem to
agree on where it comes from. Is it only an unexpected side effect, or is
it a safe and stable code path in Pacemaker we can rely on?

Is it worth a patch to the pgsqlms RA?

Thanks,
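For what it's worth, a defensive check of that variable from a shell-based
agent might look like this sketch, treating an absent variable the same as
false, per the discussion above (the accepted truthy values mirror the
regex shown later in this thread for pgsqlms):

```shell
# Sketch: tolerate both an absent and an explicit notify meta-attribute
# in the resource agent's environment.
notify_enabled() {
    case "$(printf '%s' "${OCF_RESKEY_CRM_meta_notify:-false}" \
            | tr '[:upper:]' '[:lower:]')" in
        true|on|yes|y|1) return 0 ;;
        *) return 1 ;;
    esac
}

OCF_RESKEY_CRM_meta_notify=true
if notify_enabled; then echo "notify enabled"; fi   # prints: notify enabled
```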
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Tue, 30 Apr 2019 17:28:44 +0200 Danka Ivanović wrote:

> Hi, I tried a new clean config with upgraded postgres, corosync and
> pacemaker packages.

In this attempt, your PostgreSQL resource timed out while starting up:

  Apr 30 15:09:43 [13342] master lrmd: debug: operation_finished:
      PGSQL_start_0:13864:stdout [FATAL: the database system is starting up]
  Apr 30 15:09:43 [13342] master lrmd: info: log_finished:
      finished - rsc:PGSQL action:start call_id:21 pid:13864 exit-code:1
      exec-time:60003ms queue-time:0ms
  Apr 30 15:09:43 [13345] master crmd: debug: create_operation_update:
      do_update_resource: Updating resource PGSQL after start op Timed Out
      (interval=0)

I suppose your local instance had many WALs to replay before becoming
consistent and accepting connections, and the 60s timeout wasn't enough.
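If the start timeout is the culprit, raising it for the start operation
would look roughly like this in the crm shell syntax already used in this
thread. The 300s figure is only an example; pick a value above your
worst-case WAL replay time.

```
# Example only: give PostgreSQL more time to replay WALs on start,
# e.g. via "crm configure edit PGSQL", changing:
op start timeout=60s interval=0
# to:
op start timeout=300s interval=0
```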
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On 29.04.2019 18:05, Ken Gaillot wrote:
> > > Why does it not check OCF_RESKEY_CRM_meta_notify?
> >
> > I was just not aware of this env variable. Sadly, it is not documented
> > anywhere :(
>
> It's not a Pacemaker-created value like the other notify variables --
> all user-specified meta-attributes are passed that way. We do need to
> document that.

OCF_RESKEY_CRM_meta_notify is passed also when the "notify" meta-attribute
is *not* specified, as well as a couple of others. But not all possible
attributes. And some OCF_RESKEY_CRM_meta_* variables that are passed do not
correspond to any user-settable and documented meta-attribute, like
OCF_RESKEY_CRM_meta_clone.

Yes, this needs documentation indeed...
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Sun, 2019-04-28 at 00:27 +0200, Jehan-Guillaume de Rorthais wrote:
> On Sat, 27 Apr 2019 09:15:29 +0300 Andrei Borzenkov wrote:
> > On 27.04.2019 1:04, Danka Ivanović wrote:
> > > Hi, here is the complete cluster configuration:
> > >
> > > node 1: master
> > > node 2: secondary
> > > primitive AWSVIP awsvip \
> > >     params secondary_private_ip=10.x.x.x api_delay=5
> > > primitive PGSQL pgsqlms \
> > >     params pgdata="/var/lib/postgresql/9.5/main"
> > >         bindir="/usr/lib/postgresql/9.5/bin"
> > >         pghost="/var/run/postgresql/"
> > >         recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
> > >         start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> > >     op start timeout=60s interval=0 \
> > >     op stop timeout=60s interval=0 \
> > >     op promote timeout=15s interval=0 \
> > >     op demote timeout=120s interval=0 \
> > >     op monitor interval=15s timeout=10s role=Master \
> > >     op monitor interval=16s timeout=10s role=Slave \
> > >     op notify timeout=60 interval=0
> > > primitive fencing-postgres-ha-2 stonith:external/ec2 \
> > >     params port=master \
> > >     op start interval=0s timeout=60s \
> > >     op monitor interval=360s timeout=60s \
> > >     op stop interval=0s timeout=60s
> > > primitive fencing-test-rsyslog stonith:external/ec2 \
> > >     params port=secondary \
> > >     op start interval=0s timeout=60s \
> > >     op monitor interval=360s timeout=60s \
> > >     op stop interval=0s timeout=60s
> > > ms PGSQL-HA PGSQL \
> > >     meta notify=true
> > > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> > > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote
> > >     AWSVIP:stop symmetrical=false
> > > location loc-fence-master fencing-postgres-ha-2 -inf: master
> > > location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> > > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote
> > >     AWSVIP:start symmetrical=false
> > > property cib-bootstrap-options: \
> > >     have-watchdog=false \
> > >     dc-version=1.1.14-70404b0 \
> > >     cluster-infrastructure=corosync \
> > >     cluster-name=psql-ha \
> > >     stonith-enabled=true \
> > >     no-quorum-policy=ignore \
> > >     last-lrm-refresh=1556315444 \
> > >     maintenance-mode=false
> > > rsc_defaults rsc-options: \
> > >     resource-stickiness=10 \
> > >     migration-threshold=2
> > >
> > > I tried to start postgres manually to be sure it is OK. There are no
> > > errors in the postgres log. I also tried with different meta
> > > parameters, but always with notify=true.
> > > I also tried this:
> > >
> > > ms PGSQL-HA PGSQL \
> > >     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> > >     notify=true interleave=true
> > >
> > > I have followed this link:
> > > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> > > When stonith was enabled and working, I imported all the other
> > > resources and constraints together at the same time.
> > >
> > > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais
> > > < j...@dalibo.com> wrote:
> > > > Hi,
> > > >
> > > > On Thu, 25 Apr 2019 18:57:55 +0200 Danka Ivanović wrote:
> > > > > Apr 25 16:39:50 [4213] master lrmd: notice: operation_finished:
> > > > > PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You must set meta
> > > > > parameter notify=true for your master resource ]
> > > >
> > > > The pgsqlms resource agent refuses to start PgSQL because your
> > > > configuration lacks the "notify=true" attribute in your master
> > > > definition.
> >
> > PAF's pgsqlms contains:
> >
> >     # check notify=true
> >     $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
> >                    --meta --get-parameter notify 2>/dev/null };
> >     chomp $ans;
> >     unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
> >         ocf_exit_reason(
> >             'You must set meta parameter notify=true for your master resource'
> >         );
> >         exit $OCF_ERR_INSTALLED;
> >     }
> >
> > but that is wrong - "notify" is set on the ms definition, while
> > $OCF_RESOURCE_INSTANCE refers to an individual clone member. There is
> > no notify option on the PGSQL primitive.
>
> Interesting... and disturbing. I wonder why I never hit a bug related to
> this after so many tests on various OSes and a bunch of running clusters
> in various environments. Plus, it hasn't been reported before by anyone.
>
> Is it possible that the clone members inherit this from the master
> definition, or that "crm_resource" looks at this higher level?

That's correct. For clone/master/group/bundle resources, setting
meta-attributes on the collective resource makes them effective for the
inner resources as well. So I don't think that's causing any issues here.

> If I set a meta attr
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Sat, 27 Apr 2019 09:15:29 +0300 Andrei Borzenkov wrote: > 27.04.2019 1:04, Danka Ivanović пишет: > > Hi, here is a complete cluster configuration: > > > > node 1: master > > node 2: secondary > > primitive AWSVIP awsvip \ > > params secondary_private_ip=10.x.x.x api_delay=5 > > primitive PGSQL pgsqlms \ > > params pgdata="/var/lib/postgresql/9.5/main" > > bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" > > recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" > > start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \ > > op start timeout=60s interval=0 \ > > op stop timeout=60s interval=0 \ > > op promote timeout=15s interval=0 \ > > op demote timeout=120s interval=0 \ > > op monitor interval=15s timeout=10s role=Master \ > > op monitor interval=16s timeout=10s role=Slave \ > > op notify timeout=60 interval=0 > > primitive fencing-postgres-ha-2 stonith:external/ec2 \ > > params port=master \ > > op start interval=0s timeout=60s \ > > op monitor interval=360s timeout=60s \ > > op stop interval=0s timeout=60s > > primitive fencing-test-rsyslog stonith:external/ec2 \ > > params port=secondary \ > > op start interval=0s timeout=60s \ > > op monitor interval=360s timeout=60s \ > > op stop interval=0s timeout=60s > > ms PGSQL-HA PGSQL \ > > meta notify=true > > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master > > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop > > symmetrical=false > > location loc-fence-master fencing-postgres-ha-2 -inf: master > > location loc-fence-secondary fencing-test-rsyslog -inf: secondary > > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start > > symmetrical=false > > property cib-bootstrap-options: \ > > have-watchdog=false \ > > dc-version=1.1.14-70404b0 \ > > cluster-infrastructure=corosync \ > > cluster-name=psql-ha \ > > stonith-enabled=true \ > > no-quorum-policy=ignore \ > > last-lrm-refresh=1556315444 \ > > maintenance-mode=false > 
> rsc_defaults rsc-options: \ > > resource-stickiness=10 \ > > migration-threshold=2 > > > > I tried to start manually postgres to be sure it is ok. There are no error > > in postgres log. I also tried with different meta parameters, but always > > with notify=true. > > I also tried this: > > ms PGSQL-HA PGSQL \ > > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 > > notify=true interleave=true > > I have followed this link: > > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html > > When stonith is enabled and working I imported all other resources and > > constraints all together in the same time. > > > > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais > > wrote: > > > >> Hi, > >> > >> On Thu, 25 Apr 2019 18:57:55 +0200 > >> Danka Ivanović wrote: > >> > >>> Apr 25 16:39:50 [4213] master lrmd: notice: > >>> operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You > >>> must set meta parameter notify=true for your master resource ] > >> > >> Resource agent pgsqlms refuse to start PgSQL because your configuration > >> lacks > >> the "notify=true" attribute in your master definition. > >> > > PAF pgsqlms contains: > > # check notify=true > $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\ > --meta --get-parameter notify 2>/dev/null }; > chomp $ans; > unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) { > ocf_exit_reason( > 'You must set meta parameter notify=true for your master > resource' > ); > exit $OCF_ERR_INSTALLED; > } > > but that is wrong - "notify" is set on ms definition, while > $OCF_RESOURCE_INSTANCE refers to individual clone member. There is no > notify option on PGSQL primitive. Interesting...and disturbing. I wonder why I never faced a bug related to this after so many tests in various OS and a bunch of running clusters in various environments. Plus, it hasn't been reported sooner by anyone. 
Is it possible the clone members inherit this from the master definition,
or that "crm_resource" looks at this higher level? If I set a meta attribute
at master level, it appears on clones as well:

> crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
pgsql-ha is active on more than one node, returning the default value for clone-max
Attribute 'clone-max' not found for 'pgsql-ha'
Error performing operation: No such device or address

> crm_resource --resource pgsqld --meta --get-parameter=clone-max
Attribute 'clone-max' not found for 'pgsqld:0'
Error performing operation: No such device or address

> crm_resource --resource=pgsql-ha --meta --set-parameter=clone-max \
    --parameter-value=3
Set 'pgsql-ha' option: id=pgsql-h
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
27.04.2019 1:04, Danka Ivanović пишет: > Hi, here is a complete cluster configuration: > > node 1: master > node 2: secondary > primitive AWSVIP awsvip \ > params secondary_private_ip=10.x.x.x api_delay=5 > primitive PGSQL pgsqlms \ > params pgdata="/var/lib/postgresql/9.5/main" > bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" > recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" > start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \ > op start timeout=60s interval=0 \ > op stop timeout=60s interval=0 \ > op promote timeout=15s interval=0 \ > op demote timeout=120s interval=0 \ > op monitor interval=15s timeout=10s role=Master \ > op monitor interval=16s timeout=10s role=Slave \ > op notify timeout=60 interval=0 > primitive fencing-postgres-ha-2 stonith:external/ec2 \ > params port=master \ > op start interval=0s timeout=60s \ > op monitor interval=360s timeout=60s \ > op stop interval=0s timeout=60s > primitive fencing-test-rsyslog stonith:external/ec2 \ > params port=secondary \ > op start interval=0s timeout=60s \ > op monitor interval=360s timeout=60s \ > op stop interval=0s timeout=60s > ms PGSQL-HA PGSQL \ > meta notify=true > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop > symmetrical=false > location loc-fence-master fencing-postgres-ha-2 -inf: master > location loc-fence-secondary fencing-test-rsyslog -inf: secondary > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start > symmetrical=false > property cib-bootstrap-options: \ > have-watchdog=false \ > dc-version=1.1.14-70404b0 \ > cluster-infrastructure=corosync \ > cluster-name=psql-ha \ > stonith-enabled=true \ > no-quorum-policy=ignore \ > last-lrm-refresh=1556315444 \ > maintenance-mode=false > rsc_defaults rsc-options: \ > resource-stickiness=10 \ > migration-threshold=2 > > I tried to start manually postgres to be sure it is ok. 
> There are no errors
> in the postgres log. I also tried with different meta parameters, but always
> with notify=true.
> I also tried this:
>
> ms PGSQL-HA PGSQL \
>     meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
>     notify=true interleave=true
>
> I have followed this link:
> https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> Once stonith was enabled and working, I imported all the other resources
> and constraints together at the same time.
>
> On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais
> wrote:
>
>> Hi,
>>
>> On Thu, 25 Apr 2019 18:57:55 +0200
>> Danka Ivanović wrote:
>>
>>> Apr 25 16:39:50 [4213] master lrmd: notice:
>>> operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
>>> must set meta parameter notify=true for your master resource ]
>>
>> The resource agent pgsqlms refuses to start PgSQL because your
>> configuration lacks the "notify=true" attribute in your master definition.

PAF pgsqlms contains:

    # check notify=true
    $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
        --meta --get-parameter notify 2>/dev/null };
    chomp $ans;
    unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
        ocf_exit_reason(
            'You must set meta parameter notify=true for your master resource'
        );
        exit $OCF_ERR_INSTALLED;
    }

but that is wrong - "notify" is set on the ms definition, while
$OCF_RESOURCE_INSTANCE refers to an individual clone member. There is no
notify option on the PGSQL primitive. Why does it not check
OCF_RESKEY_CRM_meta_notify?

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
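[Editor's note] The alternative check Andrei suggests can be sketched as a small shell fragment. This is a hypothetical illustration, not the actual PAF code; the only assumption taken from Pacemaker is that it exports resource meta-attributes to agents as OCF_RESKEY_CRM_meta_* environment variables, so the agent can read the value without shelling out to crm_resource:

```shell
# Sketch only -- not the actual PAF code. Validate the notify meta
# attribute from the OCF_RESKEY_CRM_meta_notify environment variable
# that Pacemaker exports for every agent action. Accepts the same
# spellings pgsqlms accepts (true/on/yes/y/1, case-insensitive).
check_notify() {
    val=$(printf '%s' "${OCF_RESKEY_CRM_meta_notify:-}" | tr '[:upper:]' '[:lower:]')
    case "$val" in
        true|on|yes|y|1) return 0 ;;
        *)               return 1 ;;
    esac
}

# Pacemaker would set this in the agent's environment:
OCF_RESKEY_CRM_meta_notify=True
if check_notify; then
    echo "notify enabled"
else
    # pgsqlms exits with OCF_ERR_INSTALLED (5) in this case
    echo "ocf-exit-reason:You must set meta parameter notify=true for your master resource" >&2
fi
```

The upside of this approach is that it avoids a fork to crm_resource on every action; whether the environment variable is reliably present on probe operations would need checking against the Pacemaker version in use.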
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi, here is a complete cluster configuration:

node 1: master
node 2: secondary
primitive AWSVIP awsvip \
    params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
    params pgdata="/var/lib/postgresql/9.5/main" \
        bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" \
        recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" \
        start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op promote timeout=15s interval=0 \
    op demote timeout=120s interval=0 \
    op monitor interval=15s timeout=10s role=Master \
    op monitor interval=16s timeout=10s role=Slave \
    op notify timeout=60 interval=0
primitive fencing-postgres-ha-2 stonith:external/ec2 \
    params port=master \
    op start interval=0s timeout=60s \
    op monitor interval=360s timeout=60s \
    op stop interval=0s timeout=60s
primitive fencing-test-rsyslog stonith:external/ec2 \
    params port=secondary \
    op start interval=0s timeout=60s \
    op monitor interval=360s timeout=60s \
    op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
    meta notify=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
location loc-fence-master fencing-postgres-ha-2 -inf: master
location loc-fence-secondary fencing-test-rsyslog -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.14-70404b0 \
    cluster-infrastructure=corosync \
    cluster-name=psql-ha \
    stonith-enabled=true \
    no-quorum-policy=ignore \
    last-lrm-refresh=1556315444 \
    maintenance-mode=false
rsc_defaults rsc-options: \
    resource-stickiness=10 \
    migration-threshold=2

I tried to start postgres manually to be sure it is OK. There are no errors
in the postgres log. I also tried with different meta parameters, but always
with notify=true.
I also tried this:

ms PGSQL-HA PGSQL \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true interleave=true

I have followed this link:
https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
Once stonith was enabled and working, I imported all the other resources and
constraints together at the same time.

On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais wrote:

> Hi,
>
> On Thu, 25 Apr 2019 18:57:55 +0200
> Danka Ivanović wrote:
>
> > Apr 25 16:39:50 [4213] master lrmd: notice:
> > operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> > must set meta parameter notify=true for your master resource ]
>
> The resource agent pgsqlms refuses to start PgSQL because your
> configuration lacks the "notify=true" attribute in your master definition.
>
> Could you please share your full Pacemaker configuration?
>
> Regards,

--
Pozdrav
Danka Ivanovic
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

On Thu, 25 Apr 2019 18:57:55 +0200
Danka Ivanović wrote:

> Apr 25 16:39:50 [4213] master lrmd: notice:
> operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> must set meta parameter notify=true for your master resource ]

The resource agent pgsqlms refuses to start PgSQL because your configuration
lacks the "notify=true" attribute in your master definition.

Could you please share your full Pacemaker configuration?

Regards,
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

Here are the logs from when pacemaker fails to start the postgres service on
the master; it manages to start only the postgres slave. I tried different
configurations with the pgsqlms and pgsql resource agents. These errors are
from when I use the pgsqlms agent, whose configuration I sent in the first
mail:

Apr 25 16:40:23 [4213] master lrmd: info: log_execute: executing - rsc:PGSQL
action:start call_id:51 launching as "postgres" command
"/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata /var/lib/postgresql/9.5/main -w
--timeout 120 start -o -c config_file=/etc/postgresql/9.5/main/postgresql.conf"
Apr 25 16:40:24 [4211] master cib: info: cib_perform_op: +
/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='PGSQL']/lrm_rsc_op[@id='PGSQL_last_0']:
@operation_key=PGSQL_start_0, @operation=start,
@transition-key=12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@transition-magic=0:0;12:30:0:078c2b66-b095-49c4-947b-2427dd7852bf,
@call-id=176, @rc-code=0, @exec-time=1146, @queue-time=0
Apr 25 16:40:53 [4216] master crmd: debug: crm_timer_start: Started Shutdown
Escalation (I_STOP:120ms), src=53
Apr 25 16:41:23 [4213] master lrmd: warning: child_timeout_callback:
PGSQL_start_0 process (PID 5986) timed out

Part of the log is attached.

On Tue, 23 Apr 2019 at 17:28, Danka Ivanović wrote:

> Hi,
> It seems that an ldap timeout caused the cluster failure. The cluster
> checks status every 15s on the master and 16s on the slave. The cluster
> needs the postgres user for authentication, but with ldap the user is
> queried first on the ldap server and only then locally on the host. When
> the connection to the ldap server was interrupted, the cluster couldn't
> find the postgres user and authenticate on the db to check its state. The
> problem is solved by reconfiguring /etc/ldap.conf and /etc/nslcd.conf: the
> variable nss_initgroups_ignoreusers was added, listing the local users
> which should be ignored when querying the ldap server. Thanks for your
> help. :)
> Another problem is that I cannot start the postgres master with pacemaker.
> When I start postgres manually (with systemd) and then start pacemaker on > slave, pacemaker is able to recognize master and start slave and failover > works. > That is another problem which I didn't manage to solve. Should I send a > new mail for that issue or we can continue in this thread? > > On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais > wrote: > >> On Fri, 19 Apr 2019 17:26:14 +0200 >> Danka Ivanović wrote: >> ... >> > Should I change any of those timeout parameters in order to avoid >> timeout? >> >> You can try to raise the timeout, indeed. But as far as we don't know >> **why** >> your VMs froze for some time, it is difficult to guess how high should be >> these timeouts. >> >> Not to mention that it will raise your RTO. >> > > > -- > Pozdrav > Danka Ivanovic > -- Pozdrav Danka Ivanovic Apr 25 16:39:50 [4211] mastercib:debug: crm_client_new: Connecting 0x55d8444e8e80 for uid=0 gid=0 pid=5791 id=c93d535d-77d8-4556-9a63-d9a1c2b45de9 Apr 25 16:39:50 [4211] mastercib:debug: handle_new_connection: IPC credentials authenticated (4211-5791-13) Apr 25 16:39:50 [4211] mastercib:debug: qb_ipcs_shm_connect: connecting to client [5791] Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096 Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096 Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096 Apr 25 16:39:50 [4211] mastercib:debug: cib_acl_enabled:CIB ACL is disabled Apr 25 16:39:50 [4211] mastercib:debug: qb_ipcs_dispatch_connection_request:HUP conn (4211-5791-13) Apr 25 16:39:50 [4211] mastercib:debug: qb_ipcs_disconnect: qb_ipcs_disconnect(4211-5791-13) state:2 Apr 25 16:39:50 [4211] mastercib:debug: crm_client_destroy: Destroying 0 events Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-response-4211-5791-13-header Apr 25 16:39:50 [4211] 
mastercib:debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-event-4211-5791-13-header Apr 25 16:39:50 [4211] mastercib:debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-cib_rw-request-4211-5791-13-header Apr 25 16:39:50 [15544] master corosync debug [QB] IPC credentials authenticated (15544-5837-24) Apr 25 16:39:50 [15544] master corosync debug [QB] connecting to client [5837] Apr 25 16:39:50 [15544] master corosync debug [QB] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 25 16:39:50 [15544] master corosync debug [QB] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 25 16:39:50 [15544] master corosync debug [QB] shm size:1048589; real_size:1052672; rb->word_si
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

It seems that an ldap timeout caused the cluster failure. The cluster checks
status every 15s on the master and 16s on the slave. The cluster needs the
postgres user for authentication, but with ldap the user is queried first on
the ldap server and only then locally on the host. When the connection to the
ldap server was interrupted, the cluster couldn't find the postgres user and
authenticate on the db to check its state. The problem is solved by
reconfiguring /etc/ldap.conf and /etc/nslcd.conf: the variable
nss_initgroups_ignoreusers was added, listing the local users which should be
ignored when querying the ldap server. Thanks for your help. :)

Another problem is that I cannot start the postgres master with pacemaker.
When I start postgres manually (with systemd) and then start pacemaker on the
slave, pacemaker is able to recognize the master, start the slave, and
failover works. That is another problem which I didn't manage to solve.
Should I send a new mail for that issue, or can we continue in this thread?

On Fri, 19 Apr 2019 at 19:19, Jehan-Guillaume de Rorthais wrote:

> On Fri, 19 Apr 2019 17:26:14 +0200
> Danka Ivanović wrote:
> ...
> > Should I change any of those timeout parameters in order to avoid
> > timeout?
>
> You can try to raise the timeout, indeed. But as far as we don't know
> **why** your VMs froze for some time, it is difficult to guess how high
> these timeouts should be.
>
> Not to mention that it will raise your RTO.

--
Pozdrav
Danka Ivanovic
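[Editor's note] The nslcd.conf change described above might look like the fragment below. This is a sketch, not the poster's actual file: the option name comes from nss-pam-ldapd, but the server URIs and the exact user list are illustrative and site-specific.

```
# /etc/nslcd.conf (sketch -- URIs and user list are illustrative)
uri ldap://ldap1.example.com/ ldap://ldap2.example.com/

# Do not ask LDAP for supplementary groups of these local accounts, so
# the cluster's 15s monitor (running as the postgres user) no longer
# blocks on LDAP availability:
nss_initgroups_ignoreusers postgres,hacluster,root
```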
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Fri, 19 Apr 2019 17:26:14 +0200
Danka Ivanović wrote:
...
> Should I change any of those timeout parameters in order to avoid timeout?

You can try to raise the timeout, indeed. But as far as we don't know **why**
your VMs froze for some time, it is difficult to guess how high these
timeouts should be.

Not to mention that it will raise your RTO.
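[Editor's note] For reference, raising an operation timeout on the PGSQL primitive could look like the fragment below, e.g. via "crm configure edit PGSQL". The 30s values are purely illustrative, not a recommendation, and must be weighed against the RTO concern raised above:

```
op monitor interval=15s timeout=30s role=Master \
op monitor interval=16s timeout=30s role=Slave
```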
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Here is the command output from crm configure show:

node 1: master \
    attributes master-PGSQL=1001
node 2: secondary \
    attributes master-PGSQL=1000
primitive AWSVIP awsvip \
    params secondary_private_ip=10.x.x.x api_delay=5
primitive PGSQL pgsqlms \
    params pgdata="/var/lib/postgresql/9.5/main" \
        bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" \
        recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" \
        start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op promote timeout=15s interval=0 \
    op demote timeout=120s interval=0 \
    op monitor interval=15s timeout=10s role=Master \
    op monitor interval=16s timeout=10s role=Slave \
    op notify timeout=60 interval=0
primitive fencing-master stonith:external/ec2 \
    params port=master \
    op start interval=0s timeout=60s \
    op monitor interval=360s timeout=60s \
    op stop interval=0s timeout=60s
primitive fencing-secondary stonith:external/ec2 \
    params port=secondary \
    op start interval=0s timeout=60s \
    op monitor interval=360s timeout=60s \
    op stop interval=0s timeout=60s
ms PGSQL-HA PGSQL \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true interleave=true
colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
location loc-fence-master fencing-master -inf: master
location loc-fence-secondary fencing-secondary -inf: secondary
order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.14-70404b0 \
    cluster-infrastructure=corosync \
    cluster-name=pgc-psql-ha \
    stonith-enabled=true \
    no-quorum-policy=ignore \
    maintenance-mode=false \
    last-lrm-refresh=1551885417
rsc_defaults rsc-options: \
    resource-stickiness=10 \
    migration-threshold=1

Should I change any of those timeout parameters in order to avoid the timeout?
On Fri, 19 Apr 2019 at 12:23, Danka Ivanović wrote: > Thanks for the clarification about failure-timeout, migration threshold > and pacemaker. > Instances are hosted on AWS cloud, and they are in the same security > groups and availability zones. > I don't have information about hardware which hosts those VMs since they > are non dedicated. UTC timezone is configured on both machines and default > ntp configuration. > remote refid st t when poll reach delay offset > jitter > > == > 0.ubuntu.pool.n .POOL. 16 p- 6400.0000.000 > 0.000 > 1.ubuntu.pool.n .POOL. 16 p- 6400.0000.000 > 0.000 > 2.ubuntu.pool.n .POOL. 16 p- 6400.0000.000 > 0.000 > 3.ubuntu.pool.n .POOL. 16 p- 6400.0000.000 > 0.000 > ntp.ubuntu.com .POOL. 16 p- 6400.0000.000 > 0.000 > +198.46.223.227 204.9.54.119 2 u 65 512 377 22.3180.096 > 1.111 > -time1.plumdev.n .GPS.1 u 116 512 377 72.4871.386 > 0.544 > -199.180.133.100 140.142.2.8 3 u 839 1024 377 65.574 -1.199 > 1.167 > +helium.constant 128.59.0.245 2 u 217 512 3777.3680.952 > 0.090 > *i.will.not.be.e 213.251.128.249 2 u 207 512 377 14.7331.185 > 0.305 > > > On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais > wrote: > >> On Fri, 19 Apr 2019 11:08:33 +0200 >> Danka Ivanović wrote: >> >> > Hi, >> > Thank you for your response. >> > >> > Ok, It seems that fencing resources and secondary timed out at the same >> > time, together with ldap. >> > I understand that because of "migration-threshold=1", standby tried to >> > recover just once and then was stopped. Is this ok, or the threshold >> should >> > be increased? >> >> It depend on your usecase really. >> >> Note that as soon as a resource hit migration threashold, there's an >> implicit >> constraint forbidding it to come back on this node until you reset the >> failcount. That's why your pgsql master resource never came back anywhere. >> >> You can as well set failure-timeout if you are brave enough to automate >> the >> failure reset. 
See: >> >> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html >> >> > Master server is started with systmectl, then pacemaker is started on >> > master, which detects master and then when starting pacemaker on >> secondary >> > it brings up postgres service in slave mode. >> >> You should not. Systemd should not mess with resources handled by >> Pacemaker. >> >> > I didn't manage to start postgres master over pacemaker. I tested >> > failover with setup like this and it works. I will try to setup >> postgres to >> > be run with pacemaker, >> >> Pacemaker is suppose to start the resource itself if it is enabled in its >> setup. Look at this whole chapter (its end is importa
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Thanks for the clarification about failure-timeout, migration-threshold and
pacemaker. The instances are hosted on the AWS cloud, and they are in the
same security groups and availability zones. I don't have information about
the hardware which hosts those VMs since they are not dedicated. The UTC
timezone and the default ntp configuration are used on both machines.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111
-time1.plumdev.n .GPS.            1 u  116  512  377   72.487    1.386   0.544
-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167
+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090
*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305

On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais wrote:

> On Fri, 19 Apr 2019 11:08:33 +0200
> Danka Ivanović wrote:
>
> > Hi,
> > Thank you for your response.
> >
> > Ok, it seems that the fencing resources and the secondary timed out at
> > the same time, together with ldap.
> > I understand that because of "migration-threshold=1", the standby tried
> > to recover just once and then was stopped. Is this ok, or should the
> > threshold be increased?
>
> It depends on your use case, really.
>
> Note that as soon as a resource hits the migration threshold, there's an
> implicit constraint forbidding it to come back on this node until you
> reset the failcount. That's why your pgsql master resource never came back
> anywhere.
>
> You can as well set failure-timeout if you are brave enough to automate
> the failure reset.
See: > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html > > > Master server is started with systmectl, then pacemaker is started on > > master, which detects master and then when starting pacemaker on > secondary > > it brings up postgres service in slave mode. > > You should not. Systemd should not mess with resources handled by > Pacemaker. > > > I didn't manage to start postgres master over pacemaker. I tested > > failover with setup like this and it works. I will try to setup postgres > to > > be run with pacemaker, > > Pacemaker is suppose to start the resource itself if it is enabled in its > setup. Look at this whole chapter (its end is important): > > https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster > > > but I am concerned about those timeouts which > > caused cluster to crash. Can you help me investigate why this happened or > > what should I change in order to avoid it? For aws virtual ip is used AWS > > secondary IP. > > Really I can't help on this. It looks like suddenly both VMs froze most of > their processes, or maybe some kind of clock jump, exhausting the > timeouts...I > really don't know. > > It sounds more related to your virtualization stack I suppose. Maybe some > kind > of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to your > VMs > for too long? > > This is surprising both VM had timeouts in almost the same time. Do you > know if > they are on the same hypervisor host? If they do, this is a SPoF: you > should > move one of them in another host. 
> > ++ > > > On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais < > j...@dalibo.com> > > wrote: > > > > > On Thu, 18 Apr 2019 14:19:44 +0200 > > > Danka Ivanović wrote: > > > > > > > > > > > > It seems you had timeout for both fencing resources and your standby in > > > the same > > > time here: > > > > > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op > > > > monitor for fencing-secondary on master: unknown error (1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op > > > > monitor for fencing-master on secondary: unknown error (1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op > > > > monitor for PGSQL:1 on secondary: unknown error (1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing > fencing-secondary > > > > away from master after 1 failures (max=1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing > fencing-master > > > away > > > > from secondary after 1 failures (max=1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA > away > > > from > > > > secondary after 1 failures (max=1) > > > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA > away > > > from > > > > seco
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Fri, 19 Apr 2019 11:08:33 +0200
Danka Ivanović wrote:

> Hi,
> Thank you for your response.
>
> Ok, it seems that the fencing resources and the secondary timed out at the
> same time, together with ldap.
> I understand that because of "migration-threshold=1", the standby tried to
> recover just once and then was stopped. Is this ok, or should the
> threshold be increased?

It depends on your use case, really.

Note that as soon as a resource hits the migration threshold, there's an
implicit constraint forbidding it to come back on this node until you reset
the failcount. That's why your pgsql master resource never came back
anywhere.

You can as well set failure-timeout if you are brave enough to automate the
failure reset. See:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html

> The master server is started with systemctl, then pacemaker is started on
> the master, which detects the master, and then starting pacemaker on the
> secondary brings up the postgres service in slave mode.

You should not. Systemd should not mess with resources handled by Pacemaker.

> I didn't manage to start the postgres master over pacemaker. I tested
> failover with a setup like this and it works. I will try to set up
> postgres to be run with pacemaker,

Pacemaker is supposed to start the resource itself if it is enabled in its
setup. Look at this whole chapter (its end is important):
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster

> but I am concerned about those timeouts which caused the cluster to
> crash. Can you help me investigate why this happened, or what I should
> change in order to avoid it? For the aws virtual ip, an AWS secondary IP
> is used.

Really, I can't help on this. It looks like suddenly both VMs froze most of
their processes, or maybe some kind of clock jump exhausted the timeouts... I
really don't know.

It sounds more related to your virtualization stack, I suppose. Maybe some
kind of "hot" backup?
Maybe the hypervisor didn't schedule enough CPU to your VMs for too long? It
is surprising that both VMs had timeouts at almost the same time. Do you know
if they are on the same hypervisor host? If they are, this is a SPoF: you
should move one of them to another host.

++

> On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais wrote:
>
> > On Thu, 18 Apr 2019 14:19:44 +0200
> > Danka Ivanović wrote:
> >
> > It seems you had a timeout for both fencing resources and your standby
> > at the same time here:
> >
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:1 on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > > away from master after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> > > away from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> >
> > Because you have "migration-threshold=1", the standby will be shut down:
> >
> > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
> >
> > The transition is stopped because the pgsql master timed out in the
> > meantime:
> >
> > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
> >
> > and as you mentioned, your ldap as well:
> >
> > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] ldap_result() timed out
> >
> > Here are the four timeout errors (2 fencings and 2 pgsql instances):
> >
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:0 on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:1 on secondary: unknown error (1)
> >
> > As a reaction, Pacemaker decides to stop everything because it cannot
> > move resources anywhere:
> >
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,
Thank you for your response.

Ok, it seems that fencing resources and secondary timed out at the same time,
together with ldap. I understand that because of "migration-threshold=1",
standby tried to recover just once and then was stopped. Is this ok, or
should the threshold be increased?

Master server is started with systemctl, then pacemaker is started on master,
which detects master, and then starting pacemaker on secondary brings up the
postgres service in slave mode. I didn't manage to start postgres master over
pacemaker. I tested failover with a setup like this and it works. I will try
to set up postgres to be run with pacemaker, but I am concerned about those
timeouts which caused the cluster to crash. Can you help me investigate why
this happened or what should I change in order to avoid it? For the virtual
IP, an AWS secondary IP is used.

Link to the awsvip resource agent:
https://raw.githubusercontent.com/ClusterLabs/resource-agents/master/heartbeat/awsvip
Link to the ec2 stonith resource agent:
https://raw.githubusercontent.com/ClusterLabs/cluster-glue/master/lib/plugins/stonith/external/ec2

Command output when cluster works: crm status

Output:
Stack: corosync
Current DC: postgres-ha-1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 5 resources configured

Online: [ postgres-ha-1 postgres-ha-2 ]

Full list of resources:
 AWSVIP (ocf::heartbeat:awsvip): Started postgres-ha-1
 Master/Slave Set: PGSQL-HA [PGSQL]
     Masters: [ postgres-ha-1 ]
     Slaves: [ postgres-ha-2 ]
 fencing-postgres-ha-1 (stonith:external/ec2): Started postgres-ha-2
 fencing-postgres-ha-2 (stonith:external/ec2): Started postgres-ha-1

On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais wrote:

> On Thu, 18 Apr 2019 14:19:44 +0200
> Danka Ivanović wrote:
>
> It seems you had a timeout for both fencing resources and your standby at
> the same time here:
>
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
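For readers reconstructing this setup, a multi-state resource like the PGSQL-HA set in the `crm status` output above is typically defined with PAF's pgsqlms agent. The sketch below is hypothetical, modeled on the PAF quick-start documentation and the pcs 0.9 syntax of that era; the bindir/pgdata paths and every timeout are assumptions, and only the agent name and the 15s monitor interval (PGSQL_monitor_15000 in the logs) come from this thread.

```shell
# Hedged sketch of a PAF multi-state resource definition (pcs 0.9 syntax).
# Paths and timeouts are illustrative assumptions, not the poster's config.
pcs resource create PGSQL ocf:heartbeat:pgsqlms \
    bindir=/usr/lib/postgresql/9.5/bin \
    pgdata=/etc/postgresql/9.5/main \
    op start timeout=60s \
    op stop timeout=60s \
    op promote timeout=30s \
    op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    op notify timeout=60s \
    --master notify=true
```

Note that `notify=true` on the master clone is required by pgsqlms, and the two monitor intervals must differ so Pacemaker can distinguish the Master and Slave monitor operations.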
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> > away from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
>
> Because you have "migration-threshold=1", the standby will be shut down:
>
> > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
>
> The transition is stopped because the pgsql master timed out in the
> meantime:
>
> > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
>
> and as you mentioned, your ldap as well:
>
> > Apr 17 10:03:40 master nslcd[1518]: [d7e446] ldap_result() timed out
>
> Here are the four timeout errors (2 fencings and 2 pgsql instances):
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-secondary on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:0 on master: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for fencing-master on secondary: unknown error (1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > monitor for PGSQL:1 on secondary: unknown error (1)
>
> As a reaction, Pacemaker decides to stop everything because it cannot move
> resources anywhere:
>
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > away from master after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> > away from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > from secondary after 1 failures (max=1)
> > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> > Stopped master)
> > Apr 17 10:03:40 master peng
Re: [ClusterLabs] Fwd: Postgres pacemaker cluster failure
On Thu, 18 Apr 2019 14:19:44 +0200 Danka Ivanović wrote:

It seems you had a timeout for both fencing resources and your standby at the
same time here:

> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)

Because you have "migration-threshold=1", the standby will be shut down:

> Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

The transition is stopped because the pgsql master timed out in the meantime:

> Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> Pending=0, Fired=0, Skipped=1, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped

and as you mentioned, your ldap as well:

> Apr 17 10:03:40 master nslcd[1518]: [d7e446] ldap_result() timed out

Here are the four timeout errors (2 fencings and 2 pgsql instances):

> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-secondary on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:0 on master: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for fencing-master on secondary: unknown error (1)
> Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> monitor for PGSQL:1 on secondary: unknown error (1)

As a reaction, Pacemaker decides to stop everything because it cannot move
resources anywhere:

> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> away from master after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away
> from secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from
> secondary after 1 failures (max=1)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> Stopped master)
> Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)

Now, the following lines are really not expected. Why does systemd detect
that PostgreSQL stopped?

> Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> process exited, code=exited status=2
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> entered failed state.
> Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with
> result 'exit-code'.

I suspect the service is still enabled or has been started by hand. As soon
as you set up a resource in Pacemaker, admins should **always** ask Pacemaker
to start/stop it. Never use systemctl to handle the resource yourself. You
must disable this service in systemd.
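Taking the systemd unit out of the picture, and the "unmanage before touching it by hand" advice from elsewhere in this thread, could look like the sketch below. The unit name comes from the syslog excerpt above; the crm shell commands and the pg_ctl data-directory path are assumptions for a typical Debian/Ubuntu 9.5 install.

```shell
# Make sure systemd never starts or stops this PostgreSQL instance itself
# (unit name taken from the syslog lines above):
systemctl disable postgresql@9.5-main.service

# If a manual stop/start is truly unavoidable, unmanage the Pacemaker
# resource first, act with pg_ctl directly, then hand control back:
crm resource unmanage PGSQL-HA
sudo -u postgres /usr/lib/postgresql/9.5/bin/pg_ctl \
    -D /var/lib/postgresql/9.5/main stop   # data dir path is an assumption
crm resource manage PGSQL-HA
```

While a resource is unmanaged, Pacemaker keeps monitoring it but takes no action, so remember to re-manage it or monitor failures will pile up without recovery.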
++

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Hi,

Can you help me with troubleshooting a postgres pacemaker cluster failure?
Today the cluster failed without promoting the secondary to master. At the
same time an ldap timeout appeared. Here are the logs; the master was stopped
by pacemaker at 10:03:40 AM UTC. Thank you in advance.

corosync.log

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave
secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461:
/var/lib/pacemaker/pengine/pe-input-58.bz2
Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for fencing-secondary on master: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for fencing-master on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor
for PGSQL:1 on secondary: unknown error (1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
away from master after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away
from secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from
secondary after 1 failures (max=1)
Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462:
/var/lib/pacemaker/pengine/pe-input-59.bz2
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process
(PID 32372) timed out
Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 -
timed out after 1ms
Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by
PGSQL_monitor_15000 'modify' on master: Old event
(magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14,
source=process_graph_event:605, 0)
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cpg-request-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated
(23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400]
Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589;
real_size:1052672; rb->word_size:263168
Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ]
shm size:1048589; real_size:1052672; rb->word_size:263168]
Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25)
Apr 17 10:03:40 master corosync[23321]: [QB ]
qb_ipcs_disconnect(23321-32400-25) state:2
Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file
descriptor (9)
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-response-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-event-23321-32400-25-header
Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer:
/dev/shm/qb-cmap-request-23321-32400-25-header
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata:
found: {
Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify:
environment variables: {
Apr 17