Re: " Does slave have master score? Your logs show only one node with master. To select another node as new master it needs non-zero master score as well."
Yes, it does. I included the corosync.log file from the DC node earlier, but the corosync log file from the other node shows messages like the following:

Aug 13 05:59:50 [16795] mgraid-16201289RN00023-1 pengine: debug: master_color: SS16201289RN00023:1 master score: 500

Regards,
Michael

-----Original Message-----
From: Users <users-boun...@clusterlabs.org> On Behalf Of users-requ...@clusterlabs.org
Sent: Monday, August 12, 2019 9:38 PM
To: users@clusterlabs.org
Subject: [EXTERNAL] Users Digest, Vol 55, Issue 24

Send Users mailing list submissions to users@clusterlabs.org

To subscribe or unsubscribe via the World Wide Web, visit https://lists.clusterlabs.org/mailman/listinfo/users or, via email, send a message with subject or body 'help' to users-requ...@clusterlabs.org

You can reach the person managing the list at users-ow...@clusterlabs.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of Users digest..."

Today's Topics:

   1. Re: [EXTERNAL] Users Digest, Vol 55, Issue 19 (Andrei Borzenkov)

----------------------------------------------------------------------

Message: 1
Date: Tue, 13 Aug 2019 07:37:47 +0300
From: Andrei Borzenkov <arvidj...@gmail.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Venkata Reddy Chappavarapu <venkata.chappavar...@harmonicinc.com>
Subject: Re: [ClusterLabs] [EXTERNAL] Users Digest, Vol 55, Issue 19
Message-ID: <f7b3f2d1-95e8-4b65-aa72-4074b5944...@gmail.com>
Content-Type: text/plain; charset="utf-8"

Sent from my iPhone

> On 13 Aug 2019, at 0:17, Michael Powell <michael.pow...@harmonicinc.com> wrote:
>
> Yes, I have tried that. I used crm_resource --meta -p resource-stickiness -v 0 -r SS16201289RN00023 to disable resource stickiness and then kill -9 <pid> to kill the application associated with the master resource.
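The per-node master scores discussed above can also be inspected from the command line. A sketch, assuming Pacemaker 1.1 tooling and the usual master-<resource> transient node attribute convention; node and resource names are the ones from this thread:

```shell
# Show the scores the policy engine computes against the live CIB,
# including master (promotion) scores per node:
crm_simulate -sL | grep -i master

# Query one node's master score directly; master scores are stored as
# transient (reboot-lifetime) node attributes named "master-<resource>":
crm_attribute -N mgraid-16201289RN00023-1 -l reboot \
    -n master-SS16201289RN00023 -G -Q
```

Depending on whether the clone is globally unique, the attribute may carry an instance suffix (e.g. master-SS16201289RN00023:1); crm_simulate's score output shows the exact names in use.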
> The results are the same: the slave resource remains a slave while the failed resource is restarted and becomes master again.

Does slave have master score? Your logs show only one node with master. To select another node as new master it needs non-zero master score as well.

> One approach that seems to work is to run crm_resource -M -r ms-SS16201289RN00023 -H mgraid-16201289RN00023-1 to move the resource to the other node (assuming that the master is running on node mgraid-16201289RN00023-0). My original understanding was that this would "restart" the resource on the destination node, but that was apparently a misunderstanding. I can change our scripts to use this approach, but a) I thought that maintaining the approach of demoting the master resource and promoting the slave to master was more generic, and b) I am unsure of any potential side effects of moving the resource. Given what I'm trying to accomplish, is this in fact the preferred approach?
>
> Regards,
> Michael

-----Original Message-----
From: Users <users-boun...@clusterlabs.org> On Behalf Of users-requ...@clusterlabs.org
Sent: Monday, August 12, 2019 1:10 PM
To: users@clusterlabs.org
Subject: [EXTERNAL] Users Digest, Vol 55, Issue 19

Today's Topics:

   1. why is node fenced ? (Lentes, Bernd)
   2. Postgres HA - pacemaker RA do not support auto failback (Shital A)
   3. Re: why is node fenced ? (Chris Walker)
   4.
      Re: Master/slave failover does not work as expected (Andrei Borzenkov)

----------------------------------------------------------------------

Message: 1
Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST)
From: "Lentes, Bernd" <bernd.len...@helmholtz-muenchen.de>
To: Pacemaker ML <users@clusterlabs.org>
Subject: [ClusterLabs] why is node fenced ?
Message-ID: <546330844.1686419.1565626164456.javamail.zim...@helmholtz-muenchen.de>
Content-Type: text/plain; charset=utf-8

Hi,

last Friday (9th of August) I had to install patches on my two-node cluster. I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), patched it, rebooted, started the cluster (systemctl start pacemaker) again, put the node back online, and everything was fine.

Then I wanted to do the same procedure with the other node (ha-idg-1). I put it in standby, patched it, rebooted, and started pacemaker again. But then ha-idg-1 fenced ha-idg-2; it said the node is unclean. I know that nodes which are unclean need to be shut down, that's logical.

But I don't know where the conclusion that the node is unclean comes from, or why it is unclean; I searched the logs and didn't find any hint.

I put the syslog and the pacemaker log on a Seafile share; I'd be very thankful if you'd have a look:
https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/

Here is the CLI history of the commands:

17:03:04 crm node standby ha-idg-2
17:07:15 zypper up (install updates on ha-idg-2)
17:17:30 systemctl reboot
17:25:21 systemctl start pacemaker.service
17:25:47 crm node online ha-idg-2
17:26:35 crm node standby ha-idg-1
17:30:21 zypper up (install updates on ha-idg-1)
17:37:32 systemctl reboot
17:43:04 systemctl start pacemaker.service
17:44:00 ha-idg-1 is fenced

Thanks.
Bernd

OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1

--

Bernd Lentes
Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

Perfekt ist wer keine Fehler macht
Also sind Tote perfekt

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

------------------------------

Message: 2
Date: Mon, 12 Aug 2019 12:24:02 +0530
From: Shital A <brightuser2...@gmail.com>
To: pgsql-gene...@postgresql.com, Users@clusterlabs.org
Subject: [ClusterLabs] Postgres HA - pacemaker RA do not support auto failback
Message-ID: <camp7vw_kf2em_buh_fpbznc9z6pvvx+7rxjymhfmcozxuwg...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,

Postgres version: 9.6
OS: RHEL 7.6

We are working on an HA setup for a Postgres cluster of two nodes in active-passive mode.

Installed:
Pacemaker 1.1.19
Corosync 2.4.3

The pacemaker agent with this installation doesn't support automatic failback. What I mean by that is explained below:
1. The cluster is set up as A - B, with A as master.
2. Kill services on A; node B will come up as master.
3. When node A is ready to rejoin the cluster, we have to delete the lock file it creates on one of the nodes and execute the cleanup command to get the node back as standby.

Step 3 is manual, so HA is not achieved in the real sense.

Please help to check:

1. Is there any version of the resource agent which supports automatic failback, avoiding the generation of the lock file and having to delete it?

2. If there is no such support and we need this functionality, do we have to modify the existing code?

How can this be achieved? Please suggest.

Thanks.

------------------------------

Message: 3
Date: Mon, 12 Aug 2019 17:47:02 +0000
From: Chris Walker <cwal...@cray.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Subject: Re: [ClusterLabs] why is node fenced ?
Message-ID: <eafef777-5a49-4c06-a2f6-8711f528b...@cray.com>
Content-Type: text/plain; charset="utf-8"

When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2; for example,

Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: pcmk_quorum_notification: Quorum retained | membership=1320 members=1

After ~20s (the dc-deadtime parameter), ha-idg-2 is marked 'unclean' and STONITHed as part of startup fencing.

There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw ha-idg-1 either, so it appears that there was no communication at all between the two nodes.

I'm not sure exactly why the nodes did not see one another, but there are indications of network issues around this time:

2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now running without any active interface!

so perhaps that's related.

HTH,
Chris

On 8/12/19, 12:09 PM, "Users on behalf of Lentes, Bernd" <users-boun...@clusterlabs.org on behalf of bernd.len...@helmholtz-muenchen.de> wrote:

> Hi,
> last Friday (9th of August) I had to install patches on my two-node cluster.
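Chris's dc-deadtime explanation suggests one mitigation: give a slow-booting peer more time before startup fencing declares it unclean. A sketch in crmsh syntax (matching the SLES environment in this thread); the 120s value is illustrative, not a recommendation from the thread:

```shell
# dc-deadtime is how long the cluster waits to hear from a peer at
# startup before it is considered unclean and startup fencing applies
# (the default is 20s, matching the ~20s observed above):
crm configure property dc-deadtime=120s

# Confirm the property took effect:
crm configure show | grep dc-deadtime
```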
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

------------------------------

Message: 4
Date: Mon, 12 Aug 2019 23:09:31 +0300
From: Andrei Borzenkov <arvidj...@gmail.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>
Cc: Venkata Reddy Chappavarapu <venkata.chappavar...@harmonicinc.com>
Subject: Re: [ClusterLabs] Master/slave failover does not work as expected
Message-ID: <CAA91j0WxSxt_eVmUvXgJ_0goBkBw69r3o-VesRvGc6atg6o=j...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Mon, Aug 12, 2019 at 4:12 PM Michael Powell <michael.pow...@harmonicinc.com> wrote:

> At 07:44:49, the ss agent discovers that the master instance has failed on node *mgraid?-0* as a result of a failed *ssadm* request in response to an *ss_monitor()* operation. It issues a *crm_master -Q -D* command with the intent of demoting the master and promoting the slave, on the other node, to master. The *ss_demote()* function finds that the application is no longer running and returns *OCF_NOT_RUNNING* (7). In the older product, this was sufficient to promote the other instance to master, but in the current product, that does not happen. Currently, the failed application is restarted, as expected, and is promoted to master, but this takes tens of seconds.

Did you try to disable resource stickiness for this ms?
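A closing note on the crm_resource -M workaround raised earlier in the thread: -M does not restart anything; it injects a location constraint preferring the target node, and that constraint persists until it is explicitly removed. A sketch using the Pacemaker 1.1 CLI, with the resource and node names from this thread:

```shell
# Move the master to the named node; under the hood this adds a
# "cli-prefer-..." location constraint pinning the resource there:
crm_resource -M -r ms-SS16201289RN00023 -H mgraid-16201289RN00023-1

# Once the move completes, remove the constraint so the cluster is
# again free to place the resource (-U / --un-move in Pacemaker 1.1):
crm_resource -U -r ms-SS16201289RN00023
```

Leaving the cli-prefer constraint in place is the main side effect to be aware of: it silently biases all future placement decisions toward the chosen node.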
------------------------------

End of Users Digest, Vol 55, Issue 19
*************************************

------------------------------

End of Users Digest, Vol 55, Issue 24
*************************************