Re: [ClusterLabs] Antw: [EXT] Which verson of pacemaker/corosync provides crm_feature_set 3.0.10?

2021-11-24 Thread Vitaly Zolotusky
Ulrich, 
Yes, Fedora is far ahead of 22, but our product has been in the field for quite 
a few years. Later versions run on Fedora 28, and now we have moved on to 
CentOS. Upgrades have always happened online with no service interruption.
The problem with upgrading to the new version of Corosync is that it does not 
talk to the old one.
Now that we need to replace Pacemaker, Corosync, and Postgres, we need one brief 
interruption for the update, but before we update Pacemaker we have to sync the 
new nodes with the old ones to keep the cluster running until all nodes are 
ready. 
So what we do is: we install the new nodes with the old versions of Pacemaker, 
Corosync, and Postgres, run them in a mixed (old/new) configuration until all 
nodes are ready, and then shut the cluster down for a couple of minutes to 
upgrade Pacemaker, Corosync, and Postgres (rough sketch below). 
The only reason we shut the cluster down at all is that the old Corosync does 
not talk to the new Corosync.
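The final cutover looks roughly like this (a sketch only - we use our own 
wrappers, and the package and service names here are just examples; "pcs 
cluster stop --all" could equally be "crm cluster stop" run on each node):

    # during the brief maintenance window
    pcs cluster stop --all                             # stop Pacemaker/Corosync cluster-wide
    yum upgrade corosync pacemaker postgresql-server   # swap in the new packages on every node
    pcs cluster start --all                            # bring the upgraded stack back up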
_Vitaly

> On November 24, 2021 2:22 AM Ulrich Windl  
> wrote:
> 
>  
> >>> vitaly wrote on 23.11.2021 at 20:11 in message
> <45677632.67420.1637694706...@webmail6.networksolutionsemail.com>:
> > Hello,
> > I am working on the upgrade from an older version of pacemaker/corosync to 
> > the current one. In the interim we need to sync a newly installed node with 
> > the node running the old software. Our old node uses pacemaker 1.1.13-3.fc22 
> > and corosync 2.3.5-1.fc22 and has crm_feature_set 3.0.10.
> > 
> > For the interim sync I used pacemaker 1.1.18-2.fc28 and corosync 2.4.4-1.fc28. 
> > This version is using crm_feature_set 3.0.14. 
> > This version is working fine, but it has issues in some edge cases, like 
> > when the new node starts alone and then the old one tries to join.
> 
> What I'm wondering (not wearing a red hat): Isn't Fedora at something like 33
> or 34 right now?
> If so, why bother with such old versions?
> 
> > 
> > So I need to rebuild rpms for crm_feature_set 3.0.10. This will be used just 
> > once and then it will be upgraded to the latest versions of pacemaker and 
> > corosync.
> > 
> > Now, a couple of questions:
> > 1. Which rpm defines crm_feature_set?
> > 2. Which version of this rpm has crm_feature_set 3.0.10?
> > 3. Where could I get source rpms to rebuild this rpm on CentOS 8?
> > Thanks a lot!
> > _Vitaly Zolotusky
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Upgrading/downgrading cluster configuration

2020-10-22 Thread Vitaly Zolotusky
Thanks for the reply.
I do see a backup config command in pcs, but not in crmsh. 
What would that be in crmsh? Would something like this work after Corosync 
starts in the state where all resources are inactive? I'll try this: 

crm configure save 
crm configure load 

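Something along these lines, with a placeholder file name - crmsh's load also 
has update/push methods besides replace, and pcs has its own backup/restore 
pair, so take this as a sketch rather than a tested recipe:

    # crmsh: dump the whole configuration, then load it back wholesale
    crm configure save /root/cib-backup.crm
    crm configure load replace /root/cib-backup.crm

    # pcs equivalent (tarball of the cluster configuration)
    pcs config backup /root/cluster-backup
    pcs config restore /root/cluster-backup.tar.bz2
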
Thank you!
_Vitaly Zolotusky

> On October 22, 2020 1:54 PM Strahil Nikolov  wrote:
> 
>  
> Have you tried to back up the config via crmsh/pcs and, when you downgrade, 
> to restore from it?
> 
> Best Regards,
> Strahil Nikolov
> 
> On Thursday, 22 October 2020, 15:40:43 GMT+3, Vitaly Zolotusky wrote: 
> 
> Hello,
> We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our 
> procedure includes an upgrade step where we stop the cluster, replace the 
> RPMs, and restart the cluster. The upgrade works fine, but we also need to 
> implement rollback in case something goes wrong. 
> When we roll back and reload the old RPMs, the cluster says that there are no 
> active resources. It looks like there is a problem with the cluster 
> configuration version.
> Here is output of the crm_mon:
> 
> d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
> Stack: corosync
> Current DC: NONE
> Last updated: Thu Oct 22 12:39:37 2020
> Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on 
> d21-22-left.lab.archivas.com
> 
> 2 nodes configured
> 15 resources configured
> 
> Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
> Node d21-22-right.lab.archivas.com: UNCLEAN (offline)
> 
> No active resources
> 
> 
> Node Attributes:
> 
> ***
> What would be the best way to implement a downgrade of the configuration? 
> Should we just change the crm feature set, or do we need to rebuild the whole config?
> Thanks!
> _Vitaly Zolotusky
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Upgrading/downgrading cluster configuration

2020-10-22 Thread Vitaly Zolotusky
Hello,
We are trying to upgrade our product from Corosync 2.X to Corosync 3.X. Our 
procedure includes an upgrade step where we stop the cluster, replace the RPMs, 
and restart the cluster. The upgrade works fine, but we also need to implement 
rollback in case something goes wrong. 
When we roll back and reload the old RPMs, the cluster says that there are no 
active resources. It looks like there is a problem with the cluster configuration version.
Here is output of the crm_mon:

d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
Stack: corosync
Current DC: NONE
Last updated: Thu Oct 22 12:39:37 2020
Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on 
d21-22-left.lab.archivas.com

2 nodes configured
15 resources configured

Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
Node d21-22-right.lab.archivas.com: UNCLEAN (offline)

No active resources


Node Attributes:

***
What would be the best way to implement a downgrade of the configuration? Should 
we just change the crm feature set, or do we need to rebuild the whole config?
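For context, this is roughly how we look at what the CIB thinks it is - the 
attribute names are as reported by cibadmin on our builds, so treat this as a 
sketch:

    # the CIB header records the schema and feature set it was written with
    cibadmin --query | head -n 1
    #   e.g. <cib crm_feature_set="3.0.14" validate-with="pacemaker-2.10" epoch="..." ...>

    # going forward is easy - this rewrites the CIB to the newest schema Pacemaker knows
    cibadmin --upgrade --force

There does not seem to be an equivalent command for going backward, which is why 
we are asking whether to edit the feature set by hand or rebuild the config.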
Thanks!
_Vitaly Zolotusky
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-12 Thread Vitaly Zolotusky
Hello,
This is exactly what we do in some of our software. We have a messaging version 
number and can negotiate the messaging protocol between nodes. Communication 
then happens on the highest common version.
It would be great to have something like that available in Corosync!
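As a toy illustration of the idea (plain shell, nothing to do with our actual 
wire format): each side advertises the range it supports, and both settle on 
the highest version that falls inside every range.

    local_min=2; local_max=4     # versions this node can speak
    peer_min=3;  peer_max=5      # range advertised by the peer
    agreed=$(( local_max < peer_max ? local_max : peer_max ))
    if [ "$agreed" -lt "$local_min" ] || [ "$agreed" -lt "$peer_min" ]; then
        echo "no common protocol version - refuse to join" >&2
    else
        echo "cluster speaks protocol version $agreed"
    fi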
Thanks,
_Vitaly

> On June 12, 2020 3:40 AM Ulrich Windl  
> wrote:
> 
>  
> Hi!
> 
> I can't help here, but in general I think corosync should support "upgrade"
> mode. Maybe like this:
> The newer version can also speak the previous protocol and the current
> protocol will be enabled only after all nodes in the cluster are upgraded.
> 
> Probably this would require a triple version number field like (oldest
> version supported, version being requested/used, newest version supported).
> 
> For the API, a query for the latest commonly agreed version number and a
> request to use a different version number would be needed, too.
> 
> Regards,
> Ulrich
> 
> >>> Vitaly Zolotusky wrote on 11.06.2020 at 04:14 in message
> <19881_1591841678_5EE1938D_19881_553_1_1163034878.247559.1591841668387@webmail6.networksolutionsemail.com>:
> > Hello everybody.
> > We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 
> > 2.99+. It looks like they are not compatible and we are getting messages 
> > like:
> > Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received from 
> > 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring
> > on the upgraded node and 
> > Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet has 
> > different crypto type. Rejecting
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message has 
> > invalid digest... ignoring.
> > on the pre-upgrade node.
> > 
> > Is there a good way to do this upgrade? 
> > I would appreciate it very much if you could point me to any documentation 
> > or articles on this issue.
> > Thank you very much!
> > _Vitaly
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Hello, Strahil.
Thanks for your suggestion. 
We are doing something similar to what you suggest, but:
1. We do not have external storage. Our product is a single box with 2 internal 
heads and 10-14 PB (petabytes) of data in a single box (or it could have 9 boxes 
hooked up together, but still with just 2 heads and 9 times more storage). 
2. Setting up a new cluster is kind of hard. We do that on extra partitions in a 
chroot while the old cluster is running, so the shutdown should be pretty short 
if we can figure out a way for the cluster to keep working while we configure 
the new partitions.
3. At this time we have to stop a node to move the configuration from the old to 
the new partition, initialize new databases, etc. While we are doing that, the 
other node takes over all processing.
We will see if we can incorporate your suggestion into our upgrade path.
Thanks a lot for your help!
_Vitaly

 
> On June 11, 2020 12:00 PM Strahil Nikolov  wrote:
> 
>  
> Hi Vitaly,
> 
> Have you considered something like this:
> 1. Set up a new cluster
> 2. Present the same shared storage on the new cluster
> 3. Prepare the resource configuration but do not apply it yet
> 4. Power down all resources on the old cluster
> 5. Deploy the resources on the new cluster and immediately bring the 
> resources up
> 6. Remove access to the shared storage for the old cluster
> 7. Wipe the old cluster.
> 
> Downtime will be way shorter.
> 
> Best Regards,
> Strahil Nikolov
> 
> On 11 June 2020 17:48:47 GMT+03:00, Vitaly Zolotusky wrote:
> >Thank you very much for quick reply!
> >I will try to either build new version on Fedora 22, or build the old
> >version on CentOs 8 and do a HA stack upgrade separately from my full
> >product/OS upgrade. A lot of my customers would be extremely unhappy
> >with even short downtime, so I can't really do the full upgrade
> >offline.
> >Thanks again!
> >_Vitaly
> >
> >> On June 11, 2020 10:14 AM Jan Friesse  wrote:
> >> 
> >>  
> >> > Thank you very much for your help!
> >> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that
> >it may work with rolling upgrade (we were fooled by the same major
> >version (2)). Our fresh install works fine on V3.0.3-5.
> >> > Do you know if it is possible to build Pacemaker 3.0.3-5 and
> >Corosync 2.0.3 on Fedora 22 so that I 
> >> 
> >> Good question. Fedora 22 is quite old but close to RHEL 7 for which
> >we 
> >> build packages automatically (https://kronosnet.org/builds/) so it 
> >> should be possible. But you are really on your own, because I don't 
> >> think anybody ever tried it.
> >> 
> >> Regards,
> >>Honza
> >> 
> >> 
> >> 
> >> upgrade the stack before starting "real" upgrade of the product?
> >> > Then I can do the following sequence:
> >> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version
> >> > 2. start HA stack on the old OS and product version with Pacemaker
> >3.0.3 and bring the product online
> >> > 3. start rolling upgrade for product upgrade to the new OS and
> >product version
> >> > Thanks again for your help!
> >> > _Vitaly
> >> > 
> >> >> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> >> >>
> >> >>   
> >> >> Vitaly,
> >> >>
> >> >>> Hello everybody.
> >> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to
> >Corosync 2.99+. It looks like they are not compatible and we are
> >getting messages like:
> >> >>
> >> >> Yes, they are not wire compatible. Also please do not use 2.99
> >versions,
> >> >> these were alfa/beta/rc before 3.0 and 3.0 is actually quite a
> >long time
> >> >> released (3.0.4 is latest and I would recommend using it - there
> >were
> >> >> quite a few important bugfixes between 3.0.0 and 3.0.4)
> >> >>
> >> >>
> >> >>> Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message
> >received from 172.18.52.44 has bad magic number (probably sent by
> >Corosync 2.3+).. Ignoring
> >> >>> on the upgraded node and
> >> >>> Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid
> >packet data
> >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming
> >packet has different crypto type. Rejecting
> >> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received

Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Thank you very much for the quick reply!
I will try either to build the new version on Fedora 22, or to build the old 
version on CentOS 8, and do the HA stack upgrade separately from my full 
product/OS upgrade. A lot of my customers would be extremely unhappy with even 
a short downtime, so I can't really do the full upgrade offline.
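If I go the rebuild route, I expect it to look roughly like this (the source 
RPM names are just examples, and the build dependencies will need to be 
satisfied first - untested):

    # on the build host
    dnf install rpm-build dnf-plugins-core
    dnf builddep pacemaker-1.1.13-3.fc22.src.rpm       # pull in build dependencies
    rpmbuild --rebuild pacemaker-1.1.13-3.fc22.src.rpm
    rpmbuild --rebuild corosync-2.3.5-1.fc22.src.rpm
    # resulting binary RPMs land under ~/rpmbuild/RPMS/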
Thanks again!
_Vitaly

> On June 11, 2020 10:14 AM Jan Friesse  wrote:
> 
>  
> > Thank you very much for your help!
> > We did try to go to V3.0.3-5 and then dropped to 2.99 in hope that it may 
> > work with rolling upgrade (we were fooled by the same major version (2)). 
> > Our fresh install works fine on V3.0.3-5.
> > Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 
> > on Fedora 22 so that I 
> 
> Good question. Fedora 22 is quite old but close to RHEL 7 for which we 
> build packages automatically (https://kronosnet.org/builds/) so it 
> should be possible. But you are really on your own, because I don't 
> think anybody ever tried it.
> 
> Regards,
>Honza
> 
> 
> 
> upgrade the stack before starting "real" upgrade of the product?
> > Then I can do the following sequence:
> > 1. "quick" full shutdown for HA stack upgrade to 3.0 version
> > 2. start HA stack on the old OS and product version with Pacemaker 3.0.3 
> > and bring the product online
> > 3. start rolling upgrade for product upgrade to the new OS and product 
> > version
> > Thanks again for your help!
> > _Vitaly
> > 
> >> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> >>
> >>   
> >> Vitaly,
> >>
> >>> Hello everybody.
> >>> We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 
> >>> 2.99+. It looks like they are not compatible and we are getting messages 
> >>> like:
> >>
> >> Yes, they are not wire compatible. Also please do not use the 2.99 versions;
> >> these were alpha/beta/rc builds before 3.0, and 3.0 has actually been
> >> released for quite a long time (3.0.4 is the latest and I would recommend
> >> using it - there were quite a few important bugfixes between 3.0.0 and 3.0.4)
> >>
> >>
> >>> Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received 
> >>> from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. 
> >>> Ignoring
> >>> on the upgraded node and
> >>> Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet 
> >>> data
> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet 
> >>> has different crypto type. Rejecting
> >>> Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message 
> >>> has invalid digest... ignoring.
> >>> on the pre-upgrade node.
> >>>
> >>> Is there a good way to do this upgrade?
> >>
> >> Usually the best way is to start from scratch in a testing environment to
> >> make sure everything works as expected. Then you can shut down the current
> >> cluster, upgrade, and start it again - the config file is mostly compatible;
> >> you may just consider changing the transport to knet. I don't think there is
> >> any definitive guide for doing the upgrade without shutting down the whole
> >> cluster, but somebody else may have an idea.
> >>
> >> Regards,
> >> Honza
> >>
> >>> I would appreciate it very much if you could point me to any 
> >>> documentation or articles on this issue.
> >>> Thank you very much!
> >>> _Vitaly
> >>> ___
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home: https://www.clusterlabs.org/
> >>>
> >
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-11 Thread Vitaly Zolotusky
Thank you very much for your help!
We did try to go to V3.0.3-5 and then dropped to 2.99 in the hope that it might 
work with a rolling upgrade (we were fooled by the same major version (2)). Our 
fresh install works fine on V3.0.3-5.
Do you know if it is possible to build Pacemaker 3.0.3-5 and Corosync 2.0.3 on 
Fedora 22, so that I can upgrade the stack before starting the "real" upgrade of 
the product? 
Then I can do the following sequence:
1. "quick" full shutdown for HA stack upgrade to 3.0 version
2. start HA stack on the old OS and product version with Pacemaker 3.0.3 and 
bring the product online
3. start rolling upgrade for product upgrade to the new OS and product version
Thanks again for your help!
_Vitaly

> On June 11, 2020 3:30 AM Jan Friesse  wrote:
> 
>  
> Vitaly,
> 
> > Hello everybody.
> > We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 
> > 2.99+. It looks like they are not compatible and we are getting messages 
> > like:
> 
> Yes, they are not wire compatible. Also please do not use the 2.99 versions; 
> these were alpha/beta/rc builds before 3.0, and 3.0 has actually been released 
> for quite a long time (3.0.4 is the latest and I would recommend using it - 
> there were quite a few important bugfixes between 3.0.0 and 3.0.4)
> 
> 
> > Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received 
> > from 172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. 
> > Ignoring
> > on the upgraded node and
> > Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet 
> > has different crypto type. Rejecting
> > Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message 
> > has invalid digest... ignoring.
> > on the pre-upgrade node.
> > 
> > Is there a good way to do this upgrade?
> 
> Usually the best way is to start from scratch in a testing environment to make 
> sure everything works as expected. Then you can shut down the current 
> cluster, upgrade, and start it again - the config file is mostly compatible; 
> you may just consider changing the transport to knet. I don't think there is 
> any definitive guide for doing the upgrade without shutting down the whole 
> cluster, but somebody else may have an idea.
> 
> Regards,
>Honza
> 
> > I would appreciate it very much if you could point me to any documentation 
> > or articles on this issue.
> > Thank you very much!
> > _Vitaly
> > ___
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> >
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Rolling upgrade from Corosync 2.3+ to Corosync 2.99+ or Corosync 3.0+?

2020-06-10 Thread Vitaly Zolotusky
Hello everybody.
We are trying to do a rolling upgrade from Corosync 2.3.5-1 to Corosync 2.99+. 
It looks like they are not compatible and we are getting messages like:
Jun 11 02:10:20 d21-22-left corosync[6349]:   [TOTEM ] Message received from 
172.18.52.44 has bad magic number (probably sent by Corosync 2.3+).. Ignoring
on the upgraded node and 
Jun 11 01:02:37 d21-22-right corosync[14912]:   [TOTEM ] Invalid packet data
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Incoming packet has 
different crypto type. Rejecting
Jun 11 01:02:38 d21-22-right corosync[14912]:   [TOTEM ] Received message has 
invalid digest... ignoring.
on the pre-upgrade node.

Is there a good way to do this upgrade? 
I would appreciate it very much if you could point me to any documentation or 
articles on this issue.
Thank you very much!
_Vitaly
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Fence_sbd script in Fedora30?

2019-09-24 Thread Vitaly Zolotusky
Thank you Everybody for quick reply!
I was a little confused that the fence_sbd script was removed from the sbd 
package. I guess it now lives only in fence-agents.
Also, I was looking for some guidance on the (new to me) parameters of 
fence_sbd, but I think I have figured that out. Another problem I have is that 
we modify scripts to work with our hardware, and I am in the process of going 
through these changes.
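For anyone following along, the direction I am heading looks roughly like this 
- the device path and resource name are placeholders, and the parameter names 
are as documented for the fence-agents fence_sbd agent, so double-check them 
against your version:

    # see what parameters the agent actually takes
    pcs stonith describe fence_sbd

    # minimal stonith resource pointing at the shared SBD device (placeholder path)
    pcs stonith create sbd-fence fence_sbd devices=/dev/disk/by-id/scsi-EXAMPLE-sbd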
Thanks again!
_Vitaly

> On September 24, 2019 at 12:29 AM Andrei Borzenkov  
> wrote:
> 
> 
> On 23.09.2019 23:23, Vitaly Zolotusky wrote:
> > Hello,
> > I am trying to upgrade to Fedora 30. The platform is a two-node cluster with 
> > Pacemaker.
> > In Fedora 28 we were using the old fence_sbd script from 2013:
> > 
> > # This STONITH script drives the shared-storage stonith plugin.
> > # Copyright (C) 2013 Lars Marowsky-Bree 
> > 
> > We were overwriting the distribution script in a custom-built RPM with the 
> > one from 2013.
> > It looks like there is no fence_sbd script anymore in the agents source, 
> > and some APIs changed so that the old script would not work.
> > Do you have any documentation / suggestions on how to move from the old 
> > fence_sbd script to the latest?
> 
> What's wrong with external/sbd stonith resource?
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Fence_sbd script in Fedora30?

2019-09-23 Thread Vitaly Zolotusky
Hello,
I am trying to upgrade to Fedora 30. The platform is a two-node cluster with 
Pacemaker.
In Fedora 28 we were using the old fence_sbd script from 2013:

# This STONITH script drives the shared-storage stonith plugin.
# Copyright (C) 2013 Lars Marowsky-Bree 

We were overwriting the distribution script in a custom-built RPM with the one 
from 2013.
It looks like there is no fence_sbd script anymore in the agents source, and 
some APIs changed so that the old script would not work.
Do you have any documentation / suggestions on how to move from the old 
fence_sbd script to the latest?
Thank you very much!
_Vitaly
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-18 Thread Vitaly Zolotusky
Chris,
Thanks a lot for the info. I'll explore both options.
_Vitaly

> On December 18, 2018 at 11:13 AM Chris Walker  wrote:
> 
> 
> Looks like rhino66-left was scheduled for fencing because it was not present 
> 20 seconds (the dc-deadtime parameter) after rhino66-right started Pacemaker 
> (startup fencing).  I can think of a couple of ways to allow all nodes to 
> survive if they come up far apart in time (i.e., farther apart than 
> dc-deadtime):
> 
> 1.  Increase dc-deadtime.  Unfortunately, the cluster always waits for 
> dc-deadtime to expire before starting resources, so this can delay your 
> cluster's startup.
> 
> 2.  As Ken mentioned, synchronize the starting of Corosync and Pacemaker.  I 
> did this with a simple ExecStartPre systemd script:
> 
> [root@bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root@bug0 ~]#
> 
> where ha_wait.sh has something like:
> 
> #!/bin/bash
> 
> timeout=600
> 
> peer=        # set this to the partner node's hostname
> 
> echo "Waiting for ${peer}"
> peerup() {
>   systemctl -H ${peer} show -p ActiveState corosync.service 2> /dev/null | \
> egrep -q "=active|=reloading|=failed|=activating|=deactivating" && return 0
>   return 1
> }
> 
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>   echo -n .
>   sleep 5
> done
> 
> peerup && echo "${peer} is up starting HA" || echo "${peer} not up after 
> ${timeout} starting HA alone"
> 
> 
> This will cause corosync startup to block for 10 minutes waiting for the 
> partner node to come up, after which both nodes will start corosync/pacemaker 
> close in time.  If one node never comes up, then it will wait 10 minutes 
> before starting, after which the other node will be fenced (startup fencing 
> and subsequent resource startup will only occur if 
> no-quorum-policy is set to ignore)
> 
> HTH,
> 
> Chris
> 
> On 12/17/18 6:25 PM, Vitaly Zolotusky wrote:
> 
> Ken, Thank you very much for quick response!
> I do have "two_node: 1" in the corosync.conf. I have attached it to this 
> email (not from the same system as original messages, but they are all the 
> same).
> Syncing startup of corosync and pacemaker on different nodes would be a 
> problem for us.
> I suspect that the problem is that corosync assumes quorum is reached as soon 
> as corosync is started on both nodes, but pacemaker does not abort fencing 
> until pacemaker starts on the other node.
> 
> I will try to work around this issue by moving corosync and pacemaker 
> startups on single node as close to each other as possible.
> Thanks again!
> _Vitaly
> 
> 
> 
> On December 17, 2018 at 6:01 PM Ken Gaillot wrote:
> 
> 
> On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> 
> 
> Hello,
> I have a 2 node cluster and stonith is configured for SBD and
> fence_ipmilan.
> fence_ipmilan for node 1 is configured for 0 delay and for node 2 for
> 30 sec delay so that nodes do not start killing each other during
> startup.
> 
> 
> 
> If you're using corosync 2 or later, you can set "two_node: 1" in
> corosync.conf. That implies the wait_for_all option, so that at start-
> up, both nodes must be present before quorum can be reached the first
> time. (After that point, one node can go away and quorum will be
> retained.)
> 
> Another way to avoid this is to start corosync on all nodes, then start
> pacemaker on all nodes.
> 
> 
> 
> In some cases (usually right after installation and when node 1 comes
> up first and node 2 second) the node that comes up first (node 1)
> states that node 2 is unclean, but can't fence it until quorum
> reached.
> Then as soon as quorum is reached after startup of corosync on node 2
> it sends a fence request for node 2.
> Fence_ipmilan gets into 30 sec delay.
> Pacemaker gets started on node 2.
> While fence_ipmilan is still waiting for the delay node 1 crmd aborts
> transition that requested the fence.
> Even though the transition was aborted, when delay time expires node
> 2 gets fenced.
> 
> 
> 
> Currently, pacemaker has no way of cancelling fencing once it's been
> initiated. Technically, it would be possible to cancel an operation
> that's in the delay stage (assuming that no other fence device has
> already been attempted, if there are more than one), but that hasn't
> been implemented.
> 
> 
> 
> Excerpts from messages are below. I also attached messages from both
> > nodes and pe-input files from node 1.

Re: [ClusterLabs] Antw: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-18 Thread Vitaly Zolotusky
Ulrich,
Thank you very much for suggestion.
My guess is that node 2 is considered unclean because both nodes were rebooted 
without Pacemaker's knowledge after installation. For our appliance it should be 
OK, as we are supposed to survive multiple hardware and power failures. So this 
case is just an indication that something is not working right, and we would 
like to get to the bottom of it.
Thanks again!
_Vitaly

> On December 18, 2018 at 1:47 AM Ulrich Windl 
>  wrote:
> 
> 
> >>> Vitaly Zolotusky wrote on 17.12.2018 at 21:43 in message
> <1782126841.215210.1545079428...@webmail6.networksolutionsemail.com>:
> > Hello,
> > I have a 2 node cluster and stonith is configured for SBD and fence_ipmilan.
> > fence_ipmilan for node 1 is configured for 0 delay and for node 2 for 30 
> > sec 
> > delay so that nodes do not start killing each other during startup.
> > In some cases (usually right after installation and when node 1 comes up 
> > first and node 2 second) the node that comes up first (node 1) states that 
> > node 2 is unclean, but can't fence it until quorum reached. 
> 
> I'd concentrate on examining why node2 is considered unclean. Of course that 
> doesn't fix the issue, but if fixing it takes some time, you'll have a 
> work-around ;-)
> 
> > Then as soon as quorum is reached after startup of corosync on node 2 it 
> > sends a fence request for node 2. 
> > Fence_ipmilan gets into 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting for the delay node 1 crmd aborts 
> > transition that requested the fence.
> > Even though the transition was aborted, when delay time expires node 2 gets 
> > fenced.
> > Excerpts from messages are below. I also attached messages from both nodes 
> > and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> > 
> > Here are excerpts from the messages:
> > 
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up first  
> > *
> > 
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and 
> > resource 
> > management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node 
> > rhino66-left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence 
> > unclean 
> > nodes until quorum is attained (or no-quorum-policy is set to ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new membership 
> > (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for all 
> > cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is 
> > within 
> > the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]: 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed service 
> > synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a 
> > node 
> > name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a 
> > node 
> > name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null) state is 
> > now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node (null) 
> > state 
> > is now member
> > Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: State transition 
> > S_IDLE 
> >-> S_POLICY_ENGINE
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice: Watchdog will be 
> > used via SBD if fencing is required
> > Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Scheduling Node 
> > rhino66-left.lab.archivas.com for STONITH
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Fence (reboot) 
> > rhino66-left.lab.archivas.com 'node is unclean'
> > Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start  

Re: [ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

2018-12-17 Thread Vitaly Zolotusky
Ken, thank you very much for the quick response! 
I do have "two_node: 1" in the corosync.conf. I have attached it to this email 
(not from the same system as original messages, but they are all the same).
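For reference, the relevant part of that corosync.conf is roughly this (trimmed 
to the quorum section):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implies wait_for_all: 1 unless explicitly overridden, so the
        # very first start waits until both nodes have been seen at least once
    }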
Syncing startup of corosync and pacemaker on different nodes would be a problem 
for us.
I suspect that the problem is that corosync assumes quorum is reached as soon 
as corosync is started on both nodes, but pacemaker does not abort fencing 
until pacemaker starts on the other node.

I will try to work around this issue by moving the corosync and pacemaker 
startups on a single node as close to each other as possible.
Thanks again!
_Vitaly

> On December 17, 2018 at 6:01 PM Ken Gaillot  wrote:
> 
> 
> On Mon, 2018-12-17 at 15:43 -0500, Vitaly Zolotusky wrote:
> > Hello,
> > I have a 2 node cluster and stonith is configured for SBD and
> > fence_ipmilan.
> > fence_ipmilan for node 1 is configured for 0 delay and for node 2 for
> > 30 sec delay so that nodes do not start killing each other during
> > startup.
> 
> If you're using corosync 2 or later, you can set "two_node: 1" in
> corosync.conf. That implies the wait_for_all option, so that at start-
> up, both nodes must be present before quorum can be reached the first
> time. (After that point, one node can go away and quorum will be
> retained.)
> 
> Another way to avoid this is to start corosync on all nodes, then start
> pacemaker on all nodes.
> 
> > In some cases (usually right after installation and when node 1 comes
> > up first and node 2 second) the node that comes up first (node 1)
> > states that node 2 is unclean, but can't fence it until quorum
> > reached. 
> > Then as soon as quorum is reached after startup of corosync on node 2
> > it sends a fence request for node 2. 
> > Fence_ipmilan gets into 30 sec delay.
> > Pacemaker gets started on node 2.
> > While fence_ipmilan is still waiting for the delay node 1 crmd aborts
> > transition that requested the fence.
> > Even though the transition was aborted, when delay time expires node
> > 2 gets fenced.
> 
> Currently, pacemaker has no way of cancelling fencing once it's been
> initiated. Technically, it would be possible to cancel an operation
> that's in the delay stage (assuming that no other fence device has
> already been attempted, if there are more than one), but that hasn't
> been implemented.
> 
> > Excerpts from messages are below. I also attached messages from both
> > nodes and pe-input files from node 1.
> > Any suggestions would be appreciated.
> > Thank you very much for your help!
> > Vitaly Zolotusky
> > 
> > Here are excerpts from the messages:
> > 
> > Node 1 - controller - rhino66-right 172.18.51.81 - came up
> > first  *
> > 
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and
> > resource management disabled due to lack of quorum
> > Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node rhino66-
> > left.lab.archivas.com is unclean!
> > Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence
> > unclean nodes until quorum is attained (or no-quorum-policy is set to
> > ignore)
> > .
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new
> > membership (172.16.1.81:60) was formed. Members joined: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for
> > all cluster members. Current votes: 1 expected_votes: 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is
> > within the primary component and will provide service.
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]:
> > 1 2
> > Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed
> > service synchronization, ready to provide service.
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum
> > acquired
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not
> > obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain
> > a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null)
> > state is now member
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not
> > obtain a node name for corosync nodeid 2
> > Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node
> > (null)