Re: [ClusterLabs] Pacemaker and Corosync versions compatibility

2016-06-24 Thread Digimer
On 24/06/16 09:46 PM, Maciej Kopczyński wrote:
> Hello,
> 
> I've been following a tutorial to set up a simple HA cluster using
> Pacemaker and Corosync on CentOS 6.x when I noticed that the original
> documentation states:
> 
> "Since |pcs| has the ability to manage all aspects of the cluster (both
> corosync and pacemaker), it requires a specific cluster stack to be in
> use: corosync 2.0 or later with votequorum plus Pacemaker 1.1.8 or later."
> 
> Here are the versions of packages installed in my system (CentOS 6.7):
> pacemaker-1.1.14-8.el6.x86_64
> corosync-1.4.7-5.el6.x86_64
> 
> I have not done much testing yet, but my cluster seems to be more or
> less working so far. What are the compatibility issues, then? What will
> not work with a corosync version lower than 2.0?
> 
> Thanks in advance for your answers.

You will notice on EL6 that you have cman installed and that there is an
/etc/cluster/cluster.conf file. This is a special plugin setup created to
support pacemaker on RHEL 6 (it was officially added around the 6.4/6.5
releases). cman acts as the quorum provider and drives corosync v1.4 so
that pacemaker can work with it.

This is odd, admittedly, but it was part of the process of merging what
had formerly been two separate HA stacks. You can read about the history
and merger here: https://alteeve.ca/w/History_of_HA_Clustering

In RHEL 7, the merger was finished and you have the final stack of just
corosync v2 and pacemaker. However, el6 with the cman plugin also works
just fine.
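A quick way to confirm which stack you are actually running on (this
assumes the stock EL6 packages; adjust to taste):

  rpm -q pacemaker corosync cman      # all three should be installed on EL6
  cman_tool status                    # quorum/membership as seen by cman
  ls -l /etc/cluster/cluster.conf     # the cman/corosync v1 configuration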

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker and Corosync versions compatibility

2016-06-24 Thread Maciej Kopczyński
Hello,

I've been following a tutorial to set up a simple HA cluster using
Pacemaker and Corosync on CentOS 6.x when I noticed that the original
documentation states:

"Since pcs has the ability to manage all aspects of the cluster (both
corosync and pacemaker), it requires a specific cluster stack to be in use:
corosync 2.0 or later with votequorum plus Pacemaker 1.1.8 or later."

Here are the versions of packages installed in my system (CentOS 6.7):
pacemaker-1.1.14-8.el6.x86_64
corosync-1.4.7-5.el6.x86_64

I have not done much testing yet, but my cluster seems to be more or less
working so far. What are the compatibility issues, then? What will not work
with a corosync version lower than 2.0?

Thanks in advance for your answers.

MK
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] DLM standalone without crm ?

2016-06-24 Thread Digimer
Yes, technically. I've not played with it stand-alone, and I believe it
will still need corosync for internode communication and membership.
Also, if a node fails and can't be fenced, I believe it will block.
Others here might be able to speak more authoritatively than I.
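If you do want to experiment, something along these lines might get it
going in a sandbox (untested sketch; it assumes a dlm_controld 4.x package
that reads /etc/dlm/dlm.conf, plus a working corosync.conf with quorum):

  # /etc/dlm/dlm.conf -- playground only, never do this in production
  enable_fencing=0

  service corosync start
  dlm_controld -D &        # -D: stay in the foreground with debug output
  dlm_tool join testls     # create/join a throwaway lockspace to poke at
  dlm_tool ls              # list lockspaces
  dlm_tool leave testls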

madi

On 24/06/16 11:34 AM, Lentes, Bernd wrote:
> Hi,
> 
> Is it possible to run DLM without a CRM? Just for playing around a
> bit and getting used to some stuff.
> 
> 
> Bernd
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] DLM standalone without crm ?

2016-06-24 Thread Lentes, Bernd
Hi,

Is it possible to run DLM without a CRM? Just for playing around a
bit and getting used to some stuff.


Bernd

-- 
Bernd Lentes 

System administration 
Institute of Developmental Genetics 
Building 35.34 - Room 208 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 (0)89 3187 1241 
fax: +49 (0)89 3187 2294 

Those who believe that project managers manage projects
also believe that a Zitronenfalter (brimstone butterfly,
literally "lemon folder") folds lemons.
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Chair of the Supervisory Board: MinDir'in Baerbel Brumme-Bothe
Managing Directors: Prof. Dr. Guenther Wess, Dr. Alfons Enhsen, Renate Schlusen 
(acting)
Registration court: Amtsgericht Muenchen, HRB 6466
VAT ID: DE 129521671


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-24 Thread Ken Gaillot
On 06/24/2016 05:41 AM, Adam Spiers wrote:
> Andrew Beekhof  wrote:
>> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers  wrote:
>>> Andrew Beekhof  wrote:
>>
>>>>> Well, if you're OK with bending the rules like this then that's good
>>>>> enough for me to say we should at least try it :)
>>>>
>>>> I still say you shouldn't only do it on error.
>>>
>>> When else should it be done?
>>
>> I was thinking whenever a stop() happens.
> 
> OK, seems we are agreed finally :)
> 
>>> IIUC, disabling/enabling the service is independent of the up/down
>>> state which nova tracks automatically, and which based on slightly
>>> more than a skim of the code, is dependent on the state of the RPC
>>> layer.
>>>
>>>>> But how would you avoid repeated consecutive invocations of "nova
>>>>> service-disable" when the monitor action fails, and ditto for "nova
>>>>> service-enable" when it succeeds?
>>>>
>>>> I don't think you can. Not ideal but I'd not have thought a deal breaker.
>>>
>>> Sounds like a massive deal-breaker to me!  With op monitor
>>> interval="10s" and 100 compute nodes, that would mean 10 pointless
>>> calls to nova-api every second.  Am I missing something?
>>
>> I was thinking you would only call it for the "I detected a failure
>> case" and service-enable would still be on start().
>> So the number of pointless calls per second would be capped at one
>> tenth of the number of failed compute nodes.
>>
>> One would hope that all of them weren't dead.
> 
> Oh OK - yeah that wouldn't be nearly as bad.
> 
>>> Also I don't see any benefit to moving the API calls from start/stop
>>> actions to the monitor action.  If there's a failure, Pacemaker will
>>> invoke the stop action, so we can do service-disable there.
>>
>> I agree. Doing it unconditionally at stop() is my preferred option, I
>> was only trying to provide a path that might be close to the behaviour
>> you were looking for.
>>
>>> If the
>>> start action is invoked and we successfully initiate startup of
>>> nova-compute, the RA can undo any service-disable it previously did
>>> (although it should not reverse a service-disable done elsewhere,
>>> e.g. manually by the cloud operator).
>>
>> Agree
> 
> Trying to adjust to this new sensation of agreement ;-)
> 
>>>>> Earlier in this thread I proposed
>>>>> the idea of a tiny temporary file in /run which tracks the last known
>>>>> state and optimizes away the consecutive invocations, but IIRC you
>>>>> were against that.
>>>>
>>>> I'm generally not a fan, but sometimes state files are a necessity.
>>>> Just make sure you think through what a missing file might mean.
>>>
>>> Sure.  A missing file would mean the RA's never called service-disable
>>> before,
>>
>> And that is why I generally don't like state files.
>> The default location for state files doesn't persist across reboots.
>>
>> t1. stop (ie. disable)
>> t2. reboot
>> t3. start with no state file
>> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS
> 
> Well then we simply put the state file somewhere which does persist
> across reboots.

There's also the possibility of using a node attribute. If you set a
normal node attribute, it will abort the transition and calculate a new
one, so that's something to take into account. You could set a private
node attribute, which never gets written to the CIB and thus doesn't
abort transitions, but it also does not survive a complete cluster stop.
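For the record, an agent would set such an attribute with something like
the following (the attribute name here is made up, and -p/--private needs
a reasonably recent attrd):

  attrd_updater -n nova_service_disabled -U 1 -p   # set; never written to the CIB
  attrd_updater -n nova_service_disabled -Q        # read it back later
  attrd_updater -n nova_service_disabled -D        # clear it again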

>>> which means that it shouldn't call service-enable on startup.
>>>
>>>> Unless we use the state file to store the date at which the last
>>>> start operation occurred?
>>>>
>>>> If we're calling stop() and date - start_date > threshold, then, if
>>>> you must, be optimistic, skip service-disable and assume we'll get
>>>> started again soon.
>>>>
>>>> Otherwise if we're calling stop() and date - start_date <= threshold,
>>>> always call service-disable because we're in a restart loop which is
>>>> not worth optimising for.
>>>>
>>>> ( And always call service-enable at start() )
>>>>
>>>> No Pacemaker feature or Beekhof approval required :-)
>>>
>>> Hmm ...  it's possible I just don't understand this proposal fully,
>>> but it sounds a bit woolly to me, e.g. how would you decide a suitable
>>> threshold?
>>
>> roll a dice?
>>
>>> I think I preferred your other suggestion of just skipping the
>>> optimization, i.e. calling service-disable on the first stop, and
>>> service-enable on (almost) every start.
>>
>> good :)
>>
>> And the use of force-down from your subsequent email sounds excellent
> 
> OK great!  We finally got there :-)  Now I guess I just have to write
> the spec and the actual code ;-)
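For anyone skimming this thread later, the stop/start behaviour converged
on above would look roughly like the shell below. This is only a sketch:
the function names, the state file path and the hostname handling are
invented here, it is not the real NovaCompute agent.

  STATEFILE=/var/lib/nova-compute-ra/service_disabled   # arbitrary path; must survive reboots

  ra_stop() {
      stop_nova_compute_process
      # disable unconditionally on every stop, as agreed above
      nova service-disable "$(hostname)" nova-compute && touch "$STATEFILE"
  }

  ra_start() {
      start_nova_compute_process
      # only undo a disable we did ourselves, not one set by an operator
      if [ -f "$STATEFILE" ]; then
          nova service-enable "$(hostname)" nova-compute && rm -f "$STATEFILE"
      fi
  }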

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-24 Thread Adam Spiers
Andrew Beekhof  wrote:
> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers  wrote:
> > Andrew Beekhof  wrote:
> 
> >> > Well, if you're OK with bending the rules like this then that's good
> >> > enough for me to say we should at least try it :)
> >>
> >> I still say you shouldn't only do it on error.
> >
> > When else should it be done?
> 
> I was thinking whenever a stop() happens.

OK, seems we are agreed finally :)

> > IIUC, disabling/enabling the service is independent of the up/down
> > state which nova tracks automatically, and which based on slightly
> > more than a skim of the code, is dependent on the state of the RPC
> > layer.
> >
> >> > But how would you avoid repeated consecutive invocations of "nova
> >> > service-disable" when the monitor action fails, and ditto for "nova
> >> > service-enable" when it succeeds?
> >>
> >> I don't think you can. Not ideal but I'd not have thought a deal breaker.
> >
> > Sounds like a massive deal-breaker to me!  With op monitor
> > interval="10s" and 100 compute nodes, that would mean 10 pointless
> > calls to nova-api every second.  Am I missing something?
> 
> I was thinking you would only call it for the "I detected a failure
> case" and service-enable would still be on start().
> So the number of pointless calls per second would be capped at one
> tenth of the number of failed compute nodes.
> 
> One would hope that all of them weren't dead.

Oh OK - yeah that wouldn't be nearly as bad.

> > Also I don't see any benefit to moving the API calls from start/stop
> > actions to the monitor action.  If there's a failure, Pacemaker will
> > invoke the stop action, so we can do service-disable there.
> 
> I agree. Doing it unconditionally at stop() is my preferred option, I
> was only trying to provide a path that might be close to the behaviour
> you were looking for.
> 
> > If the
> > start action is invoked and we successfully initiate startup of
> > nova-compute, the RA can undo any service-disable it previously did
> > (although it should not reverse a service-disable done elsewhere,
> > e.g. manually by the cloud operator).
> 
> Agree

Trying to adjust to this new sensation of agreement ;-)

> >> > Earlier in this thread I proposed
> >> > the idea of a tiny temporary file in /run which tracks the last known
> >> > state and optimizes away the consecutive invocations, but IIRC you
> >> > were against that.
> >>
> >> I'm generally not a fan, but sometimes state files are a necessity.
> >> Just make sure you think through what a missing file might mean.
> >
> > Sure.  A missing file would mean the RA's never called service-disable
> > before,
> 
> And that is why I generally don't like state files.
> The default location for state files doesn't persist across reboots.
> 
> t1. stop (ie. disable)
> t2. reboot
> t3. start with no state file
> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS

Well then we simply put the state file somewhere which does persist
across reboots.

> > which means that it shouldn't call service-enable on startup.
> >
> >> Unless we use the state file to store the date at which the last
> >> start operation occurred?
> >>
> >> If we're calling stop() and date - start_date > threshold, then, if
> >> you must, be optimistic, skip service-disable and assume we'll get
> >> started again soon.
> >>
> >> Otherwise if we're calling stop() and date - start_date <= threshold,
> >> always call service-disable because we're in a restart loop which is
> >> not worth optimising for.
> >>
> >> ( And always call service-enable at start() )
> >>
> >> No Pacemaker feature or Beekhof approval required :-)
> >
> > Hmm ...  it's possible I just don't understand this proposal fully,
> > but it sounds a bit woolly to me, e.g. how would you decide a suitable
> > threshold?
> 
> roll a dice?
> 
> > I think I preferred your other suggestion of just skipping the
> > optimization, i.e. calling service-disable on the first stop, and
> > service-enable on (almost) every start.
> 
> good :)
> 
> And the use of force-down from your subsequent email sounds excellent

OK great!  We finally got there :-)  Now I guess I just have to write
the spec and the actual code ;-)
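(For anyone following along: "force-down" refers to the newer compute API
call for marking a service down immediately. With a recent enough
novaclient and API microversion it is roughly:

  nova service-force-down <hostname> nova-compute
  nova service-force-down --unset <hostname> nova-compute   # undo it

but check what your deployment actually supports.)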

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: restarting pacemakerd

2016-06-24 Thread Ulrich Windl
>>> Ferenc Wágner  wrote on 18.06.2016 at 12:15 in message
<8760t66bwn@lant.ki.iif.hu>:
> Hi,
> 
> Could somebody please elaborate a little why the pacemaker systemd
> service file contains "Restart=on-failure"?  I mean that a failed node
> gets fenced anyway, so most of the time this would be a futile effort.

I guess it's the "better than nothing" category. ;-)

> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system).  Still,
> it is not.  I'd be grateful for some insight into the matter.
> -- 
> Thanks,
> Feri
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Alert notes

2016-06-24 Thread Klaus Wenninger
On 06/24/2016 09:16 AM, Ulrich Windl wrote:
>>>> Ferenc Wágner  wrote on 15.06.2016 at 18:11 in message
> <87vb1a5t4k@lant.ki.iif.hu>:
>> Hi,
>>
> [...]
>> The SNMP agent seems to have a problem with hrSystemDate, which should
>> be an OCTETSTR with strict format, not some plain textual timestamp.
> ???
> snmptranslate -M+. -m+NET-SNMP-MIB -m+HOST-RESOURCES-MIB -Tp -Ib hrSystemDate
> +-- -RW- String    hrSystemDate(2)
>  Textual Convention: DateAndTime
>  Size: 8 | 11
>
> [...]
> DateAndTime ::= TEXTUAL-CONVENTION
> DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"
> STATUS   current
> DESCRIPTION
> "A date-time specification.
>
> field  octets  contents                  range
> -----  ------  --------                  -----
>   1      1-2   year*                     0..65536
>   2       3    month                     1..12
>   3       4    day                       1..31
>   4       5    hour                      0..23
>   5       6    minutes                   0..59
>   6       7    seconds                   0..60
>                (use 60 for leap-second)
>   7       8    deci-seconds              0..9
>   8       9    direction from UTC        '+' / '-'
>   9      10    hours from UTC*           0..13
>  10      11    minutes from UTC          0..59
>
> * Notes:
> - the value of year is in network-byte order
> - daylight saving time in New Zealand is +13
>
> For example, Tuesday May 26, 1992 at 1:30:15 PM EDT would be
> displayed as:
>
>  1992-5-26,13:30:15.0,-4:0
>
> Note that if only local time is known, then timezone
> information (fields 8-10) is not present."
> SYNTAX   OCTET STRING (SIZE (8 | 11))
>
>
But as already discussed, fortunately one doesn't have to deal with the
binary representation when using the snmptrap tool (as the example script
does), because the tool does the conversion for us. It doesn't seem to be
picky about zero-padding either, so the format string given in the header
of the SNMP example script should do the job ("%Y-%m-%d,%H:%M:%S.%01N").
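In CIB terms that would be something like the snippet below, under the
alerts section (the ids, script path and recipient value are placeholders):

  <!-- path and recipient are placeholders; adjust to your setup -->
  <alert id="snmp_alert" path="/path/to/alert_snmp.sh">
    <meta_attributes id="snmp_alert-meta">
      <nvpair id="snmp_alert-meta-ts" name="timestamp-format"
              value="%Y-%m-%d,%H:%M:%S.%01N"/>
    </meta_attributes>
    <recipient id="snmp_alert-recipient" value="192.168.1.2"/>
  </alert>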
>> But I haven't really looked into this yet.
>> -- 
>> Regards,
>> Feri
>>
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Cluster reboot for maintenance

2016-06-24 Thread Marco Felettigh
Maintenance worked perfectly:

- cluster in maintenance: crm configure property maintenance-mode=true
- update vm os etc
- stop corosync/pacemaker
- reboot
- start corosync/pacemaker
- cluster out of maintenance: crm configure property
  maintenance-mode=false
- all resources went up ok
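Condensed into commands, this was roughly the sequence on each node
(init-script names are a guess for this old stack; if pacemaker runs as a
corosync plugin, ver: 0, there is no separate pacemaker service):

  crm configure property maintenance-mode=true
  # ... update the VM / OS packages ...
  service pacemaker stop     # only if pacemaker is a separate daemon
  service corosync stop
  reboot
  service corosync start
  service pacemaker start    # again, only if it is a separate daemon
  crm configure property maintenance-mode=false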

Best regards
Marco

On Mon, 20 Jun 2016 15:42:11 -0500
Ken Gaillot  wrote:

> On 06/20/2016 07:45 AM, ma...@nucleus.it wrote:
> > Hi,
> > I have a two-node cluster with some vms (pacemaker resources)
> > running on the two hypervisors:
> > pacemaker-1.0.10
> > corosync-1.3.0
> > 
> > I need to do maintenance stuff, so I need to:
> > - put the cluster in maintenance mode so the cluster doesn't
> >   touch/start/stop/monitor the vms
> > - update the vms
> > - stop the vms
> > - stop the cluster stack (corosync/pacemaker) so it does not
> >   start/stop/monitor the vms
> > - reboot the hypervisors
> > - start the cluster stack
> > - take the cluster out of maintenance mode so it starts all the vms
> > 
> > What is the correct way to do that on the corosync/pacemaker side?
> > 
> > 
> > Best regards
> > Marco  
> 
> Maintenance mode provides this ability. Set the maintenance-mode
> cluster property to true, do whatever you want, then set it back to
> false when done.
> 
> That said, I've never used pacemaker/corosync versions that old, so
> I'm not 100% sure that applies to those versions, though I would
> guess it does.
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
> http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Alert notes

2016-06-24 Thread Ulrich Windl
>>> Ferenc Wágner  wrote on 15.06.2016 at 18:11 in message
<87vb1a5t4k@lant.ki.iif.hu>:
> Hi,
> 
[...]
> The SNMP agent seems to have a problem with hrSystemDate, which should
> be an OCTETSTR with strict format, not some plain textual timestamp.

???
snmptranslate -M+. -m+NET-SNMP-MIB -m+HOST-RESOURCES-MIB -Tp -Ib hrSystemDate
+-- -RW- String    hrSystemDate(2)
 Textual Convention: DateAndTime
 Size: 8 | 11

[...]
DateAndTime ::= TEXTUAL-CONVENTION
DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"
STATUS   current
DESCRIPTION
"A date-time specification.

field  octets  contents                  range
-----  ------  --------                  -----
  1      1-2   year*                     0..65536
  2       3    month                     1..12
  3       4    day                       1..31
  4       5    hour                      0..23
  5       6    minutes                   0..59
  6       7    seconds                   0..60
               (use 60 for leap-second)
  7       8    deci-seconds              0..9
  8       9    direction from UTC        '+' / '-'
  9      10    hours from UTC*           0..13
 10      11    minutes from UTC          0..59

* Notes:
- the value of year is in network-byte order
- daylight saving time in New Zealand is +13

For example, Tuesday May 26, 1992 at 1:30:15 PM EDT would be
displayed as:

 1992-5-26,13:30:15.0,-4:0

Note that if only local time is known, then timezone
information (fields 8-10) is not present."
SYNTAX   OCTET STRING (SIZE (8 | 11))


> But I haven't really looked into this yet.
> -- 
> Regards,
> Feri
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org