Re: [ClusterLabs] Pacemaker and Corosync versions compatibility
On 24/06/16 09:46 PM, Maciej Kopczyński wrote:
> Hello,
>
> I've been following a tutorial to set up a simple HA cluster using
> Pacemaker and Corosync on CentOS 6.x while I have noticed that in the
> original documentation it is stated that:
>
> "Since |pcs| has the ability to manage all aspects of the cluster (both
> corosync and pacemaker), it requires a specific cluster stack to be in
> use: corosync 2.0 or later with votequorum plus Pacemaker 1.1.8 or later."
>
> Here are the versions of packages installed in my system (CentOS 6.7):
> pacemaker-1.1.14-8.el6.x86_64
> corosync-1.4.7-5.el6.x86_64
>
> I did not do that much of testing, but my cluster seems to be more or
> less working so far, what are the compatibility issues then? What will
> not work with corosync in version lower than 2.0?
>
> Thanks in advance for your answers.

You will notice on EL6 that you have cman installed and that there is an
/etc/cluster/cluster.conf file. This is a special plugin setup made to
provide support for pacemaker in RHEL 6 (it was added officially in the
late 6.4/6.5 releases). cman is the quorum provider, and it manages
corosync v1.4 so that pacemaker can work with it.

This is odd, admittedly, but it was part of the process of merging what
had formerly been two separate HA stacks. You can read about the history
and merger here:

https://alteeve.ca/w/History_of_HA_Clustering

In RHEL 7, the merger was finished and you have the final stack of just
corosync v2 and pacemaker. However, in el6 with the cman plugin, it will
also work just fine.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Pacemaker and Corosync versions compatibility
Hello,

I've been following a tutorial to set up a simple HA cluster using
Pacemaker and Corosync on CentOS 6.x while I have noticed that in the
original documentation it is stated that:

"Since pcs has the ability to manage all aspects of the cluster (both
corosync and pacemaker), it requires a specific cluster stack to be in
use: corosync 2.0 or later with votequorum plus Pacemaker 1.1.8 or later."

Here are the versions of packages installed in my system (CentOS 6.7):
pacemaker-1.1.14-8.el6.x86_64
corosync-1.4.7-5.el6.x86_64

I did not do that much of testing, but my cluster seems to be more or
less working so far, what are the compatibility issues then? What will
not work with corosync in version lower than 2.0?

Thanks in advance for your answers.

MK
Re: [ClusterLabs] DLM standalone without crm ?
Yes, technically. I've not played with it stand-alone, and I believe it
will still need corosync for internode communication and membership.
Also, if a node fails and can't be fenced, I believe it will block.
Others here might be able to speak more authoritatively than I.

madi

On 24/06/16 11:34 AM, Lentes, Bernd wrote:
> Hi,
>
> is it possible to have a DLM running without CRM ? Just for playing around a
> bit and get used to some stuff.
>
>
> Bernd
>

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
[ClusterLabs] DLM standalone without crm ?
Hi,

is it possible to have a DLM running without CRM ? Just for playing around a
bit and get used to some stuff.


Bernd

--
Bernd Lentes

System administration
Institute of Developmental Genetics
Building 35.34 - Room 208
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294

Anyone who believes that project managers manage projects also believes
that brimstone butterflies (in German "Zitronenfalter", literally "lemon
folders") fold lemons.
Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
On 06/24/2016 05:41 AM, Adam Spiers wrote:
> Andrew Beekhof wrote:
>> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers wrote:
>>> Andrew Beekhof wrote:
>>>>> Well, if you're OK with bending the rules like this then that's good
>>>>> enough for me to say we should at least try it :)
>>>>
>>>> I still say you shouldn't only do it on error.
>>>
>>> When else should it be done?
>>
>> I was thinking whenever a stop() happens.
>
> OK, seems we are agreed finally :)
>
>>> IIUC, disabling/enabling the service is independent of the up/down
>>> state which nova tracks automatically, and which based on slightly
>>> more than a skim of the code, is dependent on the state of the RPC
>>> layer.
>>>
>>>>> But how would you avoid repeated consecutive invocations of "nova
>>>>> service-disable" when the monitor action fails, and ditto for "nova
>>>>> service-enable" when it succeeds?
>>>>
>>>> I don't think you can. Not ideal but I'd not have thought a deal breaker.
>>>
>>> Sounds like a massive deal-breaker to me! With op monitor
>>> interval="10s" and 100 compute nodes, that would mean 10 pointless
>>> calls to nova-api every second. Am I missing something?
>>
>> I was thinking you would only call it for the "I detected a failure
>> case" and service-enable would still be on start().
>> So the number of pointless calls per second would be capped at one
>> tenth of the number of failed compute nodes.
>>
>> One would hope that all of them weren't dead.
>
> Oh OK - yeah that wouldn't be nearly as bad.
>
>>> Also I don't see any benefit to moving the API calls from start/stop
>>> actions to the monitor action. If there's a failure, Pacemaker will
>>> invoke the stop action, so we can do service-disable there.
>>
>> I agree. Doing it unconditionally at stop() is my preferred option, I
>> was only trying to provide a path that might be close to the behaviour
>> you were looking for.
>>
>>> If the
>>> start action is invoked and we successfully initiate startup of
>>> nova-compute, the RA can undo any service-disable it previously did
>>> (although it should not reverse a service-disable done elsewhere,
>>> e.g. manually by the cloud operator).
>>
>> Agree
>
> Trying to adjust to this new sensation of agreement ;-)
>
>>>>> Earlier in this thread I proposed
>>>>> the idea of a tiny temporary file in /run which tracks the last known
>>>>> state and optimizes away the consecutive invocations, but IIRC you
>>>>> were against that.
>>>>
>>>> I'm generally not a fan, but sometimes state files are a necessity.
>>>> Just make sure you think through what a missing file might mean.
>>>
>>> Sure. A missing file would mean the RA's never called service-disable
>>> before,
>>
>> And that is why I generally don't like state files.
>> The default location for state files doesn't persist across reboots.
>>
>> t1. stop (ie. disable)
>> t2. reboot
>> t3. start with no state file
>> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS
>
> Well then we simply put the state file somewhere which does persist
> across reboots.

There's also the possibility of using a node attribute. If you set a
normal node attribute, it will abort the transition and calculate a new
one, so that's something to take into account. You could set a private
node attribute, which never gets written to the CIB and thus doesn't
abort transitions, but it also does not survive a complete cluster stop.

>>> which means that it shouldn't call service-enable on startup.
>>>
>>>> Unless use the state file to store the date at which the last
>>>> start operation occurred?
>>>>
>>>> If we're calling stop() and date - start_date > threshold, then, if
>>>> you must, be optimistic, skip service-disable and assume we'll get
>>>> started again soon.
>>>>
>>>> Otherwise if we're calling stop() and date - start_date <= threshold,
>>>> always call service-disable because we're in a restart loop which is
>>>> not worth optimising for.
>>>>
>>>> ( And always call service-enable at start() )
>>>>
>>>> No Pacemaker feature or Beekhof approval required :-)
>>>
>>> Hmm ... it's possible I just don't understand this proposal fully,
>>> but it sounds a bit woolly to me, e.g. how would you decide a suitable
>>> threshold?
>>
>> roll a dice?
>>
>>> I think I preferred your other suggestion of just skipping the
>>> optimization, i.e. calling service-disable on the first stop, and
>>> service-enable on (almost) every start.
>>
>> good :)
>>
>> And the use of force-down from your subsequent email sounds excellent
>
> OK great! We finally got there :-) Now I guess I just have to write
> the spec and the actual code ;-)
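The threshold rule quoted above can be written down in a few lines. This is only a sketch of that proposal, not code from any existing resource agent: the function name, the use of seconds since the epoch, and the 300-second threshold in the usage lines are all assumptions made for illustration.

```python
import time


def should_call_service_disable(now: float, last_start: float, threshold: float) -> bool:
    """Decide at stop() time whether to call "nova service-disable".

    Hypothetical helper: `last_start` is the time of the last start
    operation (as it would be read from a state file); all values are
    in seconds.
    """
    uptime = now - last_start
    # The resource ran longer than the threshold: be optimistic, skip
    # service-disable and assume it will be started again soon.
    if uptime > threshold:
        return False
    # Stopped again within the threshold: probably a restart loop, which
    # is not worth optimising for, so always disable.
    return True


if __name__ == "__main__":
    now = time.time()
    # Long uptime before the stop.
    print(should_call_service_disable(now, now - 3600, threshold=300))
    # Stopped again shortly after the last start.
    print(should_call_service_disable(now, now - 20, threshold=300))
```

Nothing in the sketch says how to choose the threshold, which is the objection raised in the thread and the reason the simpler "disable on stop, enable on start" variant was preferred.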
Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
Andrew Beekhof wrote:
> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers wrote:
> > Andrew Beekhof wrote:
> >> > Well, if you're OK with bending the rules like this then that's good
> >> > enough for me to say we should at least try it :)
> >>
> >> I still say you shouldn't only do it on error.
> >
> > When else should it be done?
>
> I was thinking whenever a stop() happens.

OK, seems we are agreed finally :)

> > IIUC, disabling/enabling the service is independent of the up/down
> > state which nova tracks automatically, and which based on slightly
> > more than a skim of the code, is dependent on the state of the RPC
> > layer.
> >
> >> > But how would you avoid repeated consecutive invocations of "nova
> >> > service-disable" when the monitor action fails, and ditto for "nova
> >> > service-enable" when it succeeds?
> >>
> >> I don't think you can. Not ideal but I'd not have thought a deal breaker.
> >
> > Sounds like a massive deal-breaker to me! With op monitor
> > interval="10s" and 100 compute nodes, that would mean 10 pointless
> > calls to nova-api every second. Am I missing something?
>
> I was thinking you would only call it for the "I detected a failure
> case" and service-enable would still be on start().
> So the number of pointless calls per second would be capped at one
> tenth of the number of failed compute nodes.
>
> One would hope that all of them weren't dead.

Oh OK - yeah that wouldn't be nearly as bad.

> > Also I don't see any benefit to moving the API calls from start/stop
> > actions to the monitor action. If there's a failure, Pacemaker will
> > invoke the stop action, so we can do service-disable there.
>
> I agree. Doing it unconditionally at stop() is my preferred option, I
> was only trying to provide a path that might be close to the behaviour
> you were looking for.
>
> > If the
> > start action is invoked and we successfully initiate startup of
> > nova-compute, the RA can undo any service-disable it previously did
> > (although it should not reverse a service-disable done elsewhere,
> > e.g. manually by the cloud operator).
>
> Agree

Trying to adjust to this new sensation of agreement ;-)

> >> > Earlier in this thread I proposed
> >> > the idea of a tiny temporary file in /run which tracks the last known
> >> > state and optimizes away the consecutive invocations, but IIRC you
> >> > were against that.
> >>
> >> I'm generally not a fan, but sometimes state files are a necessity.
> >> Just make sure you think through what a missing file might mean.
> >
> > Sure. A missing file would mean the RA's never called service-disable
> > before,
>
> And that is why I generally don't like state files.
> The default location for state files doesn't persist across reboots.
>
> t1. stop (ie. disable)
> t2. reboot
> t3. start with no state file
> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS

Well then we simply put the state file somewhere which does persist
across reboots.

> > which means that it shouldn't call service-enable on startup.
> >
> >> Unless use the state file to store the date at which the last
> >> start operation occurred?
> >>
> >> If we're calling stop() and date - start_date > threshold, then, if
> >> you must, be optimistic, skip service-disable and assume we'll get
> >> started again soon.
> >>
> >> Otherwise if we're calling stop() and date - start_date <= threshold,
> >> always call service-disable because we're in a restart loop which is
> >> not worth optimising for.
> >>
> >> ( And always call service-enable at start() )
> >>
> >> No Pacemaker feature or Beekhof approval required :-)
> >
> > Hmm ... it's possible I just don't understand this proposal fully,
> > but it sounds a bit woolly to me, e.g. how would you decide a suitable
> > threshold?
>
> roll a dice?
>
> > I think I preferred your other suggestion of just skipping the
> > optimization, i.e. calling service-disable on the first stop, and
> > service-enable on (almost) every start.
>
> good :)
>
> And the use of force-down from your subsequent email sounds excellent

OK great! We finally got there :-) Now I guess I just have to write
the spec and the actual code ;-)
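The t1..t4 scenario from this exchange is easy to reproduce in a toy model. The sketch below is not a real OCF resource agent: the class, the marker file name, and the list standing in for nova-api calls are all hypothetical, and the "reboot" is simulated by deleting the marker file, as would happen to a state file kept in a location that does not persist across reboots.

```python
import tempfile
from pathlib import Path
from typing import List


class ComputeAgentSketch:
    """Toy model of the start/stop actions; not a real resource agent."""

    def __init__(self, state_file: Path, api_calls: List[str]) -> None:
        self.state_file = state_file  # hypothetical marker file
        self.api_calls = api_calls    # stands in for calls to nova-api

    def stop(self) -> None:
        # Disable the service and remember that this agent did it.
        self.api_calls.append("service-disable")
        self.state_file.write_text("disabled-by-ra\n")

    def start(self) -> None:
        # Only undo a disable that this agent recorded, so that a manual
        # disable by the cloud operator is left alone.
        if self.state_file.exists():
            self.api_calls.append("service-enable")
            self.state_file.unlink()


def simulate(reboot_wipes_state: bool) -> List[str]:
    """Run stop -> reboot -> start and return the API calls that were made."""
    calls: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        agent = ComputeAgentSketch(Path(tmp) / "nova-compute.disabled", calls)
        agent.stop()                          # t1: stop (i.e. disable)
        if reboot_wipes_state and agent.state_file.exists():
            agent.state_file.unlink()         # t2: reboot clears volatile state
        agent.start()                         # t3: start, possibly without state
    return calls


if __name__ == "__main__":
    print(simulate(reboot_wipes_state=True))
    print(simulate(reboot_wipes_state=False))
```

With the state wiped, no service-enable call is ever made, which is step t4 in the list above; with a persistent location, the enable happens as intended. Calling service-enable on (almost) every start, as agreed at the end of the message, avoids depending on the marker at all.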
[ClusterLabs] Antw: restarting pacemakerd
>>> Ferenc Wágner wrote on 18.06.2016 at 12:15 in message
<8760t66bwn@lant.ki.iif.hu>:
> Hi,
>
> Could somebody please elaborate a little why the pacemaker systemd
> service file contains "Restart=on-failure"? I mean that a failed node
> gets fenced anyway, so most of the time this would be a futile effort.

I guess it's the "better than nothing" category. ;-)

> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system). Still,
> it is not. I'd be grateful for some insight into the matter.
> --
> Thanks,
> Feri
Re: [ClusterLabs] Antw: Alert notes
On 06/24/2016 09:16 AM, Ulrich Windl wrote:
> >>> Ferenc Wágner wrote on 15.06.2016 at 18:11 in message
> <87vb1a5t4k@lant.ki.iif.hu>:
>> Hi,
>>
> [...]
>> The SNMP agent seems to have a problem with hrSystemDate, which should
>> be an OCTETSTR with strict format, not some plain textual timestamp.
> ???
> snmptranslate -M+. -m+NET-SNMP-MIB -m+HOST-RESOURCES-MIB -Tp -Ib hrSystemDate
> +-- -RW- String    hrSystemDate(2)
>          Textual Convention: DateAndTime
>          Size: 8 | 11
>
> [...]
> DateAndTime ::= TEXTUAL-CONVENTION
>     DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"
>     STATUS       current
>     DESCRIPTION
>             "A date-time specification.
>
>             field  octets  contents                  range
>             -----  ------  --------                  -----
>               1      1-2   year*                     0..65536
>               2       3    month                     1..12
>               3       4    day                       1..31
>               4       5    hour                      0..23
>               5       6    minutes                   0..59
>               6       7    seconds                   0..60
>                            (use 60 for leap-second)
>               7       8    deci-seconds              0..9
>               8       9    direction from UTC        '+' / '-'
>               9      10    hours from UTC*           0..13
>              10      11    minutes from UTC          0..59
>
>             * Notes:
>             - the value of year is in network-byte order
>             - daylight saving time in New Zealand is +13
>
>             For example, Tuesday May 26, 1992 at 1:30:15 PM EDT would be
>             displayed as:
>
>                              1992-5-26,13:30:15.0,-4:0
>
>             Note that if only local time is known, then timezone
>             information (fields 8-10) is not present."
>     SYNTAX       OCTET STRING (SIZE (8 | 11))
>
>

But as already discussed, fortunately one doesn't have to deal with the
binary representation when using the snmptrap tool (as the example
script does), because the tool does the conversion for us. It also
doesn't seem to be picky about zero-padding, so the format string given
in the header of the SNMP example script ("%Y-%m-%d,%H:%M:%S.%01N")
should do the job.

>> But I haven't really looked into this yet.
>> --
>> Regards,
>> Feri
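For readers who want to see what the binary representation looks like, here is a minimal sketch that packs a timestamp into the 8- or 11-octet DateAndTime layout from the table quoted above and renders it according to the DISPLAY-HINT. It is purely illustrative: the function names are made up for this example, and nothing like it is needed when snmptrap does the conversion.

```python
import struct
from datetime import datetime, timedelta, timezone


def encode_date_and_time(dt: datetime) -> bytes:
    """Pack a datetime into the DateAndTime octet layout (8 or 11 octets)."""
    deci_seconds = dt.microsecond // 100_000
    # Year is two octets in network byte order, then six single octets.
    body = struct.pack(">H6B", dt.year, dt.month, dt.day,
                       dt.hour, dt.minute, dt.second, deci_seconds)
    offset = dt.utcoffset()
    if offset is None:
        # Only local time is known: fields 8-10 are not present.
        return body
    total_minutes = int(offset.total_seconds()) // 60
    sign = b"+" if total_minutes >= 0 else b"-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return body + sign + struct.pack("2B", hours, minutes)


def display_date_and_time(raw: bytes) -> str:
    """Render the octets following DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"."""
    if len(raw) not in (8, 11):
        raise ValueError("DateAndTime must be 8 or 11 octets long")
    year, month, day, hour, minute, second, deci = struct.unpack(">H6B", raw[:8])
    text = f"{year}-{month}-{day},{hour}:{minute}:{second}.{deci}"
    if len(raw) == 11:
        text += f",{chr(raw[8])}{raw[9]}:{raw[10]}"
    return text


if __name__ == "__main__":
    # The example from the textual convention: May 26, 1992, 1:30:15 PM EDT.
    edt = timezone(timedelta(hours=-4))
    raw = encode_date_and_time(datetime(1992, 5, 26, 13, 30, 15, tzinfo=edt))
    print(len(raw), raw.hex())
    print(display_date_and_time(raw))  # 1992-5-26,13:30:15.0,-4:0
```

The display form carries no zero-padding, which matches the observation that the padded output of the strftime-style format string is still accepted.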
Re: [ClusterLabs] Cluster reboot for maintenance
Maintenance worked perfectly:

- cluster in maintenance:
  crm configure property maintenance-mode=true
- update vm os etc
- stop corosync/pacemaker
- reboot
- start corosync/pacemaker
- cluster out of maintenance:
  crm configure property maintenance-mode=false
- all resources went up ok

Best regards
Marco

On Mon, 20 Jun 2016 15:42:11 -0500
Ken Gaillot wrote:

> On 06/20/2016 07:45 AM, ma...@nucleus.it wrote:
> > Hi,
> > i have a two node cluster with some vms (pacemaker resources)
> > running on the two hypervisors:
> > pacemaker-1.0.10
> > corosync-1.3.0
> >
> > I need to do maintenance stuff, so i need to:
> > - put the cluster in maintenance so the cluster doesn't
> >   touch/start/stop/monitor the vms
> > - update the vms
> > - stop the vms
> > - stop cluster stuff (corosync/pacemaker) so it does not
> >   start/stop/monitor vms
> > - reboot the hypervisors.
> > - start cluster stuff
> > - remove maintenance from the cluster stuff so it starts all the vms
> >
> > What is the correct way to do that on the corosync/pacemaker side?
> >
> >
> > Best regards
> > Marco
>
> Maintenance mode provides this ability. Set the maintenance-mode
> cluster property to true, do whatever you want, then set it back to
> false when done.
>
> That said, I've never used pacemaker/corosync versions that old, so
> I'm not 100% sure that applies to those versions, though I would
> guess it does.
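For anyone scripting the two maintenance-mode steps from the list above, a small wrapper might look like the following. It is a sketch under assumptions: the function names and the injectable runner are hypothetical, only the two `crm configure property` commands come from this thread, and the steps in between (updating the VMs, stopping and starting corosync/pacemaker, rebooting) are deliberately left out because they depend on the installation.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def maintenance_command(enabled: bool) -> List[str]:
    """Build the crm command that switches maintenance mode on or off."""
    value = "true" if enabled else "false"
    return ["crm", "configure", "property", f"maintenance-mode={value}"]


def run_for_real(cmd: Sequence[str]) -> None:
    """Execute the command and raise if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


def set_maintenance(enabled: bool, run: Runner = run_for_real) -> None:
    """Switch maintenance mode; pass a different `run` for a dry run."""
    run(maintenance_command(enabled))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def show(cmd: Sequence[str]) -> None:
        print(" ".join(cmd))

    set_maintenance(True, run=show)
    set_maintenance(False, run=show)
```

The two calls are kept separate rather than wrapped in a single context manager because the procedure includes a reboot between them.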
[ClusterLabs] Antw: Alert notes
>>> Ferenc Wágner wrote on 15.06.2016 at 18:11 in message
<87vb1a5t4k@lant.ki.iif.hu>:
> Hi,
>
[...]
> The SNMP agent seems to have a problem with hrSystemDate, which should
> be an OCTETSTR with strict format, not some plain textual timestamp.

???
snmptranslate -M+. -m+NET-SNMP-MIB -m+HOST-RESOURCES-MIB -Tp -Ib hrSystemDate
+-- -RW- String    hrSystemDate(2)
         Textual Convention: DateAndTime
         Size: 8 | 11

[...]
DateAndTime ::= TEXTUAL-CONVENTION
    DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"
    STATUS       current
    DESCRIPTION
            "A date-time specification.

            field  octets  contents                  range
            -----  ------  --------                  -----
              1      1-2   year*                     0..65536
              2       3    month                     1..12
              3       4    day                       1..31
              4       5    hour                      0..23
              5       6    minutes                   0..59
              6       7    seconds                   0..60
                           (use 60 for leap-second)
              7       8    deci-seconds              0..9
              8       9    direction from UTC        '+' / '-'
              9      10    hours from UTC*           0..13
             10      11    minutes from UTC          0..59

            * Notes:
            - the value of year is in network-byte order
            - daylight saving time in New Zealand is +13

            For example, Tuesday May 26, 1992 at 1:30:15 PM EDT would be
            displayed as:

                             1992-5-26,13:30:15.0,-4:0

            Note that if only local time is known, then timezone
            information (fields 8-10) is not present."
    SYNTAX       OCTET STRING (SIZE (8 | 11))

> But I haven't really looked into this yet.
> --
> Regards,
> Feri