Re: [ClusterLabs] Stack: unknown and all nodes offline
On 12/10/2015 12:45 PM, Louis Munro wrote:
> Hello all,
>
> I am trying to get a Corosync 2 cluster going on CentOS 6.7 but I am running
> into a bit of a problem with either Corosync or Pacemaker.
> crm reports that all my nodes are offline and the stack is unknown (I am not
> sure if that is relevant).
>
> I believe both nodes are actually present and seen in corosync, but they may
> not be considered as such by pacemaker.
> I have messages in the logs saying that the processes cannot get the node
> name and default to uname -n:
>
> Dec 10 13:38:53 [2236] hack1.example.com crmd: info: corosync_node_name: Unable to get node name for nodeid 739513528
> Dec 10 13:38:53 [2236] hack1.example.com crmd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
> Dec 10 13:38:53 [2236] hack1.example.com crmd: info: crm_get_peer: Node 739513528 is now known as hack1.example.com
>
> The uname -n is correct as far as that is concerned.
>
> Does this mean anything to anyone here?
>
> [Lots of details to follow]...
>
> I compiled my own versions of Corosync, Pacemaker, crm and the
> resource-agents seemingly without problems.
>
> Here is what I currently have installed:
>
> # corosync -v
> Corosync Cluster Engine, version '2.3.5'
> Copyright (c) 2006-2009 Red Hat, Inc.
> # pacemakerd -F
> Pacemaker 1.1.13 (Build: 5b41ae1)
> Supporting v3.0.10: generated-manpages agent-manpages ascii-docs ncurses
> libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native
> atomic-attrd libesmtp acls
>
> # crm --version
> crm 2.2.0-rc3
>
> Here is the output of crm status:
>
> # crm status
> Last updated: Thu Dec 10 12:47:50 2015
> Last change: Thu Dec 10 12:02:33 2015 by root via cibadmin on hack1.example.com
> Stack: unknown
> Current DC: NONE
> 2 nodes and 0 resources configured
>
> OFFLINE: [ hack1.example.com hack2.example.com ]
>
> Full list of resources:
>
> {nothing to see here}
>
> # corosync-cmapctl | grep members
> runtime.totem.pg.mrp.srp.members.739513528.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.739513528.ip (str) = r(0) ip(172.20.20.184)
> runtime.totem.pg.mrp.srp.members.739513528.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.739513528.status (str) = joined
> runtime.totem.pg.mrp.srp.members.739513590.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.739513590.ip (str) = r(0) ip(172.20.20.246)
> runtime.totem.pg.mrp.srp.members.739513590.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.739513590.status (str) = joined
>
> # uname -n
> hack1.example.com
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 739513528
> RING ID 0
> id    = 172.20.20.184
> status = ring 0 active with no faults
>
> # uname -n
> hack2.example.com
>
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 739513590
> RING ID 0
> id    = 172.20.20.246
> status = ring 0 active with no faults
>
> Shouldn’t I see both nodes in the same ring?

They are in the same ring, but the cfgtool will only print the local node ID.
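[Editorial aside, not part of the original thread: the node IDs above line up with the ring0 addresses. When no nodeid is configured, corosync 2 derives the ID from the IPv4 ring0 address, and with clear_node_high_bit: yes it clears bit 31 so the value fits a positive signed 32-bit integer. A small sketch reproducing the IDs seen in this cluster:]

```python
import ipaddress

def corosync_nodeid(ip: str, clear_high_bit: bool = True) -> int:
    """Derive a corosync 2 nodeid from an IPv4 ring0 address.

    With no explicit nodeid, corosync uses the 32-bit integer value of
    the IPv4 address; clear_node_high_bit: yes zeroes bit 31.
    """
    nodeid = int(ipaddress.IPv4Address(ip))
    if clear_high_bit:
        nodeid &= 0x7FFFFFFF  # drop the sign bit
    return nodeid

print(corosync_nodeid("172.20.20.184"))  # 739513528, hack1's ID above
print(corosync_nodeid("172.20.20.246"))  # 739513590, hack2's ID above
```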
> My corosync config is currently defined as:
>
> # egrep -v '#' /etc/corosync/corosync.conf
> totem {
>     version: 2
>
>     crypto_cipher: none
>     crypto_hash: none
>     clear_node_high_bit: yes
>     cluster_name: hack_cluster
>     interface {
>         ringnumber: 0
>         bindnetaddr: 172.20.0.0
>         mcastaddr: 239.255.1.1
>         mcastport: 5405
>         ttl: 1
>     }
> }
>
> logging {
>     fileline: on
>     to_stderr: no
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
>
> # cat /etc/corosync/service.d/pacemaker
> service {
>     name: pacemaker
>     ver: 1
> }

You don't want this section if you're using corosync 2. That's the old
"plugin" used with corosync 1.

> And here is my pacemaker configuration:
>
> # crm config show xml
> [The XML was mangled by the list archive; the surviving fragments show
> crm_feature_set="3.0.10", validate-with="pacemaker-2.4", epoch="13",
> admin_epoch="0", update-client="cibadmin", update-user="root",
> cib-last-written="Thu Dec 10 13:35:06 2015", nvpairs
> cib-bootstrap-options-stonith-enabled and
> cib-bootstrap-options-no-quorum-policy, and a standby instance
> attribute on each of hack1.example.com and hack2.example.com.]
>
> And finally some logs that might be relevant:
>
> Dec 10 13:38:50 [2227] hack1.example.com corosync notice [MAIN  ] main.c:1227 Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
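[Editorial aside, not part of the original thread: bindnetaddr should be the network address of the subnet the ring interface sits on. With nodes at 172.20.20.x and bindnetaddr 172.20.0.0, a /16 prefix is implied; the thread never states the actual netmask, so the /16 below is an assumption.]

```python
import ipaddress

# Hypothetical sanity check: compute the bindnetaddr a node should use
# from its interface address and (assumed) prefix length.
iface = ipaddress.ip_interface("172.20.20.184/16")
bindnetaddr = iface.network.network_address
print(bindnetaddr)  # 172.20.0.0, matching the totem interface section
```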
Re: [ClusterLabs] Stack: unknown and all nodes offline
On 12/10/2015 01:14 PM, Louis Munro wrote:
> I can now answer parts of my own question.
>
> My config was missing the quorum configuration:
>
> quorum {
>     # Enable and configure quorum subsystem (default: off)
>     # see also corosync.conf.5 and votequorum.5
>     provider: corosync_votequorum
>     two_node: 1
>     expected_votes: 2
> }
>
> I read the manpage as saying that was optional, but it looks like I may be
> misreading here. corosync.conf(5) says the following:
>
>     Within the quorum directive it is possible to specify the quorum
>     algorithm to use with the provider directive. At the time of writing
>     only corosync_votequorum is supported. See votequorum(5) for
>     configuration options.
>
> I still have messages in the logs saying
>
>     crmd: notice: get_node_name: Defaulting to uname -n for the local
>     corosync node name
>
> I am not sure which part of the configuration I should be setting for that.
> Any pointers regarding that would be nice.

Hi,

As long as the unames are what you want the nodes to be called, that
message is fine. You can explicitly set the node names by using a
nodelist {} section in corosync.conf, with each node {} having a
ring0_addr specifying the name.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
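[Editorial aside, not part of the original thread: a minimal nodelist for this cluster might look like the sketch below, using the hostnames from the thread. The nodeid values 1 and 2 are illustrative choices, not values the poster used — note that setting explicit nodeids changes the IDs away from the address-derived 739513528/739513590 seen earlier.]

```
nodelist {
    node {
        ring0_addr: hack1.example.com
        nodeid: 1
    }
    node {
        ring0_addr: hack2.example.com
        nodeid: 2
    }
}
```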
Re: [ClusterLabs] Stack: unknown and all nodes offline
I can now answer parts of my own question.

My config was missing the quorum configuration:

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    two_node: 1
    expected_votes: 2
}

I read the manpage as saying that was optional, but it looks like I may be
misreading here. corosync.conf(5) says the following:

    Within the quorum directive it is possible to specify the quorum
    algorithm to use with the provider directive. At the time of writing
    only corosync_votequorum is supported. See votequorum(5) for
    configuration options.

I still have messages in the logs saying

    crmd: notice: get_node_name: Defaulting to uname -n for the local
    corosync node name

I am not sure which part of the configuration I should be setting for that.
Any pointers regarding that would be nice.

Regards,
--
Louis Munro
lmu...@inverse.ca :: www.inverse.ca
+1.514.447.4918 x125 :: +1 (866) 353-6153 x125
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)
[ClusterLabs] Stack: unknown and all nodes offline
Hello all,

I am trying to get a Corosync 2 cluster going on CentOS 6.7 but I am running
into a bit of a problem with either Corosync or Pacemaker.
crm reports that all my nodes are offline and the stack is unknown (I am not
sure if that is relevant).

I believe both nodes are actually present and seen in corosync, but they may
not be considered as such by pacemaker.
I have messages in the logs saying that the processes cannot get the node
name and default to uname -n:

Dec 10 13:38:53 [2236] hack1.example.com crmd: info: corosync_node_name: Unable to get node name for nodeid 739513528
Dec 10 13:38:53 [2236] hack1.example.com crmd: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Dec 10 13:38:53 [2236] hack1.example.com crmd: info: crm_get_peer: Node 739513528 is now known as hack1.example.com

The uname -n is correct as far as that is concerned.

Does this mean anything to anyone here?

[Lots of details to follow]...

I compiled my own versions of Corosync, Pacemaker, crm and the
resource-agents seemingly without problems.

Here is what I currently have installed:

# corosync -v
Corosync Cluster Engine, version '2.3.5'
Copyright (c) 2006-2009 Red Hat, Inc.
# pacemakerd -F
Pacemaker 1.1.13 (Build: 5b41ae1)
Supporting v3.0.10: generated-manpages agent-manpages ascii-docs ncurses
libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native
atomic-attrd libesmtp acls

# crm --version
crm 2.2.0-rc3

Here is the output of crm status:

# crm status
Last updated: Thu Dec 10 12:47:50 2015
Last change: Thu Dec 10 12:02:33 2015 by root via cibadmin on hack1.example.com
Stack: unknown
Current DC: NONE
2 nodes and 0 resources configured

OFFLINE: [ hack1.example.com hack2.example.com ]

Full list of resources:

{nothing to see here}

# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.739513528.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.739513528.ip (str) = r(0) ip(172.20.20.184)
runtime.totem.pg.mrp.srp.members.739513528.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.739513528.status (str) = joined
runtime.totem.pg.mrp.srp.members.739513590.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.739513590.ip (str) = r(0) ip(172.20.20.246)
runtime.totem.pg.mrp.srp.members.739513590.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.739513590.status (str) = joined

# uname -n
hack1.example.com

# corosync-cfgtool -s
Printing ring status.
Local node ID 739513528
RING ID 0
id    = 172.20.20.184
status = ring 0 active with no faults

# uname -n
hack2.example.com

# corosync-cfgtool -s
Printing ring status.
Local node ID 739513590
RING ID 0
id    = 172.20.20.246
status = ring 0 active with no faults

Shouldn’t I see both nodes in the same ring?
My corosync config is currently defined as:

# egrep -v '#' /etc/corosync/corosync.conf
totem {
    version: 2

    crypto_cipher: none
    crypto_hash: none
    clear_node_high_bit: yes
    cluster_name: hack_cluster
    interface {
        ringnumber: 0
        bindnetaddr: 172.20.0.0
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
}

logging {
    fileline: on
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

# cat /etc/corosync/service.d/pacemaker
service {
    name: pacemaker
    ver: 1
}

And here is my pacemaker configuration:

# crm config show xml
[The XML output was not preserved in the archive.]

And finally some logs that might be relevant:

Dec 10 13:38:50 [2227] hack1.example.com corosync notice [MAIN  ] main.c:1227 Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Dec 10 13:38:50 [2227] hack1.example.com corosync info   [MAIN  ] main.c:1228 Corosync built-in features: pie relro bindnow
Dec 10 13:38:50 [2227] hack1.example.com corosync notice [TOTEM ] totemnet.c:248 Initializing transport (UDP/IP Multicast).
Dec 10 13:38:50 [2227] hack1.example.com corosync notice [TOTEM ] totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto: none hash: none
Dec 10 13:38:50 [2227] hack1.example.com corosync notice [TOTEM ] totemudp.c:671 The network interface [172.20.20.184] is now up.
Dec 10 13:38:50 [2227] hack1.example.com corosync notice [SERV  ] service.c:174 Service engine loaded: corosync configuration map access [0]
Dec 10 13:38:50 [2227] hack1.example.com corosync info   [QB    ] ipc_setup.c:377 server name: cmap
Dec 10 13:38:50 [2227] hack1.example.com corosync notice [SERV  ] service.c:174 Service engine loaded: corosync co
Re: [ClusterLabs] duplicate node
Hi,

On Tue, Dec 08, 2015 at 09:17:27PM +0000, gerry kernan wrote:
> Hi
>
> How would I remove a duplicate node? I have a 2-node setup, but one node is
> showing twice. crm configure show output below; node gat-voip-01.gdft.org is
> listed twice.
>
> node $id="0dc85a64-01ad-4fc5-81fd-698208a8322c" gat-voip-02 \
>     attributes standby="on"
> node $id="3b5d1061-8f68-4ab3-b169-e0ebe890c446" gat-voip-01
> node $id="ae4d76e7-af64-4d93-acdd-4d7b5c274eff" gat-voip-01 \
>     attributes standby="off"

First you need to figure out which one is the old uuid, then try:

# crm node delete

This looks like heartbeat; there used to be a crm_uuid or something
similar to read the uuid. There's also a uuid file somewhere in
/var/lib/heartbeat.

Thanks,
Dejan

> primitive res_Filesystem_rep ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/rep" fstype="ext3" \
>     operations $id="res_Filesystem_rep-operations" \
>     op start interval="0" timeout="60" \
>     op stop interval="0" timeout="60" \
>     op monitor interval="20" timeout="40" start-delay="0" \
>     op notify interval="0" timeout="60" \
>     meta target-role="started" is-managed="true"
> primitive res_IPaddr2_northIP ocf:heartbeat:IPaddr2 \
>     params ip="10.75.29.10" cidr_netmask="26" \
>     operations $id="res_IPaddr2_northIP-operations" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor interval="10" timeout="20" start-delay="0" \
>     meta target-role="started" is-managed="true"
> primitive res_IPaddr2_sipIP ocf:heartbeat:IPaddr2 \
>     params ip="158.255.224.226" nic="bond2" \
>     operations $id="res_IPaddr2_sipIP-operations" \
>     op start interval="0" timeout="20" \
>     op stop interval="0" timeout="20" \
>     op monitor interval="10" timeout="20" start-delay="0" \
>     meta target-role="started" is-managed="true"
> primitive res_asterisk_res_asterisk lsb:asterisk \
>     operations $id="res_asterisk_res_asterisk-operations" \
>     op start interval="0" timeout="15" \
>     op stop interval="0" timeout="15" \
>     op monitor interval="15" timeout="15" start-delay="15" \
>     meta target-role="started" is-managed="true"
> primitive res_drbd_1 ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     operations $id="res_drbd_1-operations" \
>     op start interval="0" timeout="240" \
>     op promote interval="0" timeout="90" \
>     op demote interval="0" timeout="90" \
>     op stop interval="0" timeout="100" \
>     op monitor interval="10" timeout="20" start-delay="0" \
>     op notify interval="0" timeout="90"
> primitive res_httpd_res_httpd lsb:httpd \
>     operations $id="res_httpd_res_httpd-operations" \
>     op start interval="0" timeout="15" \
>     op stop interval="0" timeout="15" \
>     op monitor interval="15" timeout="15" start-delay="15" \
>     meta target-role="started" is-managed="true"
> primitive res_mysqld_res_mysql lsb:mysqld \
>     operations $id="res_mysqld_res_mysql-operations" \
>     op start interval="0" timeout="15" \
>     op stop interval="0" timeout="15" \
>     op monitor interval="15" timeout="15" start-delay="15" \
>     meta target-role="started"
> group asterisk res_Filesystem_rep res_IPaddr2_northIP res_IPaddr2_sipIP
>     res_mysqld_res_mysql res_httpd_res_httpd res_asterisk_res_asterisk
> ms ms_drbd_1 res_drbd_1 \
>     meta clone-max="2" notify="true" interleave="true"
>     resource-stickiness="100"
> location loc_res_httpd_res_httpd_gat-voip-01.gdft.org asterisk inf:
>     gat-voip-01.gdft.org
> location loc_res_mysqld_res_mysql_gat-voip-01.gdft.org asterisk inf:
>     gat-voip-01.gdft.org
> colocation col_res_Filesystem_rep_ms_drbd_1 inf: asterisk ms_drbd_1:Master
> order ord_ms_drbd_1_res_Filesystem_rep inf: ms_drbd_1:promote asterisk:start
> property $id="cib-bootstrap-options" \
>     stonith-enabled="false" \
>     dc-version="1.0.12-unknown" \
>     no-quorum-policy="ignore" \
>     cluster-infrastructure="Heartbeat" \
>     last-lrm-refresh="1345727614"
>
> Gerry Kernan
>
> Infinity IT | 17 The Mall | Beacon Court | Sandyford | Dublin
> D18 E3C8 | Ireland
> Tel: +353 - (0)1 - 293 0090 | E-Mail: gerry.ker...@infinityit.ie
>
> Managed IT Services    Infinity IT - www.infinityit.ie
> IP Telephony           Asterisk Consulting - www.asteriskconsulting.com
> Contact Centre         Total Interact - www.totalinteract.com
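[Editorial aside, not part of the original thread: the duplicate-uuid situation above can be spotted mechanically by grouping node entries by uname. The sketch below is hypothetical — it parses a CIB-style nodes section (the element names mirror the thread's $id/uname pairs, not a dump the poster provided) and reports unames that carry more than one id.]

```python
import xml.etree.ElementTree as ET

# CIB-style nodes section built from the ids and names quoted in the thread.
cib_nodes = """
<nodes>
  <node id="0dc85a64-01ad-4fc5-81fd-698208a8322c" uname="gat-voip-02"/>
  <node id="3b5d1061-8f68-4ab3-b169-e0ebe890c446" uname="gat-voip-01"/>
  <node id="ae4d76e7-af64-4d93-acdd-4d7b5c274eff" uname="gat-voip-01"/>
</nodes>
"""

def find_duplicates(xml_text: str) -> dict:
    """Return {uname: [ids]} for unames appearing under more than one id."""
    ids_by_uname: dict[str, list[str]] = {}
    for node in ET.fromstring(xml_text).findall("node"):
        ids_by_uname.setdefault(node.get("uname"), []).append(node.get("id"))
    return {u: ids for u, ids in ids_by_uname.items() if len(ids) > 1}

print(find_duplicates(cib_nodes))
# gat-voip-01 maps to two ids; compare against the live uuid (e.g. the
# uuid file under /var/lib/heartbeat) before deleting the stale entry.
```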