Re: [ClusterLabs] Stack: unknown and all nodes offline

2015-12-10 Thread Ken Gaillot
On 12/10/2015 01:14 PM, Louis Munro wrote:
> I can now answer parts of my own question.
> 
> 
> My config was missing the quorum configuration:
> 
> quorum {
> # Enable and configure quorum subsystem (default: off)
> # see also corosync.conf.5 and votequorum.5
> provider: corosync_votequorum
> two_node: 1
> expected_votes: 2
> }
> 
> 
> I read the manpage as saying that was optional, but it looks like I may 
> have misread it.
> corosync.conf(5) says the following: 
> 
> Within the quorum directive it is possible to specify the quorum algorithm 
> to use with the provider directive. At the time of writing only 
> corosync_votequorum is supported. See votequorum(5) for configuration 
> options.
> 
> 
> 
> I still have messages in the logs saying 
> crmd:   notice: get_node_name:   Defaulting to uname -n for the local 
> corosync node name
> 
> I am not sure which part of the configuration I should be setting for that.
> 
> Any pointers regarding that would be nice.

Hi,

As long as the unames are what you want the nodes to be called, that
message is fine. You can explicitly set the node names by using a
nodelist {} section in corosync.conf, with each node {} having a
ring0_addr specifying the name.
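
For example (a sketch using the node names from this thread; the nodeid
values are just illustrative):

nodelist {
    node {
        ring0_addr: hack1.example.com
        nodeid: 1
    }
    node {
        ring0_addr: hack2.example.com
        nodeid: 2
    }
}

With ring0_addr set to a resolvable name, pacemaker picks that up as the
node name instead of falling back to uname -n.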

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Stack: unknown and all nodes offline

2015-12-10 Thread Ken Gaillot
On 12/10/2015 12:45 PM, Louis Munro wrote:
> Hello all,
> 
> I am trying to get a Corosync 2 cluster going on CentOS 6.7 but I am running 
> in a bit of a problem with either Corosync or Pacemaker.
> crm reports that all my nodes are offline and the stack is unknown (I am not 
> sure if that is relevant).
> 
> I believe both nodes are actually present and seen in corosync, but they may 
> not be considered as such by pacemaker.
> I have messages in the logs saying that the processes cannot get the node 
> name and default to uname -n: 
> 
> Dec 10 13:38:53 [2236] hack1.example.com   crmd: info: 
> corosync_node_name:Unable to get node name for nodeid 739513528
> Dec 10 13:38:53 [2236] hack1.example.com   crmd:   notice: get_node_name: 
> Defaulting to uname -n for the local corosync node name
> Dec 10 13:38:53 [2236] hack1.example.com   crmd: info: crm_get_peer:  
> Node 739513528 is now known as hack1.example.com
> 
> The uname -n is correct as far as that is concerned.
> 
> 
> Does this mean anything to anyone here? 
> 
> 
> [Lots of details to follow]...
> 
> I compiled my own versions of Corosync, Pacemaker, crm and the 
> resource-agents seemingly without problems.
> 
> Here is what I currently have installed:
> 
> # corosync -v
> Corosync Cluster Engine, version '2.3.5'
> Copyright (c) 2006-2009 Red Hat, Inc.
> 
> # pacemakerd -F
> Pacemaker 1.1.13 (Build: 5b41ae1)
>  Supporting v3.0.10:  generated-manpages agent-manpages ascii-docs ncurses 
> libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native 
> atomic-attrd libesmtp acls
> 
> # crm --version
> crm 2.2.0-rc3
> 
> 
> 
> Here is the output of crm status:
> 
> # crm status
> Last updated: Thu Dec 10 12:47:50 2015
> Last change: Thu Dec 10 12:02:33 2015 by root via cibadmin on hack1.example.com
> Stack: unknown
> Current DC: NONE
> 2 nodes and 0 resources configured
> 
> OFFLINE: [ hack1.example.com hack2.example.com ]
> 
> Full list of resources:
> 
> {nothing to see here}
> 
> 
> 
> # corosync-cmapctl | grep members
> runtime.totem.pg.mrp.srp.members.739513528.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.739513528.ip (str) = r(0) ip(172.20.20.184)
> runtime.totem.pg.mrp.srp.members.739513528.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.739513528.status (str) = joined
> runtime.totem.pg.mrp.srp.members.739513590.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.739513590.ip (str) = r(0) ip(172.20.20.246)
> runtime.totem.pg.mrp.srp.members.739513590.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.739513590.status (str) = joined
> 
> 
> # uname -n
> hack1.example.com
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 739513528
> RING ID 0
>   id  = 172.20.20.184
>   status  = ring 0 active with no faults
> 
> 
> # uname -n
> hack2.example.com
> 
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 739513590
> RING ID 0
>   id  = 172.20.20.246
>   status  = ring 0 active with no faults
> 
> 
> 
> 
> Shouldn’t I see both nodes in the same ring?

They are in the same ring, but the cfgtool will only print the local id.
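
If you want to see all members from one node, query the membership map as
you did with corosync-cmapctl, or use corosync-quorumtool (-l lists the
member nodes, -s shows the quorum status):

# corosync-quorumtool -l
# corosync-quorumtool -s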

> My corosync config is currently defined as:
> 
> # egrep -v '#' /etc/corosync/corosync.conf
> totem {
>   version: 2
> 
>   crypto_cipher: none
>   crypto_hash: none
>   clear_node_high_bit: yes
>   cluster_name: hack_cluster
>   interface {
>   ringnumber: 0
>   bindnetaddr: 172.20.0.0
>   mcastaddr: 239.255.1.1
>   mcastport: 5405
>   ttl: 1
>   }
> 
> }
> 
> logging {
>   fileline: on
>   to_stderr: no
>   to_logfile: yes
>   logfile: /var/log/cluster/corosync.log
>   to_syslog: yes
>   debug: off
>   timestamp: on
>   logger_subsys {
>   subsys: QUORUM
>   debug: off
>   }
> }
> 
> # cat /etc/corosync/service.d/pacemaker
> service {
> name: pacemaker
> ver: 1
> }

You don't want this section if you're using corosync 2. That's the old
"plugin" used with corosync 1.

> 
> 
> And here is my pacemaker configuration:
> 
> # crm config show xml
> 
> <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.4"
>  update-client="cibadmin" epoch="13" admin_epoch="0" update-user="root"
>  cib-last-written="Thu Dec 10 13:35:06 2015">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-stonith-enabled" .../>
>         <nvpair id="cib-bootstrap-options-no-quorum-policy" .../>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node uname="hack1.example.com" ...>
>         <instance_attributes ...>
>           <nvpair id="hack1.example.com-instance_attributes-standby" .../>
>         </instance_attributes>
>       </node>
>       <node uname="hack2.example.com" ...>
>         <instance_attributes ...>
>           <nvpair id="hack2.example.com-instance_attributes-standby" .../>
>         </instance_attributes>
>       </node>
>     </nodes>
>     <resources/>
>     <constraints/>
>   </configuration>
> </cib>
> 
> 
> And finally some logs that might be relevant: 
> 
> Dec 10 13:38:50 [2227] hack1.example.com corosync notice  [MAIN  ] 
> main.c:1227 Corosync Cluster Engine ('2.3.5'): started and ready to provide 
> service.