Re: [Pacemaker] issues when installing on pxe booted environment
Ah, /dev/shm was owned by root:root and writable only by its owner. Opening it up seems to have kicked something the right way. Thanks folks.

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Apr 11, 2013, at 1:37 PM, John White wrote:

> Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).
>
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 27, 2013, at 4:46 PM, Andrew Beekhof wrote:
>
>> What about /dev/shm ?
>> Libqb tries to create some shared memory in that location by default.
>>
>> On Thu, Mar 28, 2013 at 8:50 AM, John White wrote:
>>> Yup:
>>> -bash-4.1$ cd /var/run/crm/
>>> -bash-4.1$ ls
>>> lost+found pcmk pengine st_callback st_command
>>> -bash-4.1$ touch blah
>>> -bash-4.1$ ls -l
>>> total 16
>>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>>> drwx-- 2 root root 16384 Mar 14 15:00 lost+found
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>>> -bash-4.1$ ls -l /var/run/| grep crm
>>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>>> -bash-4.1$ whoami
>>> hacluster
>>> -bash-4.1$
>>>
>>> John White
>>> HPC Systems Engineer
>>> (510) 486-7307
>>> One Cyclotron Rd, MS: 50C-3209C
>>> Lawrence Berkeley National Lab
>>> Berkeley, CA 94720
>>>
>>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote:
>>> On 2013-03-22 19:31, John White wrote:
> Hello Folks,
> We're trying to get a corosync/pacemaker instance going on a 4 node
> cluster that boots via pxe. There have been a number of state/file
> system issues, but those appear to be *mostly* taken care of thus far.
> We're running into an issue now where cib just isn't staying up with > errors akin to the following (sorry for the lengthy dump, note the attrd > and cib connection errors). Any ideas would be greatly appreciated: > > Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating > RNG parser context > Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: > /usr/lib64/heartbeat/attrd > Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed > active directory to /var/lib/heartbeat/cores/hacluster > Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up > Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster > type is: 'corosync' > Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: > Connecting to cluster infrastructure: corosync > Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could > not connect to the Cluster Process Group API: 2 > Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed > Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection > active > Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute > updates > Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup > Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: > /usr/lib64/heartbeat/pengine > Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: > Changed active directory to /var/lib/heartbeat/cores/hacluster > Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old > instances of pengine > Mar 22 11:25:18 n0014 pengine: [25842]: debug: > init_client_ipc_comms_nodispatch: Attempting to talk on: > /var/run/crm/pengine That "/var/run/crm" directory is available and owned by hacluster.haclient ... and writable by at least the hacluster user? Regards, Andreas -- Need help with Pacemaker? 
http://www.hastexo.com/now > Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child > process attrd exited (pid=25841, rc=100) > Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child > process attrd no longer wishes to be respawned > Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: > Node n0014.lustre now has process list: 00110312 > (was 00111312) > Mar 22 11:25:18 n0014 pengine: [25842]: debug: > init_client_ipc_comms_nodispatch: Could not init comms on: > /var/run/crm/pengine > Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms > Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine > Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: > Adding fd=4 to mainloop > Mar 22 11:25:18 n0014 s
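
A minimal sketch of the check behind the fix reported at the top of this message. The thread only says that /dev/shm was root:root, writable by its owner only, and that "opening it up" helped. The target mode 1777 (world-writable with the sticky bit, the usual mode for /dev/shm on Linux) and the helper name `shm_mode_ok` are assumptions, not something stated in the thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given directory exists and has mode 1777.
# Relies on GNU coreutils `stat -c %a`, which prints the octal permission bits.
shm_mode_ok() {
    dir=$1
    [ -d "$dir" ] || return 1
    [ "$(stat -c %a "$dir")" = "1777" ]
}

# Report only. Actually changing the mode needs root, e.g. `chmod 1777 /dev/shm`.
if shm_mode_ok /dev/shm; then
    echo "/dev/shm: mode 1777, ok"
else
    echo "/dev/shm: missing or not mode 1777"
fi
```

On a PXE-booted image the check would probably belong in whatever builds or populates the root filesystem, so that the mode is already right before corosync starts.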
Re: [Pacemaker] issues when installing on pxe booted environment
Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Mar 27, 2013, at 4:46 PM, Andrew Beekhof wrote:

> What about /dev/shm ?
> Libqb tries to create some shared memory in that location by default.
>
> On Thu, Mar 28, 2013 at 8:50 AM, John White wrote:
>> Yup:
>> -bash-4.1$ cd /var/run/crm/
>> -bash-4.1$ ls
>> lost+found pcmk pengine st_callback st_command
>> -bash-4.1$ touch blah
>> -bash-4.1$ ls -l
>> total 16
>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>> drwx-- 2 root root 16384 Mar 14 15:00 lost+found
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>> -bash-4.1$ ls -l /var/run/| grep crm
>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>> -bash-4.1$ whoami
>> hacluster
>> -bash-4.1$
>>
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>>
>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote:
>>
>>> On 2013-03-22 19:31, John White wrote:
Hello Folks,
We're trying to get a corosync/pacemaker instance going on a 4 node cluster that boots via pxe. There have been a number of state/file system issues, but those appear to be *mostly* taken care of thus far. We're running into an issue now where cib just isn't staying up with errors akin to the following (sorry for the lengthy dump, note the attrd and cib connection errors).
Any ideas would be greatly appreciated: Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync' Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2 Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine >>> >>> That "/var/run/crm" directory is available and owned by >>> hacluster.haclient ... and writable by at least the hacluster user? >>> >>> Regards, >>> Andreas >>> >>> -- >>> Need help with Pacemaker? 
>>> http://www.hastexo.com/now >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100) Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312) Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234 Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014
Re: [Pacemaker] issues when installing on pxe booted environment
On 29/03/2013, at 11:06 PM, Rainer Brestan wrote:

> Puh, I hadn't expected the discussion to go in this direction.
>
> Corosync is not the only software that needs shared memory, so providing it
> is part of the OS startup, not part of Corosync or Pacemaker.
> And yes, it is too late to mount it in Pacemaker startup.
> Even in most embedded Linux installations shared memory is available, just
> with a smaller size.
> So, there is no need to include anything in Corosync or Pacemaker startup
> directly.

I guess my point is that if there is something the cluster needs in order to function, we should either error out in an obvious manner or take steps to make it available so that everything JustWorks. I'd prefer the latter personally.

>
> My answer about shared memory is valid and necessary only for anaconda based
> installations.
> When you install a RedHat system, the installation process is done by
> anaconda with a running miniroot (install.img) as ram disk.
> Within this miniroot shared memory is enabled, but they (RedHat) missed the
> mount.
> If somebody wants to correct it, RedHat (more specifically, the maintainer of
> install.img) is the correct address for this gap.

Has anyone filed a bug in their bugzilla for this?

>
> World read access to CRM_CORE_DIR should be enough if the core pattern is
> explicitly set to another directory (not tested yet).
> When resource agents are called, CRM_CORE_DIR is the current working
> directory (tested by echoing all environment variables to a file inside a
> monitor call).
> If this directory is not readable after a resource agent switches user,
> stupid things could happen. Anyhow, switching user without setting up the
> environment of the target user is not a good way to write resource agents.
> So this is no fault of Pacemaker, more of loosely programmed resource agents.
> When the resource agent or any of its children produces a core dump, it goes
> either to the current directory or to the directory specified by the kernel
> core pattern.
> If the core goes into the current directory and is written by a switched
> user who is not a member of pcmk_gid, the core is lost.

Good point. I guess there is no avoiding 777 then.

>
> As I am not sure every resource agent is written with a proper switch-user
> environment and without rewriting my core pattern, enabling world write on
> CRM_CORE_DIR was the easy workaround for it.
>
> Rainer
> Sent: Friday, 29 March 2013 at 09:51
> From: "Jacek Konieczny"
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] issues when installing on pxe booted environment
> On Fri, 29 Mar 2013 11:37:37 +1100
> Andrew Beekhof wrote:
>
> > On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
> > wrote:
> > > Hi John,
> > > to get Corosync/Pacemaker running during anaconda installation, i
> > > have created a configuration RPM package which does a few actions
> > > before starting Corosync and Pacemaker.
> > >
> > > An excerpt of the post install of this RPM.
> > > # mount /dev/shm if not already existing, otherwise openais cannot work
> > > if [ ! -d /dev/shm ]; then
> > > mkdir /dev/shm
> > > mount /dev/shm
> > > fi
> >
> > Perhaps mention this to the corosync guys, it should probably go into
> > their init script.
>
> I don't think so. It is just a part of modern Linux system environment.
> corosync is not supposed to mount the root filesystem or /proc –
> mounting /dev/shm is not its responsibility either.
>
> BTW The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
> Should this be added there by the corosync init script too?
>
> Greets,
> Jacek
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
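
The permission question in this message (755 or 775 versus 777 on CRM_CORE_DIR) comes down to whether the "other" write bit is set: a resource agent that has switched to a user outside pcmk_gid is neither the owner nor in the group. A small sketch of that check follows. `other_can_write` is a hypothetical helper name and not part of Pacemaker, and the three modes are simply the ones mentioned in the thread.

```shell
#!/bin/sh
# Hypothetical helper: does an octal mode grant write access to "other"?
# The leading 0 makes the shell read the value as octal; bit 2 is other-write.
other_can_write() {
    mode=$1
    [ $(( 0$mode & 2 )) -ne 0 ]
}

for m in 755 775 777; do
    if other_can_write "$m"; then
        echo "$m: a user outside owner and group can write"
    else
        echo "$m: a user outside owner and group cannot write"
    fi
done
```

Only 777 passes, which matches the conclusion reached above that a core dump written by an unrelated user into the current directory needs world write access.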
Re: [Pacemaker] issues when installing on pxe booted environment
Puh, I hadn't expected the discussion to go in this direction.

Corosync is not the only software that needs shared memory, so providing it is part of the OS startup, not part of Corosync or Pacemaker.
And yes, it is too late to mount it in Pacemaker startup.
Even in most embedded Linux installations shared memory is available, just with a smaller size.
So, there is no need to include anything in Corosync or Pacemaker startup directly.

My answer about shared memory is valid and necessary only for anaconda based installations.
When you install a RedHat system, the installation process is done by anaconda with a running miniroot (install.img) as ram disk.
Within this miniroot shared memory is enabled, but they (RedHat) missed the mount.
If somebody wants to correct it, RedHat (more specifically, the maintainer of install.img) is the correct address for this gap.

World read access to CRM_CORE_DIR should be enough if the core pattern is explicitly set to another directory (not tested yet).
When resource agents are called, CRM_CORE_DIR is the current working directory (tested by echoing all environment variables to a file inside a monitor call).
If this directory is not readable after a resource agent switches user, stupid things could happen. Anyhow, switching user without setting up the environment of the target user is not a good way to write resource agents. So this is no fault of Pacemaker, more of loosely programmed resource agents.
When the resource agent or any of its children produces a core dump, it goes either to the current directory or to the directory specified by the kernel core pattern.
If the core goes into the current directory and is written by a switched user who is not a member of pcmk_gid, the core is lost.

As I am not sure every resource agent is written with a proper switch-user environment and without rewriting my core pattern, enabling world write on CRM_CORE_DIR was the easy workaround for it.

Rainer

Sent: Friday, 29 March 2013 at 09:51
From: "Jacek Konieczny"
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] issues when installing on pxe booted environment

On Fri, 29 Mar 2013 11:37:37 +1100
Andrew Beekhof wrote:

> On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
> wrote:
> > Hi John,
> > to get Corosync/Pacemaker running during anaconda installation, i
> > have created a configuration RPM package which does a few actions
> > before starting Corosync and Pacemaker.
> >
> > An excerpt of the post install of this RPM.
> > # mount /dev/shm if not already existing, otherwise openais cannot work
> > if [ ! -d /dev/shm ]; then
> > mkdir /dev/shm
> > mount /dev/shm
> > fi
>
> Perhaps mention this to the corosync guys, it should probably go into
> their init script.

I don't think so. It is just a part of modern Linux system environment.
corosync is not supposed to mount the root filesystem or /proc –
mounting /dev/shm is not its responsibility either.

BTW The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
Should this be added there by the corosync init script too?

Greets,
Jacek
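
Whether a core dump lands in the current directory (CRM_CORE_DIR, per the message above) or elsewhere depends on the kernel core pattern. A minimal sketch of how one might tell the cases apart, based on the behaviour documented in core(5): a pattern starting with `|` is piped to a handler program, a pattern starting with `/` names an absolute location, and anything else is relative to the crashing process's current directory. `core_pattern_kind` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel core_pattern string.
core_pattern_kind() {
    case $1 in
        "|"*) echo pipe ;;      # handed to a user-space handler program
        /*)   echo absolute ;;  # written to a fixed path, independent of cwd
        *)    echo cwd ;;       # relative: lands in the process's current directory
    esac
}

# Classify the running kernel's setting, if it is readable on this system.
if [ -r /proc/sys/kernel/core_pattern ]; then
    core_pattern_kind "$(cat /proc/sys/kernel/core_pattern)"
fi
```

Only the `cwd` case makes the write permission on CRM_CORE_DIR matter for where the core ends up.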
Re: [Pacemaker] issues when installing on pxe booted environment
On Fri, 29 Mar 2013 11:37:37 +1100
Andrew Beekhof wrote:

> On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
> wrote:
> > Hi John,
> > to get Corosync/Pacemaker running during anaconda installation, i
> > have created a configuration RPM package which does a few actions
> > before starting Corosync and Pacemaker.
> >
> > An excerpt of the post install of this RPM.
> > # mount /dev/shm if not already existing, otherwise openais cannot work
> > if [ ! -d /dev/shm ]; then
> > mkdir /dev/shm
> > mount /dev/shm
> > fi
>
> Perhaps mention this to the corosync guys, it should probably go into
> their init script.

I don't think so. It is just a part of modern Linux system environment.
corosync is not supposed to mount the root filesystem or /proc –
mounting /dev/shm is not its responsibility either.

BTW The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
Should this be added there by the corosync init script too?

Greets,
Jacek
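
The quoted excerpt has two gaps that this message and the rest of the thread point at: `mount /dev/shm` only works with an /etc/fstab entry, and the `-d` test does nothing when the directory exists but no tmpfs is mounted on it. A hedged sketch of a variant that avoids both is below. It is not the script from the RPM. The `ensure_shm` name, the `DRY_RUN` switch and the `mode=1777` option are assumptions, and `mountpoint` is the util-linux tool.

```shell
#!/bin/sh
# Default to a dry run so the sketch is safe to execute; set DRY_RUN=0 (as root) to mount.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: make sure a tmpfs is mounted on the given directory,
# naming the filesystem type explicitly so no /etc/fstab entry is needed.
ensure_shm() {
    dir=${1:-/dev/shm}
    if mountpoint -q "$dir" 2>/dev/null; then
        echo "$dir already mounted"
        return 0
    fi
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: mkdir -p $dir && mount -t tmpfs -o mode=1777 tmpfs $dir"
        return 0
    fi
    mkdir -p "$dir" && mount -t tmpfs -o mode=1777 tmpfs "$dir"
}

ensure_shm /dev/shm
```

Where such a step should live (installer image, init script, or a site-specific package) is exactly what the thread is debating; the sketch takes no position on that.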
Re: [Pacemaker] issues when installing on pxe booted environment
On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan wrote:
> Hi John,
> to get Corosync/Pacemaker running during anaconda installation, i have
> created a configuration RPM package which does a few actions before starting
> Corosync and Pacemaker.
>
> An excerpt of the post install of this RPM.
> # mount /dev/shm if not already existing, otherwise openais cannot work
> if [ ! -d /dev/shm ]; then
> mkdir /dev/shm
> mount /dev/shm
> fi

Perhaps mention this to the corosync guys, it should probably go into their init script.
I'd put it in pacemaker but that's likely too late.

> # resource agents might run as different user
> chmod -R go+rwx /var/lib/heartbeat/cores

I'm about to change the permissions to 775 for this. Would that be sufficient?

build_path(CRM_CORE_DIR, 0755);
mcp_chown(CRM_CORE_DIR, pcmk_uid, pcmk_gid);

>
> Rainer
>
> Sent: Thursday, 28 March 2013 at 00:46
> From: "Andrew Beekhof"
> To: "The Pacemaker cluster resource manager"
> Subject: Re: [Pacemaker] issues when installing on pxe booted environment
> What about /dev/shm ?
> Libqb tries to create some shared memory in that location by default.
> > On Thu, Mar 28, 2013 at 8:50 AM, John White wrote: >> Yup: >> -bash-4.1$ cd /var/run/crm/ >> -bash-4.1$ ls >> lost+found pcmk pengine st_callback st_command >> -bash-4.1$ touch blah >> -bash-4.1$ ls -l >> total 16 >> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah >> drwx-- 2 root root 16384 Mar 14 15:00 lost+found >> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk >> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine >> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback >> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command >> -bash-4.1$ ls -l /var/run/| grep crm >> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm >> -bash-4.1$ whoami >> hacluster >> -bash-4.1$ >> >> John White >> HPC Systems Engineer >> (510) 486-7307 >> One Cyclotron Rd, MS: 50C-3209C >> Lawrence Berkeley National Lab >> Berkeley, CA 94720 >> >> On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote: >> >>> On 2013-03-22 19:31, John White wrote: >>>> Hello Folks, >>>> We're trying to get a corosync/pacemaker instance going on a 4 node >>>> cluster that boots via pxe. There have been a number of state/file system >>>> issues, but those appear to be *mostly* taken care of thus far. We're >>>> running into an issue now where cib just isn't staying up with errors akin >>>> to the following (sorry for the lengthy dump, note the attrd and cib >>>> connection errors). 
Any ideas would be greatly appreciated: >>>> >>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: >>>> Creating RNG parser context >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: >>>> /usr/lib64/heartbeat/attrd >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed >>>> active directory to /var/lib/heartbeat/cores/hacluster >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster >>>> type is: 'corosync' >>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: >>>> Connecting to cluster infrastructure: corosync >>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could >>>> not connect to the Cluster Process Group API: 2 >>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection >>>> active >>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute >>>> updates >>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup >>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: >>>> /usr/lib64/heartbeat/pengine >>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: >>>> Changed active directory to /var/lib/heartbeat/cores/hacluster >>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old >>>> instances of pengine >>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: >>>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>>> /var/run/crm/pengine >>> >>> That "/var/run/crm" directory is available and owned by >>> hacluster.haclient ... and wri
Re: [Pacemaker] issues when installing on pxe booted environment
Hi John,
to get Corosync/Pacemaker running during anaconda installation, I have created a configuration RPM package which does a few actions before starting Corosync and Pacemaker.

An excerpt of the post install of this RPM.

# mount /dev/shm if not already existing, otherwise openais cannot work
if [ ! -d /dev/shm ]; then
    mkdir /dev/shm
    mount /dev/shm
fi
# resource agents might run as different user
chmod -R go+rwx /var/lib/heartbeat/cores

Rainer

Sent: Thursday, 28 March 2013 at 00:46
From: "Andrew Beekhof"
To: "The Pacemaker cluster resource manager"
Subject: Re: [Pacemaker] issues when installing on pxe booted environment

What about /dev/shm ?
Libqb tries to create some shared memory in that location by default.

On Thu, Mar 28, 2013 at 8:50 AM, John White wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found pcmk pengine st_callback st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
> drwx-- 2 root root 16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
>
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>> We're trying to get a corosync/pacemaker instance going on a 4 node cluster that boots via pxe. There have been a number of state/file system issues, but those appear to be *mostly* taken care of thus far.
We're running into an issue now where cib just isn't staying up with errors akin to the following (sorry for the lengthy dump, note the attrd and cib connection errors). Any ideas would be greatly appreciated: >>> >>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync' >>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2 >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine >> >> That "/var/run/crm" directory is available and owned by >> hacluster.haclient ... and writable by at least the hacluster user? >> >> Regards, >> Andreas >> >> -- >> Need help with Pacemaker? 
>> http://www.hastexo.com/now >> >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100) >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312) >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine >>> Mar 22 11:25:18 n0014 s
Re: [Pacemaker] issues when installing on pxe booted environment
What about /dev/shm ?
Libqb tries to create some shared memory in that location by default.

On Thu, Mar 28, 2013 at 8:50 AM, John White wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found pcmk pengine st_callback st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
> drwx-- 2 root root 16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
>
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>> cluster that boots via pxe. There have been a number of state/file system
>>> issues, but those appear to be *mostly* taken care of thus far. We're
>>> running into an issue now where cib just isn't staying up with errors akin
>>> to the following (sorry for the lengthy dump, note the attrd and cib
>>> connection errors).
Any ideas would be greatly appreciated: >>> >>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating >>> RNG parser context >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: >>> /usr/lib64/heartbeat/attrd >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed >>> active directory to /var/lib/heartbeat/cores/hacluster >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type >>> is: 'corosync' >>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: >>> Connecting to cluster infrastructure: corosync >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not >>> connect to the Cluster Process Group API: 2 >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active >>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute >>> updates >>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: >>> /usr/lib64/heartbeat/pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed >>> active directory to /var/lib/heartbeat/cores/hacluster >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old >>> instances of pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: >>> init_client_ipc_comms_nodispatch: Attempting to talk on: >>> /var/run/crm/pengine >> >> That "/var/run/crm" directory is available and owned by >> hacluster.haclient ... and writable by at least the hacluster user? >> >> Regards, >> Andreas >> >> -- >> Need help with Pacemaker? 
>> http://www.hastexo.com/now >> >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child >>> process attrd exited (pid=25841, rc=100) >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child >>> process attrd no longer wishes to be respawned >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: >>> Node n0014.lustre now has process list: 00110312 >>> (was 00111312) >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: >>> init_client_ipc_comms_nodispatch: Could not init comms on: >>> /var/run/crm/pengine >>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms >>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: >>> Adding fd=4 to mainloop >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: >>> Connection to 'corosync': established >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating >>> entry for node n0014.lustre/247988234 >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node >>> n0014.lustre now has id: 247988234 >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node >>> 247988234 is now known as n0014.lustre >>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: >>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk >>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: >>> /usr/lib64/heartbeat/crmd >>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: >>> Channel 0x995530 connected: 1 children >>> Mar 22 11:25:18 n0014 stonith-ng: [25
Re: [Pacemaker] issues when installing on pxe booted environment
Yup:
-bash-4.1$ cd /var/run/crm/
-bash-4.1$ ls
lost+found pcmk pengine st_callback st_command
-bash-4.1$ touch blah
-bash-4.1$ ls -l
total 16
-rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
drwx-- 2 root root 16384 Mar 14 15:00 lost+found
srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
-bash-4.1$ ls -l /var/run/| grep crm
drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
-bash-4.1$ whoami
hacluster
-bash-4.1$

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Mar 25, 2013, at 4:21 PM, Andreas Kurz wrote:

> On 2013-03-22 19:31, John White wrote:
>> Hello Folks,
>> We're trying to get a corosync/pacemaker instance going on a 4 node
>> cluster that boots via pxe. There have been a number of state/file system
>> issues, but those appear to be *mostly* taken care of thus far. We're
>> running into an issue now where cib just isn't staying up with errors akin
>> to the following (sorry for the lengthy dump, note the attrd and cib
>> connection errors).
Re: [Pacemaker] issues when installing on pxe booted environment
On 2013-03-22 19:31, John White wrote:
> Hello Folks,
> We're trying to get a corosync/pacemaker instance going on a 4 node
> cluster that boots via pxe. There have been a number of state/file system
> issues, but those appear to be *mostly* taken care of thus far. We're
> running into an issue now where cib just isn't staying up with errors akin to
> the following (sorry for the lengthy dump, note the attrd and cib connection
> errors). Any ideas would be greatly appreciated:
>
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG
> parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked:
> /usr/lib64/heartbeat/attrd
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type
> is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting
> to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not
> connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked:
> /usr/lib64/heartbeat/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old
> instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

That "/var/run/crm" directory is available and owned by
hacluster.haclient ... and writable by at least the hacluster user?

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child
> process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child
> process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node
> n0014.lustre now has process list: 00110312 (was
> 00111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
> init_client_ipc_comms_nodispatch: Could not init comms on:
> /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding
> fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once:
> Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating
> entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node
> n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234
> is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug:
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect:
> Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng
> mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version:
> a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP:
> [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace:
> #011// A_LOG
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace:
> #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal
> Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM
> objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node
> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0
> proc=00110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added
> signal handler for signal 17
> Mar 22 11:25:1
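
The ownership and write check asked about above, together with the /dev/shm question raised elsewhere in this thread, can be run in one pass. This is only a minimal sketch, assuming a POSIX shell. `report_dir` is a hypothetical helper name, not something shipped with Pacemaker, corosync, or libqb, and it should be run as the hacluster user so the result reflects what the daemons see. The test is coarse: it only says whether the current user may create files in the directory.

```shell
#!/bin/sh
# report_dir (hypothetical helper): show a directory's owner, group, and mode,
# then report whether the current user can create files in it.
report_dir() {
    dir=$1
    # Fall back to the numeric uid if the name cannot be resolved.
    user=$(id -un 2>/dev/null || id -u)
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 1
    fi
    ls -ld "$dir"
    # Creating a file needs both write and search (execute) permission.
    if [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "$dir: writable by $user"
    else
        echo "$dir: NOT writable by $user"
        return 1
    fi
}

# The two locations discussed in this thread.
for d in /var/run/crm /dev/shm; do
    report_dir "$d" || echo "check $d before starting pacemaker"
done
```

A "NOT writable" or "missing" line for either directory is worth fixing before looking further into the attrd and cib connection errors.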
[Pacemaker] issues when installing on pxe booted environment
Hello Folks,
We're trying to get a corosync/pacemaker instance going on a 4 node cluster that boots via pxe. There have been a number of state/file system issues, but those appear to be *mostly* taken care of thus far. We're running into an issue now where cib just isn't staying up with errors akin to the following (sorry for the lengthy dump, note the attrd and cib connection errors). Any ideas would be greatly appreciated:

Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312)
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00110312 (new)
Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to