Re: [Pacemaker] issues when installing on pxe booted environment

2013-04-11 Thread John White
Ah, /dev/shm was owned by root:root and writable only by its owner.  Opening
it up seems to have kicked something the right way.  Thanks folks.
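
For anyone hitting the same symptom, here is a minimal sketch of the permission check. It assumes GNU stat and the conventional mode 1777 (world-writable plus sticky bit) for /dev/shm; the thread does not say which mode was actually applied, and the function name is only illustrative.

```shell
#!/bin/sh
# Report whether a directory has mode 1777 (world-writable with sticky bit),
# the conventional mode for /dev/shm. Assumes GNU stat (-c '%a').
shm_mode_ok() {
    dir="${1:-/dev/shm}"
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" = "1777" ]; then
        echo "$dir: mode $mode, ok"
    else
        echo "$dir: mode $mode, expected 1777"
        return 1
    fi
}

# Only reports; the suggested fix has to be run as root.
shm_mode_ok /dev/shm || echo "as root, try: chmod 1777 /dev/shm"
```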

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Apr 11, 2013, at 1:37 PM, John White  wrote:

> Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).
> 
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
> 
> On Mar 27, 2013, at 4:46 PM, Andrew Beekhof  wrote:
> 
>> What about /dev/shm ?
>> Libqb tries to create some shared memory in that location by default.
>> 
>> On Thu, Mar 28, 2013 at 8:50 AM, John White  wrote:
>>> Yup:
>>> -bash-4.1$ cd /var/run/crm/
>>> -bash-4.1$ ls
>>> lost+found  pcmk  pengine  st_callback  st_command
>>> -bash-4.1$ touch blah
>>> -bash-4.1$ ls -l
>>> total 16
>>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>>> drwx------ 2 root  root 16384 Mar 14 15:00 lost+found
>>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 pcmk
>>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_callback
>>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_command
>>> -bash-4.1$ ls -l /var/run/| grep crm
>>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>>> -bash-4.1$ whoami
>>> hacluster
>>> -bash-4.1$
>>> 
>>> John White
>>> HPC Systems Engineer
>>> (510) 486-7307
>>> One Cyclotron Rd, MS: 50C-3209C
>>> Lawrence Berkeley National Lab
>>> Berkeley, CA 94720
>>> 
>>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:
>>> 
>>>> On 2013-03-22 19:31, John White wrote:
>>>>> Hello Folks,
>>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>>> cluster that boots via pxe.  There have been a number of state/file
>>>>> system issues, but those appear to be *mostly* taken care of thus far.
>>>>> We're running into an issue now where cib just isn't staying up with
>>>>> errors akin to the following (sorry for the lengthy dump, note the attrd
>>>>> and cib connection errors).  Any ideas would be greatly appreciated:
>>>>>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating
>>>>> RNG parser context
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked:
>>>>> /usr/lib64/heartbeat/attrd
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed
>>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster
>>>>> type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect:
>>>>> Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could
>>>>> not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection
>>>>> active
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute
>>>>> updates
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked:
>>>>> /usr/lib64/heartbeat/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker:
>>>>> Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old
>>>>> instances of pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>>>> /var/run/crm/pengine
>>>>
>>>> That "/var/run/crm" directory is available and owned by
>>>> hacluster.haclient ... and writable by at least the hacluster user?
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child
>>>>> process attrd exited (pid=25841, rc=100)
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child
>>>>> process attrd no longer wishes to be respawned
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes:
>>>>> Node n0014.lustre now has process list: 00110312
>>>>> (was 00111312)
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>>>> /var/run/crm/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection:
>>>>> Adding fd=4 to mainloop
>>>>> Mar 22 11:25:18 n0014 s

Re: [Pacemaker] issues when installing on pxe booted environment

2013-04-11 Thread John White
Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Mar 27, 2013, at 4:46 PM, Andrew Beekhof  wrote:

> What about /dev/shm ?
> Libqb tries to create some shared memory in that location by default.
> 
> On Thu, Mar 28, 2013 at 8:50 AM, John White  wrote:
>> Yup:
>> -bash-4.1$ cd /var/run/crm/
>> -bash-4.1$ ls
>> lost+found  pcmk  pengine  st_callback  st_command
>> -bash-4.1$ touch blah
>> -bash-4.1$ ls -l
>> total 16
>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>> drwx------ 2 root  root 16384 Mar 14 15:00 lost+found
>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 pcmk
>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_callback
>> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_command
>> -bash-4.1$ ls -l /var/run/| grep crm
>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>> -bash-4.1$ whoami
>> hacluster
>> -bash-4.1$
>> 
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>> 
>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:
>> 
>>> On 2013-03-22 19:31, John White wrote:
>>>> Hello Folks,
>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>> cluster that boots via pxe.  There have been a number of state/file system
>>>> issues, but those appear to be *mostly* taken care of thus far.  We're
>>>> running into an issue now where cib just isn't staying up with errors akin
>>>> to the following (sorry for the lengthy dump, note the attrd and cib
>>>> connection errors).  Any ideas would be greatly appreciated:
>>>>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating
>>>> RNG parser context
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked:
>>>> /usr/lib64/heartbeat/attrd
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed
>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type
>>>> is: 'corosync'
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect:
>>>> Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could
>>>> not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute
>>>> updates
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked:
>>>> /usr/lib64/heartbeat/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed
>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old
>>>> instances of pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>>> /var/run/crm/pengine
>>> 
>>> That "/var/run/crm" directory is available and owned by
>>> hacluster.haclient ... and writable by at least the hacluster user?
>>> 
>>> Regards,
>>> Andreas
>>> 
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>> 
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child
>>>> process attrd exited (pid=25841, rc=100)
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child
>>>> process attrd no longer wishes to be respawned
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes:
>>>> Node n0014.lustre now has process list: 00110312
>>>> (was 00111312)
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>> init_client_ipc_comms_nodispatch: Could not init comms on:
>>>> /var/run/crm/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection:
>>>> Adding fd=4 to mainloop
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once:
>>>> Connection to 'corosync': established
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating
>>>> entry for node n0014.lustre/247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node
>>>> n0014.lustre now has id: 247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node
>>>> 247988234 is now known as n0014

Re: [Pacemaker] issues when installing on pxe booted environment

2013-04-07 Thread Andrew Beekhof

On 29/03/2013, at 11:06 PM, Rainer Brestan  wrote:

> Phew, I didn't expect the discussion to go in this direction.
>
> Corosync is not the only software that needs shared memory, so providing it
> is part of the OS startup, not part of Corosync or Pacemaker.
> And yes, it is too late to mount it during Pacemaker startup.
> Even most embedded Linux installations provide shared memory, just with a
> smaller size.
> So there is no need to add anything directly to the Corosync or Pacemaker
> startup.

I guess my point is that if there is something the cluster needs in order to
function, we should either error out in an obvious manner or take steps to make
it available so that everything JustWorks.  I'd prefer the latter personally.

>  
> My answer about shared memory is valid and necessary only for anaconda-based
> installations.
> When you install a Red Hat system, the installation is done by anaconda with
> a running miniroot (install.img) as a RAM disk.
> Within this miniroot shared memory is enabled, but they (Red Hat) missed the
> mount.
> If somebody wants to correct it, Red Hat (more specifically, the maintainer
> of install.img) is the right address for this gap.

Has anyone filed a bug in their bugzilla for this?

>  
> World read access to CRM_CORE_DIR should be enough if the core pattern is
> explicitly set to another directory (not tested yet).
> When resource agents are called, CRM_CORE_DIR is the current working
> directory (tested by echoing all environment variables to a file inside a
> monitor call).
> If this directory is not readable after a resource agent switches user,
> stupid things can happen. In any case, switching user without setting up the
> environment of the new user is not a good way to write resource agents. So
> this is not a fault of Pacemaker but of loosely programmed resource agents.
> When the resource agent or any of its children produces a core dump, it goes
> either to the current directory or to the directory specified by the kernel
> core pattern.
> If the core goes to the current directory and is written by a switched user
> who is not a member of pcmk_gid, the core gets lost.

Good point.  I guess there is no avoiding 777 then.

>  
> As I am not sure that every resource agent is written with a proper
> switch-user environment, and I am not rewriting my core pattern, enabling
> world write on CRM_CORE_DIR was the easy workaround.
>  
> Rainer
> Sent: Friday, 29 March 2013 at 09:51
> From: "Jacek Konieczny"
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] issues when installing on pxe booted environment
> On Fri, 29 Mar 2013 11:37:37 +1100
> Andrew Beekhof  wrote:
> 
> > On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
> >  wrote:
> > > Hi John,
> > > to get Corosync/Pacemaker running during anaconda installation, i
> > > have created a configuration RPM package which does a few actions
> > > before starting Corosync and Pacemaker.
> > >
> > > An excerpt of the post install of this RPM.
> > > # mount /dev/shm if not already existing, otherwise openais cannot work
> > > if [ ! -d /dev/shm ]; then
> > >     mkdir /dev/shm
> > >     mount /dev/shm
> > > fi
> >
> > Perhaps mention this to the corosync guys, it should probably go into
> > their init script.
> 
> I don't think so. It is just a part of modern Linux system environment.
> corosync is not supposed to mount the root filesystem or /proc –
> mounting /dev/shm is not its responsibility either.
> 
> BTW The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
> Should this be added there by the corosync init script too?
> 
> Greets,
> Jacek
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-29 Thread Rainer Brestan
Phew, I didn't expect the discussion to go in this direction.

Corosync is not the only software that needs shared memory, so providing it is part of the OS startup, not part of Corosync or Pacemaker.

And yes, it is too late to mount it during Pacemaker startup.

Even most embedded Linux installations provide shared memory, just with a smaller size.

So there is no need to add anything directly to the Corosync or Pacemaker startup.

My answer about shared memory is valid and necessary only for anaconda-based installations.

When you install a Red Hat system, the installation is done by anaconda with a running miniroot (install.img) as a RAM disk.

Within this miniroot shared memory is enabled, but they (Red Hat) missed the mount.

If somebody wants to correct it, Red Hat (more specifically, the maintainer of install.img) is the right address for this gap.

World read access to CRM_CORE_DIR should be enough if the core pattern is explicitly set to another directory (not tested yet).

When resource agents are called, CRM_CORE_DIR is the current working directory (tested by echoing all environment variables to a file inside a monitor call).

If this directory is not readable after a resource agent switches user, stupid things can happen. In any case, switching user without setting up the environment of the new user is not a good way to write resource agents. So this is not a fault of Pacemaker but of loosely programmed resource agents.

When the resource agent or any of its children produces a core dump, it goes either to the current directory or to the directory specified by the kernel core pattern.

If the core goes to the current directory and is written by a switched user who is not a member of pcmk_gid, the core gets lost.
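
To see which of the two cases applies on a given node, the kernel setting can be inspected directly. The following is a rough sketch based on the rules in core(5): a leading "|" pipes the dump to a helper program, a leading "/" is an absolute path, and anything else is relative to the process's current directory. The function name is made up for illustration.

```shell
#!/bin/sh
# Classify a kernel core_pattern value (rules as described in core(5)):
#   leading "|"   -> the dump is piped to a helper program
#   leading "/"   -> the dump is written to that absolute path
#   anything else -> the dump is written relative to the process's cwd
core_target() {
    case "$1" in
        "|"*) echo "pipe" ;;
        /*)   echo "absolute" ;;
        *)    echo "cwd" ;;
    esac
}

# Fall back to the kernel default "core" if the proc file is unreadable.
pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null) || pattern="core"
echo "core_pattern '$pattern' -> $(core_target "$pattern")"
```

Only the "cwd" case makes the permissions of CRM_CORE_DIR matter for where the core lands.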

 

As I am not sure that every resource agent is written with a proper switch-user environment, and I am not rewriting my core pattern, enabling world write on CRM_CORE_DIR was the easy workaround.

 

Rainer


Sent: Friday, 29 March 2013 at 09:51
From: "Jacek Konieczny"
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] issues when installing on pxe booted environment

On Fri, 29 Mar 2013 11:37:37 +1100
Andrew Beekhof  wrote:

> On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
>  wrote:
> > Hi John,
> > to get Corosync/Pacemaker running during anaconda installation, i
> > have created a configuration RPM package which does a few actions
> > before starting Corosync and Pacemaker.
> >
> > An excerpt of the post install of this RPM.
> > # mount /dev/shm if not already existing, otherwise openais cannot work
> > if [ ! -d /dev/shm ]; then
> >     mkdir /dev/shm
> >     mount /dev/shm
> > fi
>
> Perhaps mention this to the corosync guys, it should probably go into
> their init script.

I don't think so. It is just a part of modern Linux system environment.
corosync is not supposed to mount the root filesystem or /proc –
mounting /dev/shm is not its responsibility either.

BTW The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
Should this be added there by the corosync init script too?

Greets,
Jacek







Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-29 Thread Jacek Konieczny
On Fri, 29 Mar 2013 11:37:37 +1100
Andrew Beekhof  wrote:

> On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan
>  wrote:
> > Hi John,
> > to get Corosync/Pacemaker running during anaconda installation, i
> > have created a configuration RPM package which does a few actions
> > before starting Corosync and Pacemaker.
> >
> > An excerpt of the post install of this RPM.
> > # mount /dev/shm if not already existing, otherwise openais cannot work
> > if [ ! -d /dev/shm ]; then
> >     mkdir /dev/shm
> >     mount /dev/shm
> > fi
> 
> Perhaps mention this to the corosync guys, it should probably go into
> their init script.

I don't think so. It is just a part of modern Linux system environment.
corosync is not supposed to mount the root filesystem or /proc –
mounting /dev/shm is not its responsibility either.

BTW  The excerpt above assumes there is a /dev/shm entry in /etc/fstab.
Should this be added there by the corosync init script too?
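
To illustrate the fstab point: a variant that does not depend on an /etc/fstab entry would name the filesystem type and source explicitly, and would test for an actual mount rather than only for the directory. This is only a sketch, not what Rainer's RPM does; the function names, the DRY_RUN switch and the mode=1777 option are assumptions for illustration.

```shell
#!/bin/sh
# Mount a tmpfs on a directory without relying on an /etc/fstab entry.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ensure_shm() {
    dir="${1:-/dev/shm}"
    # Test for an actual mount, not just for the directory.
    if mountpoint -q "$dir" 2>/dev/null; then
        echo "$dir already mounted"
        return 0
    fi
    [ -d "$dir" ] || run mkdir -p "$dir" || return 1
    run mount -t tmpfs -o mode=1777 tmpfs "$dir"
}

# Print what would be done; unset DRY_RUN and run as root to apply it.
DRY_RUN=1
ensure_shm /dev/shm
```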

Greets,
Jacek



Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-28 Thread Andrew Beekhof
On Thu, Mar 28, 2013 at 10:43 PM, Rainer Brestan  wrote:
> Hi John,
> to get Corosync/Pacemaker running during anaconda installation, i have
> created a configuration RPM package which does a few actions before starting
> Corosync and Pacemaker.
>
> An excerpt of the post install of this RPM.
> # mount /dev/shm if not already existing, otherwise openais cannot work
> if [ ! -d /dev/shm ]; then
>     mkdir /dev/shm
>     mount /dev/shm
> fi

Perhaps mention this to the corosync guys, it should probably go into
their init script.
I'd put it in pacemaker but that's likely too late.

> # resource agents might run as different user
> chmod -R go+rwx /var/lib/heartbeat/cores

I'm about to change the permissions to 775 for this.  Would that be sufficient?

build_path(CRM_CORE_DIR, 0755);
mcp_chown(CRM_CORE_DIR, pcmk_uid, pcmk_gid);
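
As a quick way to reason about the question, here is a small sketch that decodes the "other" bits of an octal mode. It assumes the process writing the core is neither the owner nor a member of the owning group (a resource agent that has switched user, for instance); the function name is illustrative only.

```shell
#!/bin/sh
# Report whether "other" users (neither owner nor group member) may create
# files in a directory with the given octal mode. Creating a file needs both
# the write (2) and the search/execute (1) bit in the "other" digit.
others_can_create() {
    bits=$(( 0$1 & 3 ))
    if [ "$bits" -eq 3 ]; then echo yes; else echo no; fi
}

for m in 755 775 777; do
    echo "$m: $(others_can_create "$m")"
done
```

Both 755 and 775 leave such a user unable to create a file in the directory; only the world-write bit changes that.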


>
> Rainer
>
> Sent: Thursday, 28 March 2013 at 00:46
> From: "Andrew Beekhof"
> To: "The Pacemaker cluster resource manager"
> Subject: Re: [Pacemaker] issues when installing on pxe booted environment
> What about /dev/shm ?
> Libqb tries to create some shared memory in that location by default.
>
> On Thu, Mar 28, 2013 at 8:50 AM, John White  wrote:
>> Yup:
>> -bash-4.1$ cd /var/run/crm/
>> -bash-4.1$ ls
>> lost+found pcmk pengine st_callback st_command
>> -bash-4.1$ touch blah
>> -bash-4.1$ ls -l
>> total 16
>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>> drwx------ 2 root root 16384 Mar 14 15:00 lost+found
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>> -bash-4.1$ ls -l /var/run/| grep crm
>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>> -bash-4.1$ whoami
>> hacluster
>> -bash-4.1$
>> 
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>>
>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:
>>
>>> On 2013-03-22 19:31, John White wrote:
>>>> Hello Folks,
>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>> cluster that boots via pxe. There have been a number of state/file system
>>>> issues, but those appear to be *mostly* taken care of thus far. We're
>>>> running into an issue now where cib just isn't staying up with errors akin
>>>> to the following (sorry for the lengthy dump, note the attrd and cib
>>>> connection errors). Any ideas would be greatly appreciated:
>>>>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng:
>>>> Creating RNG parser context
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked:
>>>> /usr/lib64/heartbeat/attrd
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed
>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster
>>>> type is: 'corosync'
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect:
>>>> Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could
>>>> not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection
>>>> active
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute
>>>> updates
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked:
>>>> /usr/lib64/heartbeat/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker:
>>>> Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old
>>>> instances of pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug:
>>>> init_client_ipc_comms_nodispatch: Attempting to talk on:
>>>> /var/run/crm/pengine
>>>
>>> That "/var/run/crm" directory is available and owned by
>>> hacluster.haclient ... and wri

Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-28 Thread Rainer Brestan

Hi John,

To get Corosync/Pacemaker running during anaconda installation, I have created a configuration RPM package which performs a few actions before starting Corosync and Pacemaker.

An excerpt of the post-install script of this RPM:


# mount /dev/shm if not already existing, otherwise openais cannot work
if [ ! -d /dev/shm ]; then
    mkdir /dev/shm
    mount /dev/shm
fi


# resource agents might run as different user
chmod -R go+rwx /var/lib/heartbeat/cores

 

Rainer

 



Sent: Thursday, 28 March 2013 at 00:46
From: "Andrew Beekhof"
To: "The Pacemaker cluster resource manager"
Subject: Re: [Pacemaker] issues when installing on pxe booted environment

What about /dev/shm ?
Libqb tries to create some shared memory in that location by default.

On Thu, Mar 28, 2013 at 8:50 AM, John White  wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found pcmk pengine st_callback st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
> drwx------ 2 root root 16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
> 
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>> We're trying to get a corosync/pacemaker instance going on a 4 node cluster that boots via pxe. There have been a number of state/file system issues, but those appear to be *mostly* taken care of thus far. We're running into an issue now where cib just isn't staying up with errors akin to the following (sorry for the lengthy dump, note the attrd and cib connection errors). Any ideas would be greatly appreciated:
>>>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>>
>> That "/var/run/crm" directory is available and owned by
>> hacluster.haclient ... and writable by at least the hacluster user?
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312)
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>> Mar 22 11:25:18 n0014 s

Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-27 Thread Andrew Beekhof
What about /dev/shm ?
Libqb tries to create some shared memory in that location by default.
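
A quick way to test this, sketched under the assumption that `mountpoint` from util-linux is available. Run it as the user the daemons run as (hacluster in this thread); the function name is illustrative.

```shell
#!/bin/sh
# Check that a directory exists, has something mounted on it, and is
# writable by the current user. Defaults to /dev/shm.
check_shm() {
    dir="${1:-/dev/shm}"
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 1
    fi
    if ! mountpoint -q "$dir"; then
        echo "$dir: exists but nothing is mounted on it"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "$dir: mounted but not writable by $(id -un)"
        return 1
    fi
    echo "$dir: ok"
}

check_shm /dev/shm || true
```

Note that the write test always succeeds for root, so running it as root says nothing about what the cluster user can do.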

On Thu, Mar 28, 2013 at 8:50 AM, John White  wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found  pcmk  pengine  st_callback  st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
> drwx------ 2 root  root 16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root  root 0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
> 
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>>  We're trying to get a corosync/pacemaker instance going on a 4 node 
>>> cluster that boots via pxe.  There have been a number of state/file system 
>>> issues, but those appear to be *mostly* taken care of thus far.  We're 
>>> running into an issue now where cib just isn't staying up with errors akin 
>>> to the following (sorry for the lengthy dump, note the attrd and cib 
>>> connection errors).  Any ideas would be greatly appreciated:
>>>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating 
>>> RNG parser context
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
>>> /usr/lib64/heartbeat/attrd
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
>>> is: 'corosync'

Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-27 Thread John White
Yup:
-bash-4.1$ cd /var/run/crm/
-bash-4.1$ ls
lost+found  pcmk  pengine  st_callback  st_command
-bash-4.1$ touch blah
-bash-4.1$ ls -l
total 16
-rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
drwx------ 2 root  root 16384 Mar 14 15:00 lost+found
srwxrwxrwx 1 root  root 0 Mar 22 11:25 pcmk
srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_callback
srwxrwxrwx 1 root  root 0 Mar 22 11:25 st_command
-bash-4.1$ ls -l /var/run/| grep crm
drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
-bash-4.1$ whoami
hacluster
-bash-4.1$ 
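
For anyone repeating this check on the other nodes, here is a minimal sketch of the same ownership and writability test as a shell function. It assumes GNU `stat` (for the `-c` format option); the name `check_runtime_dir` is made up for illustration and is not part of Pacemaker or corosync.

```sh
#!/bin/sh
# check_runtime_dir DIR
# Hypothetical helper (not part of Pacemaker): prints the owner, group and
# mode of DIR, then reports whether the invoking user can write to it.
# Returns 0 only if DIR exists and is writable by the current user.
check_runtime_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "$dir: missing or not a directory"
        return 1
    fi
    # GNU stat format: %U owner name, %G group name, %a octal permissions
    echo "$dir: owner=$(stat -c '%U' "$dir") group=$(stat -c '%G' "$dir") mode=$(stat -c '%a' "$dir")"
    if [ -w "$dir" ]; then
        echo "$dir: writable by uid $(id -u)"
        return 0
    fi
    echo "$dir: NOT writable by uid $(id -u)"
    return 1
}
```

Run it as the hacluster user, for example `check_runtime_dir /var/run/crm`. A non-zero exit status means the directory is missing or not writable for that user; the same function can be pointed at any other runtime path the daemons need.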

John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Mar 25, 2013, at 4:21 PM, Andreas Kurz  wrote:

> That "/var/run/crm" directory is available and owned by
> hacluster.haclient ... and writable by at least the hacluster user?
> 

Re: [Pacemaker] issues when installing on pxe booted environment

2013-03-25 Thread Andreas Kurz
On 2013-03-22 19:31, John White wrote:
> Hello Folks,
>   We're trying to get a corosync/pacemaker instance going on a 4 node 
> cluster that boots via pxe.  There have been a number of state/file system 
> issues, but those appear to be *mostly* taken care of thus far.  We're 
> running into an issue now where cib just isn't staying up with errors akin to 
> the following (sorry for the lengthy dump, note the attrd and cib connection 
> errors).  Any ideas would be greatly appreciated: 
> 
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG 
> parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
> /usr/lib64/heartbeat/attrd 
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
> is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting 
> to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
> connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
> /usr/lib64/heartbeat/pengine 
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
> instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

Is the "/var/run/crm" directory available, owned by hacluster.haclient,
and writable by at least the hacluster user?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


[Pacemaker] issues when installing on pxe booted environment

2013-03-22 Thread John White
Hello Folks,
We're trying to get a corosync/pacemaker instance going on a 4-node 
cluster that boots via PXE.  There have been a number of state/file system 
issues, but those appear to be *mostly* taken care of thus far.  We're now 
running into an issue where cib just isn't staying up, with errors akin to the 
following (sorry for the lengthy dump; note the attrd and cib connection 
errors).  Any ideas would be greatly appreciated: 

Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00110312 (was 00111312)
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00110312 (new)
Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
to