On 2013-03-22 19:31, John White wrote:
> Hello Folks,
>       We're trying to get a corosync/pacemaker instance going on a 4-node 
> cluster that boots via PXE.  There have been a number of state/file system 
> issues, but those appear to be *mostly* taken care of thus far.  We're 
> running into an issue now where cib just isn't staying up with errors akin to 
> the following (sorry for the lengthy dump, note the attrd and cib connection 
> errors).  Any ideas would be greatly appreciated: 
> 
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG 
> parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
> /usr/lib64/heartbeat/attrd 
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
> is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting 
> to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
> connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
> /usr/lib64/heartbeat/pengine 
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
> instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

Does the "/var/run/crm" directory exist, is it owned by
hacluster:haclient, and is it writable by at least the hacluster user?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
> process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
> process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node 
> n0014.lustre now has process list: 00000000000000000000000000110312 (was 
> 00000000000000000000000000111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
> init_client_ipc_comms_nodispatch: Could not init comms on: 
> /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding 
> fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: 
> Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating 
> entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
> n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 
> is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: 
> Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng 
> mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed 
> active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
> a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: 
> [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
> #011// A_LOG   
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
> #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal 
> Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM 
> objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 
> proc=00000000000000000000000000110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added 
> signal handler for signal 17
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
> #011// A_CIB_START
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
> Attempting to talk on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
> Could not init comms on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
> to command channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
> Attempting to talk on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
> Could not init comms on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
> to callback channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
> to CIB failed: connection failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out 
> of the CIB Service
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate 
> content
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not 
> validate with <null>
> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization 
> completed successfully
> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 
> 'corosync'
> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting 
> to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not 
> connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the 
> cluster... terminating
> 
> 
> ----------------
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



