On 2013-03-22 19:31, John White wrote:
> Hello Folks,
> We're trying to get a corosync/pacemaker instance going on a 4 node
> cluster that boots via pxe. There have been a number of state/file system
> issues, but those appear to be *mostly* taken care of thus far. We're
> running into an issue now where cib just isn't staying up with errors akin to
> the following (sorry for the lengthy dump, note the attrd and cib connection
> errors). Any ideas would be greatly appreciated:
>
> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine

Is that "/var/run/crm" directory available, owned by hacluster.haclient, and writable by at least the hacluster user?

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
>
>
> ----------------
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
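
The question about "/var/run/crm" raised in the reply can be checked on each node with a few lines of Python. This is only a rough sketch: the path and the hacluster/haclient names are taken from this thread, while the helper names owner_and_group and check_ipc_dir are made up for illustration. It looks only at existence, owner, group, and the owner write bit, and is not a full diagnosis of the CPG connection errors in the log.

```python
import grp
import os
import pwd
import stat


def owner_and_group(path):
    """Return (owner_name, group_name) for path, falling back to numeric ids."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return owner, group


def check_ipc_dir(path, user, group):
    """Return a list of problems with path; an empty list means none found."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    owner, grp_name = owner_and_group(path)
    if owner != user:
        problems.append(f"{path} owner is {owner}, expected {user}")
    if grp_name != group:
        problems.append(f"{path} group is {grp_name}, expected {group}")
    # Only the owner write bit is checked; ACLs and read-only mounts are not.
    if not os.stat(path).st_mode & stat.S_IWUSR:
        problems.append(f"{path} is not writable by its owner")
    return problems


if __name__ == "__main__":
    # Path and names as mentioned in the thread; adjust for your install.
    for problem in check_ipc_dir("/var/run/crm", "hacluster", "haclient"):
        print(problem)
```

Run as root on a node, it prints one line per problem and nothing if the directory passes these checks. If /var/run is rebuilt on every boot of the PXE image (an assumption, not something stated in the thread), a missing directory would be the first thing to rule out.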