Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).

----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720
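For anyone following along, the two checks discussed in this thread (is /var/run/crm writable by the cluster user, and is /dev/shm present and writable) can be sketched as a small script. This is only an illustration: the function name check_dir and the probe file name are made up here, and the script assumes it is run as the hacluster user, as in the session quoted below.

```shell
#!/bin/sh
# Sketch only: report whether a directory exists and whether the current
# user can create a file in it. check_dir is a hypothetical helper name.
check_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
        return 1
    fi
    # Create and remove a scratch file, the same idea as "touch blah".
    probe="$dir/.write_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "$dir: writable by $(id -un)"
        return 0
    fi
    echo "$dir: NOT writable by $(id -un)"
    return 1
}

# Paths taken from the thread; adjust if your layout differs.
failed=0
for d in /var/run/crm /dev/shm; do
    check_dir "$d" || failed=1
done
echo "checks finished (failed=$failed)"
```

Running it as root will not tell you much, since root can write almost anywhere; something like `su -s /bin/sh hacluster` first gives a more meaningful result.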
On Mar 27, 2013, at 4:46 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> What about /dev/shm ?
> Libqb tries to create some shared memory in that location by default.
>
> On Thu, Mar 28, 2013 at 8:50 AM, John White <jwh...@lbl.gov> wrote:
>> Yup:
>> -bash-4.1$ cd /var/run/crm/
>> -bash-4.1$ ls
>> lost+found pcmk pengine st_callback st_command
>> -bash-4.1$ touch blah
>> -bash-4.1$ ls -l
>> total 16
>> -rw-r--r-- 1 hacluster haclient 0 Mar 27 14:50 blah
>> drwx------ 2 root root 16384 Mar 14 15:00 lost+found
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 pcmk
>> srwxrwxrwx 1 hacluster root 0 Mar 22 11:25 pengine
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_callback
>> srwxrwxrwx 1 root root 0 Mar 22 11:25 st_command
>> -bash-4.1$ ls -l /var/run/| grep crm
>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>> -bash-4.1$ whoami
>> hacluster
>> -bash-4.1$
>> ----------------
>> John White
>> HPC Systems Engineer
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50C-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>>
>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andr...@hastexo.com> wrote:
>>
>>> On 2013-03-22 19:31, John White wrote:
>>>> Hello Folks,
>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>> cluster that boots via pxe. There have been a number of state/file system
>>>> issues, but those appear to be *mostly* taken care of thus far. We're
>>>> running into an issue now where cib just isn't staying up with errors akin
>>>> to the following (sorry for the lengthy dump, note the attrd and cib
>>>> connection errors). Any ideas would be greatly appreciated:
>>>>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>>>
>>> That "/var/run/crm" directory is available and owned by
>>> hacluster.haclient ... and writable by at least the hacluster user?
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
>>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
>>>>
>>>>
>>>> ----------------
>>>> John White
>>>> HPC Systems Engineer
>>>> (510) 486-7307
>>>> One Cyclotron Rd, MS: 50C-3209C
>>>> Lawrence Berkeley National Lab
>>>> Berkeley, CA 94720
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
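The quoted dump mixes debug, info, and failure entries from several daemons, which makes the attrd and cib connection errors easy to miss. A minimal sketch for pulling out only the ERROR and CRIT entries follows. The function name show_failures is made up for illustration, and the log file path is whatever your syslog setup uses (it is not stated in this thread), so it is passed as an argument.

```shell
#!/bin/sh
# Sketch only: print the ERROR and CRIT entries from a syslog-style file
# in the format shown in this thread, e.g.
#   Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: ...
# show_failures is a hypothetical helper name.
show_failures() {
    grep -E '\]: (ERROR|CRIT): ' "$1"
}

# Run against a file only when one is given on the command line.
if [ "$#" -gt 0 ]; then
    show_failures "$1" || echo "no ERROR/CRIT entries found in $1"
fi
```

Applied to the dump in this thread, this would leave the CPG connection failures from attrd and cib, the attrd child exit reported by pacemakerd, and the two CIB validation errors.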