Can anyone guess what the problem is here?  I was under the impression that 
OpenMPI (1.4.4) would create its shared-memory backing file under /tmp 
whenever orte_tmpdir_base is not set.

Well, there IS a /tmp and yet it appears that OpenMPI has chosen to use 
/dev/shm.  Why?
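
One thing worth ruling out here, as a sketch rather than a confirmed explanation: the snippet below assumes that, when orte_tmpdir_base is unset, the temp base is taken from the TMPDIR, TEMP and TMP environment variables (in that order) before falling back to /tmp. That lookup order is an assumption and should be checked against the 1.4.4 source. guess_tmp_base is a hypothetical helper, not an Open MPI function.

```python
import os

# Assumed lookup order; not confirmed against the Open MPI 1.4.4 source.
CANDIDATE_VARS = ("TMPDIR", "TEMP", "TMP")


def guess_tmp_base(env, default="/tmp"):
    """Return (source, path) for the first candidate variable present in env."""
    for name in CANDIDATE_VARS:
        if name in env:
            return name, env[name]
    # None of the candidate variables is set, so use the default.
    return "default", default


if __name__ == "__main__":
    source, path = guess_tmp_base(os.environ)
    print(f"temp base would come from {source}: {path}")
```

Running this on a compute node (not only on the login node) would show whether the environment there points at /dev/shm.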

And, next question, why doesn't it work?  Here are the oddities of this cluster:

-    the cluster is 'diskless'

-    /tmp is an NFS mount

-    /dev/shm is 12 GB and has 755 permissions

Filesystem            Size  Used Avail Use% Mounted on
tmpfs                  12G  164K   12G   1% /dev/shm

% ls -l output:
drwxr-xr-x  2 root root         40 Oct 28 09:14 shm
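
The permissions above can be checked against the classic Unix rule that creating an entry in a directory needs write and execute permission on it. The sketch below is a simplified model (it ignores ACLs, capabilities and read-only mounts), and can_create_entry is a hypothetical helper name. For comparison, /dev/shm and /tmp are usually mode 1777 on Linux.

```python
import os
import stat


def can_create_entry(mode, owner_uid, owner_gid, uid, gids):
    """Simplified Unix check: may this user create an entry in the directory?

    Only the permission class that applies to the user is consulted
    (owner, then group, then other). ACLs and capabilities are ignored.
    """
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif owner_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (mode & need) == need


if __name__ == "__main__":
    gids = set(os.getgroups()) | {os.getegid()}
    for path in ("/dev/shm", "/tmp"):
        if os.path.isdir(path):
            st = os.stat(path)
            ok = can_create_entry(st.st_mode, st.st_uid, st.st_gid,
                                  os.geteuid(), gids)
            print(f"{path}: mode {stat.S_IMODE(st.st_mode):o}, can create: {ok}")
```

Under this model, a root-owned directory with mode 755 gives a non-root user only read and execute access, so a mkdir inside it would be refused.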


The error message below suggests that OpenMPI (1.4.4) has somehow 
automatically chosen /dev/shm as the base for its session directory and then 
cannot create a sub-directory there.

Thanks for whatever help you can offer,

Ed


[e8315:02942] opal_os_dirpath_create: Error: Unable to create the sub-directory 
(/dev/shm/openmpi-sessions-estenfte@e8315_0) of 
(/dev/shm/openmpi-sessions-estenfte@e8315_0/8474/0/1), mkdir failed [1]
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at 
line 106
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at 
line 399
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 206
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at 
line 136
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[e8315:02942] [[8474,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at 
line 325


