On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

> I just tried to reproduce the problem that you are experiencing and was 
> unable to.
> 
> SLURM 2.1.15
> Open MPI 1.4.3 configured with: 
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I compiled Open MPI 1.4.3 (vanilla, from the source tarball) with the same platform 
file (the only change was to re-enable btl-tcp).

Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

 ========================   JOB MAP   ========================

 Data for node: Name: eng-ipc4.{FQDN}           Num procs: 8
        Process OMPI jobid: [6932,1] Process rank: 0
        Process OMPI jobid: [6932,1] Process rank: 1
        Process OMPI jobid: [6932,1] Process rank: 2
        Process OMPI jobid: [6932,1] Process rank: 3
        Process OMPI jobid: [6932,1] Process rank: 4
        Process OMPI jobid: [6932,1] Process rank: 5
        Process OMPI jobid: [6932,1] Process rank: 6
        Process OMPI jobid: [6932,1] Process rank: 7

 Data for node: Name: ipc3      Num procs: 8
        Process OMPI jobid: [6932,1] Process rank: 8
        Process OMPI jobid: [6932,1] Process rank: 9
        Process OMPI jobid: [6932,1] Process rank: 10
        Process OMPI jobid: [6932,1] Process rank: 11
        Process OMPI jobid: [6932,1] Process rank: 12
        Process OMPI jobid: [6932,1] Process rank: 13
        Process OMPI jobid: [6932,1] Process rank: 14
        Process OMPI jobid: [6932,1] Process rank: 15

 =============================================================
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) 
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) 
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) 
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) 
[0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) 
[0x7f81cef87916]
[eng-ipc4:31754] [ 6] 
~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) 
[0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) 
[0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map 
~/ServerAdmin/mpi

I've anonymised the paths and domain; otherwise everything is pasted verbatim.  The 
only odd thing I notice is that the launching machine appears under its full domain 
name, whereas the other machine is referred to by its short name.  Despite the FQDN, 
the domain does not exist in DNS (for historical reasons), but it does exist in the 
/etc/hosts file.
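In case it helps narrow things down, here is the kind of resolver check I can run on both nodes. This is just a generic sketch (not part of Open MPI's tooling): getent consults /etc/hosts and DNS in the order given by /etc/nsswitch.conf, so it shows exactly what each daemon sees when it looks the names up.

```shell
# Compare how the resolver sees the short name vs. the FQDN of this host.
# Run the same loop on each node (e.g. eng-ipc4 and ipc3).
for h in "$(hostname -s)" "$(hostname -f)"; do
    printf '%s => ' "$h"
    getent hosts "$h" || echo "(no entry)"
done
```

If the FQDN resolves on one node but not the other, that asymmetry would match the mixed naming in the job map above.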

Any further clues would be appreciated.  In case it is relevant, core system 
versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One other point of 
difference may be that our environment is TCP (Ethernet) based, whereas the LANL 
test environment is not?

Michael
