Another possibility to check: are you sure you are getting the same OMPI version on the backend nodes? When a job works on the local node but fails multi-node, the most common cause is that you are picking up a different OMPI version on the backend nodes due to path differences there.
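As an illustration only: on a real cluster you would gather the remote value with something like `ssh ipc3 'ompi_info --version'` (or an `srun`/`mpirun --pernode` wrapper); the version strings below are hard-coded placeholders so the comparison logic stands alone.

```shell
#!/bin/sh
# Hypothetical sketch of a per-node OMPI version check. The two values
# are hard-coded here; in practice they would come from `ompi_info`
# run locally and on a backend node.
local_ver="1.4.3"    # e.g. ompi_info --version | head -n 1
remote_ver="1.4.2"   # e.g. ssh ipc3 'ompi_info --version | head -n 1'

if [ "$local_ver" = "$remote_ver" ]; then
    echo "OMPI versions match: $local_ver"
else
    echo "OMPI version MISMATCH: local=$local_ver remote=$remote_ver"
fi
```

Checking `which mpirun` on each node the same way can also reveal whether a backend node's PATH resolves to a different installation.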
On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it
> appears to have been blocked. Compressed text output of the backtrace
> text is sufficient.
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> A detailed backtrace from a core dump may help us debug this. Would
>> you be willing to provide that information for us?
>>
>> Thanks,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>>
>>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>>
>>> Hi,
>>>
>>>> I just tried to reproduce the problem that you are experiencing and
>>>> was unable to.
>>>>
>>>> SLURM 2.1.15
>>>> Open MPI 1.4.3 configured with:
>>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>>
>>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same
>>> platform file (the only change was to re-enable btl-tcp).
>>>
>>> Unfortunately, the result is the same:
>>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>> salloc: Granted job allocation 145
>>>
>>> ======================== JOB MAP ========================
>>>
>>> Data for node: Name: eng-ipc4.{FQDN} Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 0
>>> Process OMPI jobid: [6932,1] Process rank: 1
>>> Process OMPI jobid: [6932,1] Process rank: 2
>>> Process OMPI jobid: [6932,1] Process rank: 3
>>> Process OMPI jobid: [6932,1] Process rank: 4
>>> Process OMPI jobid: [6932,1] Process rank: 5
>>> Process OMPI jobid: [6932,1] Process rank: 6
>>> Process OMPI jobid: [6932,1] Process rank: 7
>>>
>>> Data for node: Name: ipc3 Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 8
>>> Process OMPI jobid: [6932,1] Process rank: 9
>>> Process OMPI jobid: [6932,1] Process rank: 10
>>> Process OMPI jobid: [6932,1] Process rank: 11
>>> Process OMPI jobid: [6932,1] Process rank: 12
>>> Process OMPI jobid: [6932,1] Process rank: 13
>>> Process OMPI jobid: [6932,1] Process rank: 14
>>> Process OMPI jobid: [6932,1] Process rank: 15
>>>
>>> =============================================================
>>> [eng-ipc4:31754] *** Process received signal ***
>>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
>>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
>>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
>>> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
>>> [eng-ipc4:31754] [ 6]
>>> ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
>>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
>>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
>>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>>> [eng-ipc4:31754] *** End of error message ***
>>> salloc: Relinquishing job allocation 145
>>> salloc: Job allocation 145 has been revoked.
>>> zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>>
>>> I've anonymised the paths and domain, otherwise pasted verbatim. The only
>>> odd thing I notice is that the launching machine uses its full domain name,
>>> whereas the other machine is referred to by the short name. Despite the
>>> FQDN, the domain does not exist in the DNS (for historical reasons), but
>>> does exist in the /etc/hosts file.
>>>
>>> Any further clues would be appreciated. In case it may be relevant, core
>>> system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point
>>> of difference may be that our environment is tcp (ethernet) based whereas
>>> the LANL test environment is not?
>>>
>>> Michael
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
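For the backtrace Samuel asked for: one way to capture it from a crash like the one above is sketched below. This assumes `gdb` is available; the mpirun path and core file name are placeholders taken from the log, not verified values.

```shell
# Sketch: allow core dumps before reproducing the crash, then pull a
# full backtrace out of the resulting core file non-interactively.
ulimit -c unlimited   # permit the crashing mpirun to write a core file
ulimit -c             # sanity check: should print "unlimited"

# After reproducing the segfault (binary path and core name are placeholders):
#   gdb -batch -ex 'thread apply all bt full' ~/openmpi/bin/mpirun core.31754 > backtrace.txt
```

The resulting backtrace.txt can then be compressed and attached, which avoids the list's size block mentioned earlier in the thread.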