Another possibility to check: are you sure you are getting the same OMPI 
version on the backend nodes? When I see it work on the local node but fail 
multi-node, the most common problem is that you are picking up a different 
OMPI version due to path differences on the backend nodes.
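
A quick way to check is to launch something trivial across the allocation and 
compare what each node reports, e.g. (a sketch; --pernode is a standard mpirun 
option, but adjust the node count and paths for your site):

  salloc -n2 mpirun --pernode which orted
  salloc -n2 mpirun --pernode ompi_info | grep "Open MPI:"

If the orted path or the reported version differs between the nodes, that is 
the likely culprit.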


On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
> 
> You may have tried to send some debug information to the list, but it appears 
> to have been blocked.  Compressed text output of the backtrace is sufficient.
> 
> Thanks,
> 
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
> On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
> 
>> Hi,
>> 
>> A detailed backtrace from a core dump may help us debug this.  Would you be 
>> willing to provide that information for us?
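>> 
>> For example, something along these lines on the node where mpirun crashes 
>> (a sketch: it assumes gdb is installed and core dumps are enabled; the 
>> mpirun path and core file name are illustrative):
>> 
>>   ulimit -c unlimited                    # allow a core file to be written
>>   salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>   gdb ~/../openmpi/bin/mpirun core       # core file name/location may vary
>>   (gdb) thread apply all bt full         # detailed backtrace of all threads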
>> 
>> Thanks,
>> 
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>> 
>> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>> 
>>> 
>>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>> 
>>>> Hi,
>>> 
>>>> I just tried to reproduce the problem that you are experiencing and was 
>>>> unable to.
>>>> 
>>>> SLURM 2.1.15
>>>> Open MPI 1.4.3 configured with: 
>>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>> 
>>> I compiled Open MPI 1.4.3 (vanilla, from the source tarball) with the same 
>>> platform file; the only change was to re-enable btl-tcp.
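>>> 
>>> For reference, the build was essentially the following (install prefix is 
>>> illustrative; the btl-tcp change was a hand edit to the platform file, not 
>>> a configure option):
>>> 
>>>   tar xjf openmpi-1.4.3.tar.bz2 && cd openmpi-1.4.3
>>>   # re-enabled btl-tcp in the platform file (edit not shown)
>>>   ./configure --with-platform=contrib/platform/lanl/tlcc/debug-nopanasas \
>>>       --prefix=$HOME/openmpi
>>>   make all install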
>>> 
>>> Unfortunately, the result is the same:
>>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>> salloc: Granted job allocation 145
>>> 
>>> ========================   JOB MAP   ========================
>>> 
>>> Data for node: Name: eng-ipc4.{FQDN}                Num procs: 8
>>>     Process OMPI jobid: [6932,1] Process rank: 0
>>>     Process OMPI jobid: [6932,1] Process rank: 1
>>>     Process OMPI jobid: [6932,1] Process rank: 2
>>>     Process OMPI jobid: [6932,1] Process rank: 3
>>>     Process OMPI jobid: [6932,1] Process rank: 4
>>>     Process OMPI jobid: [6932,1] Process rank: 5
>>>     Process OMPI jobid: [6932,1] Process rank: 6
>>>     Process OMPI jobid: [6932,1] Process rank: 7
>>> 
>>> Data for node: Name: ipc3   Num procs: 8
>>>     Process OMPI jobid: [6932,1] Process rank: 8
>>>     Process OMPI jobid: [6932,1] Process rank: 9
>>>     Process OMPI jobid: [6932,1] Process rank: 10
>>>     Process OMPI jobid: [6932,1] Process rank: 11
>>>     Process OMPI jobid: [6932,1] Process rank: 12
>>>     Process OMPI jobid: [6932,1] Process rank: 13
>>>     Process OMPI jobid: [6932,1] Process rank: 14
>>>     Process OMPI jobid: [6932,1] Process rank: 15
>>> 
>>> =============================================================
>>> [eng-ipc4:31754] *** Process received signal ***
>>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) 
>>> [0x7f81cf262869]
>>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) 
>>> [0x7f81cef93338]
>>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) 
>>> [0x7f81cef9397e]
>>> [eng-ipc4:31754] [ 4] 
>>> ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) 
>>> [0x7f81cef87916]
>>> [eng-ipc4:31754] [ 6] 
>>> ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) 
>>> [0x7f81cf262e20]
>>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) 
>>> [0x7f81cf267ed7]
>>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) 
>>> [0x7f81ce14bc4d]
>>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>>> [eng-ipc4:31754] *** End of error message ***
>>> salloc: Relinquishing job allocation 145
>>> salloc: Job allocation 145 has been revoked.
>>> zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map 
>>> ~/ServerAdmin/mpi
>>> 
>>> I've anonymised the paths and domain; otherwise it is pasted verbatim.  The 
>>> only odd thing I notice is that the launching machine is referred to by its 
>>> full domain name, whereas the other machine is referred to by its short 
>>> name.  Despite the FQDN, the domain does not exist in DNS (for historical 
>>> reasons), but it does exist in /etc/hosts.
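>>> 
>>> To illustrate what I mean, the relevant checks on each node would be along 
>>> these lines (hostnames as in the job map above):
>>> 
>>>   hostname && hostname -f          # short vs. fully-qualified node name
>>>   getent hosts eng-ipc4 ipc3       # what the resolver returns, /etc/hosts included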
>>> 
>>> Any further clues would be appreciated.  In case it is relevant, core system 
>>> versions are glibc 2.11, gcc 4.4.3, and kernel 2.6.32.  One other point of 
>>> difference may be that our environment is TCP (Ethernet) based, whereas the 
>>> LANL test environment is not.
>>> 
>>> Michael
>>> 
>>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

