It's gigabit attached; "pathscale" is there simply to indicate that OMPI was compiled with EKOPath.
- Barry

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Galen Shipman
Sent: 19 January 2007 01:56
To: Open MPI Users
Cc: pak....@sun.com
Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and DLPOLY[Scanned]

Are you using -mca pml cm for pathscale, or are you using openib?

- Galen

On Jan 18, 2007, at 4:42 PM, Barry Evans wrote:

> Hi,
>
> We tried running with 32 and 16 and had some success, but after a
> reboot of the cluster it seems that any DLPOLY run attempted falls
> over, either interactively or through SGE. Standard benchmarks such
> as IMB and HPL execute to completion.
>
> Here's the full output of a typical error:
>
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> 17 additional processes aborted (not shown)
>
> Cheers,
> Barry
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org]
> On Behalf Of Pak Lui
> Sent: 17 January 2007 19:16
> To: Open MPI Users
> Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
> DLPOLY[Scanned]
>
> Sorry for jumping in late.
>
> I was able to use ~128 SGE slots for my test run, with either of the
> SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
> MPI program, but I wasn't able to reproduce your error on Solaris.
> Like Brian said, having the stack trace could help. Also, I wonder if
> you can try a non-MPI program, a smaller number of slots, or a
> smaller -np to see if you still hit the issue?
>
> Brian W. Barrett wrote:
>> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>>
>>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>>> (Open MPI 1.2b2, PathScale Fortran compilers and gcc C/C++). In
>>> general I am much luckier running DLPOLY interactively than
>>> through SGE. The error that I got is: Signal:7 info.si_errno:0
>>> (Success) si_code:2() [1]. A previous message on the list [2] says
>>> that this is most likely a configuration problem. But what kind of
>>> configuration? Is it at run time?
>>
>> Could you include the entire stack trace next time? That can help
>> localize where the error is occurring. The message says that a
>> process died from signal 7, which on Linux is a bus error. This
>> usually points to memory errors, either in Open MPI or in the user
>> application. Without the stack trace, it's difficult to pin down
>> where the error is occurring.
>>
>>> Another error that I got sometimes is related to "writev" [3],
>>> but this is pretty rare.
>>
>> These usually point to some process in the job dying and the other
>> processes having trouble completing outstanding sends to the dead
>> process. I would guess that the problem originates with the bus
>> error you are seeing. Cleaning that up will likely make these
>> errors go away.
>>
>> Brian
>>
>>> [1]
>>> [ocf@master TEST2]$ mpirun -np 16 --hostfile
>>> /home/ocf/SRIFBENCH/DLPOLY3/data/nodes_16_slots4.txt
>>> /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y
>>> Signal:7 info.si_errno:0(Success) si_code:2()
>>> Failing at addr:0x5107b0
>>> (...)
>>>
>>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>>
>>> [3]
>>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>>> signal 48.
>>> 15 additional processes aborted (not shown)
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> --
>
> Thanks,
>
> - Pak Lui
> pak....@sun.com
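An aside on the two numeric codes that recur in this thread: both can be decoded programmatically rather than memorized. A minimal Python sketch, assuming a Linux host (where signal 7 is SIGBUS and errno 104 is ECONNRESET, matching Brian's diagnosis of the `Signal:7` and `writev failed with errno=104` messages):

```python
import errno
import os
import signal

# "Signal:7" from the crash output: on Linux this is SIGBUS (bus error),
# which typically indicates a misaligned or otherwise invalid memory access.
print(signal.Signals(7).name)  # SIGBUS on Linux

# "errno=104" from the writev failures: ECONNRESET, i.e. a peer process
# died and the kernel reset its TCP connection while sends were pending.
print(errno.errorcode[104])            # ECONNRESET on Linux
print(os.strerror(errno.ECONNRESET))   # human-readable description
```

This lines up with the explanation above: the writev/ECONNRESET noise is secondary, produced by surviving ranks trying to send to a peer that already died from the bus error.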