It's gigabit attached; "pathscale" is there simply to indicate that OMPI was compiled with EKOPath.
- Barry

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Galen Shipman
Sent: 19 January 2007 01:56
To: Open MPI Users
Cc: pak....@sun.com
Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and DLPOLY[Scanned]

Are you using -mca pml cm for pathscale, or are you using openib?

- Galen

On Jan 18, 2007, at 4:42 PM, Barry Evans wrote:

> Hi,
>
> We tried running with 32 and 16 and had some success, but after a
> reboot of the cluster it seems that any DLPOLY run attempted falls
> over, either interactively or through SGE. Standard benchmarks such
> as IMB and HPL execute to completion.
>
> Here's the full output of a typical error:
>
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x5107c0
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> [0] func:/opt/openmpi/pathscale/64/lib/libopal.so.0 [0x2a958b0a68]
> *** End of error message ***
> 17 additional processes aborted (not shown)
>
> Cheers,
> Barry
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org]
> On Behalf Of Pak Lui
> Sent: 17 January 2007 19:16
> To: Open MPI Users
> Subject: Re: [OMPI users] Problems with ompi1.2b2, SGE and
> DLPOLY[Scanned]
>
> Sorry for jumping in late.
>
> I was able to use ~128 SGE slots for my test run, with either of the
> SGE allocation rules ($fill_up or $round_robin) and -np 64 on my test
> MPI program, but I wasn't able to reproduce your error on Solaris.
> Like Brian said, having the stack trace could help. Also, I wonder if
> you can try a non-MPI program, a smaller number of slots, or a
> smaller -np to see if you still hit the issue?
>
> Brian W. Barrett wrote:
>> On Jan 15, 2007, at 10:13 AM, Marcelo Maia Garcia wrote:
>>
>>> I am trying to set up SGE to run DLPOLY compiled with mpif90
>>> (Open MPI 1.2b2, PathScale Fortran compilers and gcc C/C++). In
>>> general I am much luckier running DLPOLY interactively than
>>> through SGE. The error that I got is: Signal:7 info.si_errno:0
>>> (Success) si_code:2() [1]. A previous message on the list [2] says
>>> that this is most likely a configuration problem. But what kind of
>>> configuration? Is it at run time?
>>
>> Could you include the entire stack trace next time? That can help
>> localize where the error is occurring. The message says that a
>> process died from signal 7, which on Linux is a bus error. This
>> usually points to memory errors, either in Open MPI or in the user
>> application. Without the stack trace, it's difficult to pin down
>> where the error is occurring.
>>
>>> Another error that I got sometimes is related to "writev" [3],
>>> but this is pretty rare.
>>
>> These usually point to some process in the job dying and the other
>> processes having trouble completing outstanding sends to the dead
>> process. I would guess that the problem originates with the bus
>> error you are seeing. Cleaning that up will likely make these
>> errors go away.
>>
>> Brian
>>
>>> [1]
>>> [ocf@master TEST2]$ mpirun -np 16 --hostfile
>>> /home/ocf/SRIFBENCH/DLPOLY3/data/nodes_16_slots4.txt
>>> /home/ocf/SRIFBENCH/DLPOLY3/execute/DLPOLY.Y
>>> Signal:7 info.si_errno:0(Success) si_code:2()
>>> Failing at addr:0x5107b0
>>> (...)
>>>
>>> [2] http://www.open-mpi.org/community/lists/users/2007/01/2423.php
>>>
>>> [3]
>>> [node007:05003] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05004] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05005] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node007:05006] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05170] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05171] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05172] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> [node006:05173] mca_btl_tcp_frag_send: writev failed with errno=104
>>> mpirun noticed that job rank 0 with PID 0 on node node003 exited on
>>> signal 48.
>>> 15 additional processes aborted (not shown)
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> --
>
> Thanks,
>
> - Pak Lui
> pak....@sun.com
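An aside on the two numeric codes that recur in this thread: both can be decoded programmatically rather than memorized. A minimal Python sketch, assuming a Linux host (where signal 7 is SIGBUS and errno 104 is ECONNRESET, matching Brian's diagnosis of the `Signal:7` and `writev failed with errno=104` messages):

```python
import errno
import os
import signal

# "Signal:7" from the crash output: on Linux this is SIGBUS (bus error),
# which typically indicates a misaligned or otherwise invalid memory access.
print(signal.Signals(7).name)  # SIGBUS on Linux

# "errno=104" from the writev failures: ECONNRESET, i.e. a peer process
# died and the kernel reset its TCP connection while sends were pending.
print(errno.errorcode[104])            # ECONNRESET on Linux
print(os.strerror(errno.ECONNRESET))   # human-readable description
```

This lines up with the explanation above: the writev/ECONNRESET noise is secondary, produced by surviving ranks trying to send to a peer that already died from the bus error.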