Re: [OMPI users] openib segfaults with Torque

2014-06-13 Thread Fischer, Greg A.
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa Sent: Wednesday, June 11, 2014 7:13 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque If that could help Greg, on the compute nodes I normally add this to /etc/security/limits.conf

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Gus Correa
re any other workaround that I might try? Something that avoids UDCM? > > -----Original Message----- > From: Fischer, Greg A. > Sent: Tuesday, June 10, 2014 2:59 PM > To: Nathan Hjelm > Cc: Open MPI Users; Fischer, Greg A. > Subject: RE: [

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Martin Siegert
PM, Jeff Squyres (jsquyres) > >> ><jsquy...@cisco.com> wrote: > >> > > >> > Mellanox -- > >> > > >> > What would cause a CQ to fail to be created? > >> > > >> > On Jun 1

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
PM, "Fischer, Greg A." >> > <fisch...@westinghouse.com> wrote: >> > >> > > Is there any other workaround that I might try? Something that >> > avoids UDCM? >> > > >> > > -Original Message---

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Ralph Castain
4, at 3:42 PM, "Fischer, Greg A." > > <fisch...@westinghouse.com> wrote: > > > > > Is there any other workaround that I might try? Something that > > avoids UDCM? > > > > > > -----Original Message----- > >

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
ng that > > avoids UDCM? > > > > > > -----Original Message----- > > > From: Fischer, Greg A. > > > Sent: Tuesday, June 10, 2014 2:59 PM > > > To: Nathan Hjelm > > > Cc: Open MPI Users; Fischer, Greg A.

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Nathan Hjelm
eg > > > > -----Original Message----- > > From: Nathan Hjelm [mailto:hje...@lanl.gov] > > Sent: Tuesday, June 10, 2014 2:58 PM > > To: Fischer, Greg A. > > Cc: Open MPI Users > > Subject: Re: [OMPI users] openib segfa

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Joshua Ladd
er workaround that I might try? Something that avoids > UDCM? > > > > -----Original Message----- > > From: Fischer, Greg A. > > Sent: Tuesday, June 10, 2014 2:59 PM > > To: Nathan Hjelm > > Cc: Open MPI Users; Fischer, Greg A. > > Subject: RE:

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Jeff Squyres (jsquyres)
her, Greg A. > Sent: Tuesday, June 10, 2014 2:59 PM > To: Nathan Hjelm > Cc: Open MPI Users; Fischer, Greg A. > Subject: RE: [OMPI users] openib segfaults with Torque > > [binf316:fischega] $ ulimit -m > unlimited > > Greg > > -----Original Message----- > From

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other workaround that I might try? Something that avoids UDCM? -----Original Message----- From: Fischer, Greg A. Sent: Tuesday, June 10, 2014 2:59 PM To: Nathan Hjelm Cc: Open MPI Users; Fischer, Greg A. Subject: RE: [OMPI users] openib segfaults with Torque [binf316:fischega

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m unlimited Greg -----Original Message----- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:58 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Out of curiosity what is the mlock limit
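A note on the exchange above: in bash, "ulimit -m" reports the maximum resident set size, while the locked-memory (mlock) limit the question asks about is the one reported by "ulimit -l". The following is a minimal sketch, not taken from the thread, of reading that limit from inside a process. It assumes a Linux or other Unix host where Python's standard "resource" module exposes RLIMIT_MEMLOCK; the helper names memlock_limits and format_limit are illustrative only and are not part of Open MPI or Torque.

```python
import resource


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # getrlimit reports bytes, whereas bash "ulimit -l" reports kilobytes.
    print("max locked memory (soft):", format_limit(soft))
    print("max locked memory (hard):", format_limit(hard))
```

Running the script once from an interactive shell and once from within a batch job would show whether the two environments hand the process different limits, which may be worth checking when a problem appears only under the batch system.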

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > > Well, that's interesting. The output shows that ibv_create_cq is failing. > Strange since an identical call had just succeeded (udcm creates two > completion queues). Some questions that might i

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Well, that's interesting. The output shows that ibv_create_cq is failing. Strange since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failure might

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
m: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres > (jsquyres) > Sent: Tuesday, June 10, 2014 10:31 AM > To: Nathan Hjelm > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Greg: > > Can you run with "--mca btl

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
me know if I can provide anything else. Thanks for looking into this, Greg -----Original Message----- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres) Sent: Tuesday, June 10, 2014 10:31 AM To: Nathan Hjelm Cc: Open MPI Users Subject: Re: [OMPI users] openib segf

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Jeff Squyres (jsquyres)
Greg: Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly? On Jun 10, 2014, at 10:25 AM, Nathan Hjelm wrote: > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Nathan Hjelm
On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: > I seem to recall that you have an IB-based cluster, right? > > From a *very quick* glance at the code, it looks like this might be a simple > incorrect-finalization issue. That is: > > - you run the job on a single

Re: [OMPI users] openib segfaults with Torque

2014-06-09 Thread Jeff Squyres (jsquyres)
I seem to recall that you have an IB-based cluster, right? From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is: - you run the job on a single server - openib disqualifies itself because you're running on a single server - openib

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Process 0 decremented value: 0 > Process 0 exiting > Process 1 exiting > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Friday, June 06, 2014 10:34 AM > To: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque >

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Of Ralph Castain Sent: Friday, June 06, 2014 10:34 AM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this is failing when the openib BTL is trying to create the connection, which comes way after

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] > [binf316:21583] [17] ring_c[0x400889] > [binf316:21583] *** End of error message *** > -- > mpirun noticed that process rank 0 with PID 21583 on node 316 exited on > signal 6 (Aborte

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
rs-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Thursday, June 05, 2014 7:57 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Ralph Castain
Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. wrote: > Here’s the command

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.) [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:

[OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared