Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-09 Thread Samuel K. Gutierrez
On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:
I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. I

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. It is relatively easy to backport that code to the 1.4

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis
On 09/02/2011, at 9:16 AM, Ralph Castain wrote:
> See below
>
>
> On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
>
>>
>> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>>
>>> Hi Michael,
>>>
>>> You may have tried to send some debug information to the list, but it
>>> appears to

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
See below
On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
>
> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>
>> Hi Michael,
>>
>> You may have tried to send some debug information to the list, but it
>> appears to have been blocked. Compressed text output of the backtrace text
>

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis
On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears
> to have been blocked. Compressed text output of the backtrace text is
> sufficient.
Odd, I thought I sent it to you directly. In any case,

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis
On 09/02/2011, at 2:38 AM, Ralph Castain wrote:
> Another possibility to check - are you sure you are getting the same OMPI
> version on the backend nodes? When I see it work on local node, but fail
> multi-node, the most common problem is that you are picking up a different
> OMPI version due

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes. On Feb 8, 201
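A minimal sketch of the version check suggested above, assuming you have already collected one "hostname version" pair per node (for example by running `which mpirun` and `ompi_info` on each allocated node; the collection step is not shown). The function name `check_ompi_versions`, the input format, and the second hostname and version in the sample data are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch only: flag nodes whose reported Open MPI version differs from the
# first one seen. Input on stdin: one "hostname version" pair per line
# (an assumed format, not something Open MPI or SLURM emits directly).
check_ompi_versions() {
    awk '
        NR == 1   { ref = $2 }   # first line sets the reference version
        $2 != ref { printf "MISMATCH %s: %s (expected %s)\n", $1, $2, ref; bad = 1 }
        END       { exit bad }   # non-zero exit status if any node differs
    '
}

# Made-up sample data: "ipc3" and "1.4.1" are illustrative, not from the thread.
printf 'ipc 1.4.3\nipc3 1.4.1\n' | check_ompi_versions || echo "version mismatch detected"
```

A non-zero exit status makes the check easy to use as a guard in a job script before calling mpirun.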

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Samuel K. Gutierrez
Hi Michael,
You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Samuel K. Gutierrez
Hi,
A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us?
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Ralph Castain
The 1.4 series is regularly tested on slurm machines after every modification, and has been running at LANL (and other slurm installations) for quite some time, so I doubt that's the core issue. Likewise, nothing in the system depends upon the FQDN (or anything regarding hostname) - it's just us

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis
On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
>
> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>
> Hi,
>
>> I just tried to reproduce the problem that you are experiencing and was
>> unable to.
>>
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with:
>> --with-platform=./contrib/p

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis
On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
Hi,
> I just tried to reproduce the problem that you are experiencing and was
> unable to.
>
> SLURM 2.1.15
> Open MPI 1.4.3 configured with:
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
I compiled OpenMPI 1.4.3 (vanilla

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis
On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
> I just tried to reproduce the problem that you are experiencing and was
> unable to.
>
>
> SLURM 2.1.15
> Open MPI 1.4.3 configured with:
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I'll dig a bit further.
Intere

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-03 Thread Samuel K. Gutierrez
Hi, I just tried to reproduce the problem that you are experiencing and was unable to.
[samuel@lo1-fe ~]$ salloc -n32 mpirun --display-map ./mpi_app
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 138319
salloc: job 138319 queued and waiting for resource

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-02 Thread Samuel K. Gutierrez
Hi,
We'll try to reproduce the problem.
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote:
On 28/01/2011, at 8:16 PM, Michael Curtis wrote:
On 27/01/2011, at 4:51 PM, Michael Curtis wrote:
Some more debugging information:
Is

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-02 Thread Michael Curtis
On 28/01/2011, at 8:16 PM, Michael Curtis wrote:
>
> On 27/01/2011, at 4:51 PM, Michael Curtis wrote:
>
> Some more debugging information:
Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-01-28 Thread Michael Curtis
On 27/01/2011, at 4:51 PM, Michael Curtis wrote:
Some more debugging information:
> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
> JOB MAP
Backtrace with debugging symbols
#0 0x77bb5c1e in ?? () from /usr/li

[OMPI users] Segmentation fault with SLURM and non-local nodes

2011-01-27 Thread Michael Curtis
Hi, I'm not sure whether this problem is with SLURM or OpenMPI, but the stack traces (below) point to an issue within OpenMPI. Whenever I try to launch an MPI job within SLURM, mpirun immediately segmentation faults -- but only if the machine that SLURM allocated to MPI is different to the one