Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-09 Thread Andrea Negri
> Does it mean that the process with rank 0 should be bound to core 0, 1, or 2 of socket 1?
>
> I tried to use a rankfile and have a problem. My rankfile contai

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/07/2012 08:02 AM, Jeff Squyres wrote:
> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
> Another thought on hardware errors... I actually have seen bad RAM

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/03/2012 04:39 PM, Andrea Negri wrote:
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
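The limits quoted above are the kind of values `ulimit -a` reports. As a rough sketch only, the same soft limits can be read on each node with Python's standard `resource` module (Unix only). The helper names `format_limit` and `soft_limits` are made up for illustration and do not come from the thread.

```python
import resource


def format_limit(value, unit_bytes=1):
    """Render a limit the way ulimit does: 'unlimited' or a scaled integer."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // unit_bytes)


def soft_limits():
    """Return the soft limits for locked memory (in kbytes) and open files."""
    memlock_soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        # The kernel reports RLIMIT_MEMLOCK in bytes; ulimit -l shows kbytes.
        "max locked memory (kbytes)": format_limit(memlock_soft, 1024),
        "open files": format_limit(nofile_soft),
    }


if __name__ == "__main__":
    for name, value in soft_limits().items():
        print(f"{name}: {value}")
```

Running this through the same launcher as the MPI job (rather than in an interactive shell) would show the limits the job actually inherits, which may differ from the login shell's.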

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:
> I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but the program doesn't crash: a node goes down and the rest of them remain waiting for a signal (there is an ALLREDUCE in the code).
>
> Anyway, yesterday some processes died (without

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Andrea Negri
George, I have made some modifications to the code; however, this is the first part of my zmp_list:
!ZEUSMP2 CONFIGURATION FILE
LGEOM= 2, LDIMEN = 2 /
LRAD = 0, XHYDRO = .TRUE., XFORCE = .TRUE., XMHD = .false.,

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-05 Thread George Bosilca
Andrea, As suggested by the previous answers, I guess the size of your problem is too large for the memory available on the nodes. I can run ZeusMP without any issues up to 64 processes, both over Ethernet and InfiniBand. I tried the 1.6 and the current trunk, and both perform as expected.

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-05 Thread Andrea Negri
I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but the program doesn't crash: a node goes down and the rest of them remain waiting for a signal (there is an ALLREDUCE in the code). Anyway, yesterday some processes died (without a log) on node 10. I logged in almost immediately to the

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-04 Thread David Warren
Message: 1
Date: Fri, 31 Aug 2012 20:11:41 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
To: Open MPI Users <us...@open

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
>> all the time.
>>
>> I configured with:
>>
>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge
>>
>> and adjusted my PATHs accord

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
>>> users-ow...@open-mpi.org
>>>
>>> When replying, please edit your Subject line so it is more specific than "Re: Contents of users digest..."
>>>
>>> Today's Topics:
>>>
>>> 1. Re: some mpi processes "dis

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
s)
>> 2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri)
>>
>> ------------------
>>
>> Message: 1
>> Date: Sat, 1 Sep 2012 08:48:56 +0100
>> From: John Hearns <hear...@googlemail.com>

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
> Message: 1
> Date: Sat, 1 Sep 2012 08:48:56 +0100
> From: John Hearns <hear...@googlemail.com>
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-01 Thread John Hearns
Apologies, I have not taken the time to read your comprehensive diagnostics! As Gus says, this sounds like a memory problem. My suspicion would be the kernel Out Of Memory (OOM) killer. Log into those nodes (or ask your systems manager to do this). Look closely at /var/log/messages where there
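For the log check suggested in this message, a minimal sketch in Python might look like the following. It assumes the kernel log is readable at `/var/log/messages` (the path named in the message; other distributions may log elsewhere) and that OOM-killer entries contain phrases such as "invoked oom-killer", "Out of memory", or "Killed process". The exact wording varies between kernel versions, so the pattern is deliberately loose, and the names `find_oom_events` and `scan_log` are illustrative only.

```python
import re

# Phrases the Linux kernel typically logs when the OOM killer fires.
# Wording differs across kernel versions, so this is a loose, assumed pattern.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|out of memory|killed process", re.IGNORECASE
)


def find_oom_events(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file; the default path is an assumption, not a given."""
    with open(path, errors="replace") as log:
        return find_oom_events(log)
```

Calling `scan_log()` on each compute node (usually as root, since the log may not be world-readable) and printing the result would list candidate entries to compare against the time the MPI processes vanished.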

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Gus Correa
Hi Andrea, I would guess this is a memory problem. Do you know how much memory each node has? Do you know how much memory each MPI process in the CFD code requires? If the program starts swapping/paging to disk because of low memory, those interesting things that you described can happen. I

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Andrea Negri
Hi, I have been in trouble for a year. I run a pure MPI (no OpenMP) Fortran fluid dynamics code on a cluster of servers, and I see strange behaviour when running the code on multiple nodes. The cluster consists of 16 PCs (1 PC is a node), each with a dual-core processor. Basically, I'm able to run