You might want to run your app through a memory-checking debugger to
see if anything obvious shows up.
Also, check that your core file size limit is greater than zero
(ideally, make it "unlimited"). Then run again and see whether any
corefiles appear: if your app is silently dumping core, they should
give you a clue as to what is going on.
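As a rough sketch of the corelimit check (not something from the original report), the small Python script below reads the core file size limit of the current process and raises the soft limit to the hard limit. The function names format_limit and raise_core_limit are made up for illustration, and it assumes a Unix-like system where Python's standard resource module is available.

```python
import resource


def format_limit(value):
    """Render an rlimit value as 'unlimited' or as a byte count."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def raise_core_limit():
    """Raise the soft core file limit to the hard limit.

    Hypothetical helper for illustration. Returns the (soft, hard)
    pair that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to,
        # but not beyond, the hard limit.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = raise_core_limit()
    print("core limit: soft=%s hard=%s"
          % (format_limit(soft), format_limit(hard)))
    if soft == 0:
        # A hard limit of zero cannot be lifted from inside the process.
        print("core dumps are still disabled; the hard limit must be "
              "raised outside this process")
```

Limits are per process and inherited from the parent, so the value that matters is the one seen by the processes that mpirun starts on each node, which may differ from what your login shell reports. Launching a check like this through mpirun, in place of the application, is one way to see the limits the ranks would actually get.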
These are the areas where I typically start with parallel debugging;
hopefully this is at least somewhat helpful...
On Nov 1, 2007, at 2:59 PM, Karsten Bolding wrote:
This is not OpenMPI specific - but maybe somebody on the list can
give a hint.
I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096
Everything starts OK and the simulation carries on for 2+ hours of
wall clock time - then suddenly, without a trace in the logfile:
19:48:46.172 n= 1800
2003-09-01 05:06:00: reading 2D boundary data ...
19:49:21.710 n= 1900
19:49:50.490 n= 2000
or in any system logfiles, the simulation stops and all related
processes on the nodes stop.
If I re-run, the simulation does not stop at the same time.
Does anybody have a clue where I should search?
I use a 4-machine cluster (dual processor, dual core) connected via
Gbit/s Ethernet.
Karsten
PS: If I use MPICH I get the same problem.
--
----------------------------------------------------------------------
Karsten Bolding Bolding & Burchard Hydrodynamics
Strandgyden 25 Phone: +45 64422058
DK-5466 Asperup Fax: +45 64422068
Denmark Email: kars...@bolding-burchard.com
http://www.findvej.dk/Strandgyden25,5466,11,3
----------------------------------------------------------------------
--
Jeff Squyres
Cisco Systems