You might want to run your app through a memory-checking debugger to see if anything obvious shows up.

Also, check whether your core file size limit is greater than zero (or simply set it to "unlimited"). Then run again and see whether you get any corefiles; if your app is silently dumping core, they would give you a clue as to what is going on.
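
For what it's worth, here is a minimal sketch of that check in Python (my choice of language; nothing in this thread uses it). It reads the core file size limit of the current process and raises the soft limit to the hard limit. The helper names describe and raise_core_limit are made up for illustration.

```python
import resource


def describe(limit):
    """Render an rlimit value the way 'ulimit -c' would."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    Returns the (soft, hard) pair that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    print("before: soft=%s hard=%s" % (describe(soft), describe(hard)))
    soft, hard = raise_core_limit()
    print("after:  soft=%s hard=%s" % (describe(soft), describe(hard)))
```

A soft limit of 0 means no corefiles will be written. The limit is per process and is inherited by child processes, so what matters is the value seen by the MPI processes on each node; depending on how they are launched, that may differ from the value in your interactive shell.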

These are the areas where I typically start with parallel debugging; hopefully this is at least somewhat helpful...


On Nov 1, 2007, at 2:59 PM, Karsten Bolding wrote:

This is not OpenMPI specific - but maybe somebody on the list can give a
hint.

I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT. 0096x0096

Everything starts OK and the simulation carries on for 2+ hours of
wall clock time - then suddenly, without a trace in the logfile:

   19:48:46.172 n=        1800
           2003-09-01 05:06:00: reading 2D boundary data ...
   19:49:21.710 n=        1900
   19:49:50.490 n=        2000

or in any system logfiles, the simulation stops and all related processes
on the nodes stop.

If I re-run, the simulation does not stop at the same time.

Does anybody have a clue where I should search?

I use a 4 machine/dual P/dual core cluster connected via GBit/s ethernet.

Karsten

PS: If I use MPICH I get the same problem.


--
----------------------------------------------------------------------
Karsten Bolding                    Bolding & Burchard Hydrodynamics
Strandgyden 25                     Phone: +45 64422058
DK-5466 Asperup                    Fax:   +45 64422068
Denmark                            Email: kars...@bolding-burchard.com

http://www.findvej.dk/Strandgyden25,5466,11,3
----------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
