On Thu, 17 Dec 2009, Kevin.Buckley at ecs.vuw.ac.nz wrote:

> > Ok - the code runs locally fine - but not on 'SunGridEngine'
>
> Not Ok.
>
> That summary misses the whole point of the errors I am seeing.
>
> The code runs fine locally AND under Sun Grid Engine, if you only
> spawn TWO processes but not FOUR or EIGHT.
Well, the 'np 2' runs could be scheduled on your local node [or a
single SMP remote node]. So it could be that a different code path
within the MPI library gets used in the 2 vs 4 process case [shared
memory vs tcp/some-other communication]. Perhaps you can get the
nodefile list for each of these [2,4,8 proc] runs and see how the
2-proc run differs. [petsc only] [a minimal sketch of such a
placement check is appended after this message]

And I suspect there is something wrong in your OpenMPI+SunGridEngine
config that's triggering this problem. I don't know exactly how,
though [the basic PETSc examples are supposed to work in any valid
MPI environment].

> > Wrt SGE - what does it require from MPI. Is it MPI agnostic - or does
> > it need a particular MPI to be used?
>
> It is more the other way around.
>
> OpenMPI has been compiled so as to be aware of SGE.

ok.

> But anyroad, what are the error messages, from PETSc, telling you
> is possibly going wrong here?

>>>>>>>>>
[2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[2]PETSC ERROR: [2] VecScatterCreateCommon_PtoS line 1699 src/vec/vec/utils/vpscat.c
[2]PETSC ERROR: [2] VecScatterCreate_PtoS line 1508 src/vec/vec/utils/vpscat.c
[2]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
<<<<<<<<<

Well, it says there was a SEGV - and it gives some approximate
location. It could be inside the MPI code in those routines listed
here. A run in a debugger will confirm the exact location [assuming
this can be done on this SGE setup].

>>>>>>>>>>
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
[0]PETSC ERROR: Memory allocated 90628 Memory used by process 0
[0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
[0]PETSC ERROR: Memory requested 320!
<<<<<<<<<<<

Malloc failing at this low a memory allocation? Something else is
going wrong here.

> > BTW: what do you have for 'ldd ex19'?
>
> $ldd ex19
> ex19:
>         -lc.12 => /usr/lib/libc.so.12
>         -lXau.6 => /usr/pkg/lib/libXau.so.6
>         -lXdmcp.6 => /usr/pkg/lib/libXdmcp.so.6
>         -lX11.6 => /usr/pkg/lib/libX11.so.6
>         -lltdl.3 => /usr/pkg/lib/libltdl.so.3
>         -lutil.7 => /usr/lib/libutil.so.7
>         -lm.0 => /usr/lib/libm.so.0
>         -lpthread.0 => /usr/lib/libpthread.so.0
>         -lopen-pal.0 => /usr/pkg/lib/libopen-pal.so.0
>         -lopen-rte.0 => /usr/pkg/lib/libopen-rte.so.0
>         -lmpi.0 => /usr/pkg/lib/libmpi.so.0
>         -lmpi_f77.0 => /usr/pkg/lib/libmpi_f77.so.0
>         -lstdc++.6 => /usr/lib/libstdc++.so.6
>         -lgcc_s.1 => /usr/lib/libgcc_s.so.1
>         -lmpi_cxx.0 => /usr/pkg/lib/libmpi_cxx.so.0

ok - MPI is shared. Can you confirm that the exact same version of
OpenMPI is installed on all the nodes - and that there are no minor
version differences that could trigger this? [an MPI-only smoke test
that can help separate an OpenMPI+SGE problem from PETSc is sketched
at the end]

Satish
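
A minimal sketch of the placement check suggested above, assuming
only standard MPI-1 calls and that the same mpicc wrapper used to
build ex19 is available; submit it through the same SGE parallel
environment at -np 2, 4 and 8:

/* placement.c - print where SGE/OpenMPI actually put each rank */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int  rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &namelen);

  /* one line per rank; compare the 2-, 4- and 8-process outputs to
     see when ranks first land on a second node (shared memory vs
     tcp path) */
  printf("rank %d of %d running on %s\n", rank, size, name);

  MPI_Finalize();
  return 0;
}

If the 2-process run stays on one host while the 4- and 8-process
runs spill onto additional nodes, that points at the inter-node (tcp)
code path rather than the shared-memory path.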

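A second sketch along the same lines: a plain-MPI smoke test that
exercises the kind of collective and point-to-point traffic a
VecScatter setup needs, but without any PETSc code, so it can help
tell an OpenMPI+SunGridEngine configuration problem apart from a
PETSc one. Only standard MPI-1 calls are assumed; the file name
smoketest.c is arbitrary.

/* smoketest.c - collective + ring point-to-point sanity check */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, size, left, right, sendval, recvval, sum = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* collective: every rank should report sum == size*(size-1)/2 */
  MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  /* point-to-point ring: each rank sends its rank to the right
     neighbour and receives from the left neighbour */
  right   = (rank + 1) % size;
  left    = (rank + size - 1) % size;
  sendval = rank;
  MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
               &recvval, 1, MPI_INT, left,  0,
               MPI_COMM_WORLD, &status);

  printf("rank %d: allreduce sum = %d, received %d from rank %d\n",
         rank, sum, recvval, left);

  MPI_Finalize();
  return 0;
}

If this also fails at -np 4 or 8 under SGE, the OpenMPI/SGE
integration is the place to look; if it runs cleanly, suspicion
shifts back to the PETSc build or to version differences in the
shared libmpi across the nodes.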