Let's ignore the 'Sun Grid Engine environment' initially and just figure
out your PETSc install.

- What MPI is it built with? Send us the output from the compile of
  ex19. You say 'make test' worked fine - i.e. this example ran fine in
  parallel. Can you confirm this with a manual run (see the sketch
  below)? [If that's the case, then PETSc is working correctly with the
  MPI specified.]

From the info below, the example crashes happen only in the 'Sun Grid
Engine environment'. What is that? And why should binaries compiled
with this default MPI work in that grid environment without
recompiling against a different 'sun-grid-mpi'?
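For instance, a manual parallel run might look something like the
sketch below. Assumptions, not taken from this thread: the mpiexec on
the PATH is the Open MPI that PETSc was configured against (the logs
further down suggest Open MPI), PETSC_DIR still points at the source
tree, and the process count of 4 is arbitrary.

  # confirm which MPI the PATH picks up; ompi_info is Open MPI's own tool
  which mpiexec mpicc
  ompi_info | head

  # run the already-built tutorial example in parallel, outside any batch system
  cd $PETSC_DIR/src/snes/examples/tutorials
  mpiexec -n 4 ./ex19 -contours

If that runs cleanly, PETSc and the MPI it was built with are
consistent, and the way the grid environment launches the processes
becomes the prime suspect.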
Satish

On Wed, 16 Dec 2009, Kevin.Buckley at ecs.vuw.ac.nz wrote:

> Hi again,
>
> I thought I had got things working, but maybe not, not completely,
> anyway.
>
> I did this and stuff worked:
>
> PETSC_DIR=$PWD; export PETSC_DIR
> ./configure --with-c++-support --with-hdf5=/usr/pkg \
>     --prefix=/vol/grid/pkg/petsc-3.0.0-p7
> PETSC_ARCH=netbsdelf5.0.-c-debug; export PETSC_ARCH
> make all
> make install
> make test
> cd src/snes/examples/tutorials/
> make ex19
> ./ex19 -contours
>
> Nice pictures!
>
> I then moved the example ex19 source and the makefile out of the
> distribution tree to somewhere else, built it against the installed
> stuff, and ran it: that worked too.
>
> export PETSC_DIR=/vol/grid/pkg/petsc-3.0.0-p7
> make ex19
> ./ex19 -dmmg_nlevels 4 -snes_monitor_draw
> ./ex19 -contours
>
> I then built the package that needs PETSc, PISM, from Univ Alaska at
> Fairbanks, and ran that.
>
> What I then found is that the PISM stuff would fail if we launched it
> into a Sun Grid Engine environment with more than TWO processors.
>
> It also ran if simply mpiexec-d onto a four-processor machine, but
> not onto a four-machine grid.
>
> I saw this block of error messages from a 4-node submission:
>
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [2]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors
> [2]PETSC ERROR: likely location of problem given in stack below
> [2]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [2]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [2]PETSC ERROR: INSTEAD the line number of the start of the function
> [2]PETSC ERROR: is given.
> [2]PETSC ERROR: [2] VecScatterCreateCommon_PtoS line 1699 src/vec/vec/utils/vpscat.c
> [2]PETSC ERROR: [2] VecScatterCreate_PtoS line 1508 src/vec/vec/utils/vpscat.c
> [2]PETSC ERROR: [2] VecScatterCreate line 833 src/vec/vec/utils/vscat.c
> [2]PETSC ERROR: [2] DACreate2d line 338 src/dm/da/src/da2.c
> [2]PETSC ERROR: --------------------- Error Message ------------------------------------
> [2]PETSC ERROR: Signal received!
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: Petsc Release Version 3.0.0, Patch 7, Mon Jul 6 11:33:34 CDT 2009
> [2]PETSC ERROR: See docs/changes/index.html for recent updates.
> [2]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [2]PETSC ERROR: See docs/index.html for manual pages.
> [2]PETSC ERROR: ------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: /vol/grid/pkg/pism-0.2.1/bin/pismv on a netbsdelf named citron.ecs.vuw.ac.nz by golledni Wed Dec 16 15:49:09 2009
> [2]PETSC ERROR: Libraries linked from /vol/grid/pkg/petsc-3.0.0-p7/lib
> [2]PETSC ERROR: Configure run at Mon Dec 14 17:02:49 2009
> [2]PETSC ERROR: Configure options --with-c++-support --with-hdf5=/usr/pkg --prefix=/vol/grid/pkg/petsc-3.0.0-p7 --with-shared=0
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: User provided function() line 0 in unknown directory unknown file
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 4365 on
> node citron.ecs.vuw.ac.nz exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> and this block of messages from an 8-node submission:
>
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [3]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors
> [3]PETSC ERROR: likely location of problem given in stack below
> [3]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [2]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors
> [2]PETSC ERROR: likely location of problem given in stack below
> [2]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>
> I then went back and tried to run the PETSc example and found similar
> happenings: things run when submitted to a two-node "grid" but not a
> four-node one, the error message block being:
>
> [0]PETSC ERROR: --------------------- Error Message ------------------------------------
> [0]PETSC ERROR: Out of memory. This could be due to allocating
> [0]PETSC ERROR: too large an object or bleeding by not properly
> [0]PETSC ERROR: destroying unneeded objects.
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [0]PETSC ERROR: Memory allocated 90628 Memory used by process 0
> [0]PETSC ERROR: Try running with -malloc_dump or -malloc_log for info.
> [0]PETSC ERROR: Memory requested 320!
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Petsc Release Version 3.0.0, Patch 7, Mon Jul 6 11:33:34 CDT 2009
> [0]PETSC ERROR: See docs/changes/index.html for recent updates.
> [0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
> [0]PETSC ERROR: See docs/index.html for manual pages.
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: /home/rialto1/kingstlind/kevin/PETSc/ex19 on a netbsdelf named petit-lyon.ecs.vuw.ac.nz by kingstlind Wed Dec 16 16:45:39 2009
> [0]PETSC ERROR: Libraries linked from /vol/grid/pkg/petsc-3.0.0-p7/lib
> [0]PETSC ERROR: Configure run at Mon Dec 14 17:02:49 2009
> [0]PETSC ERROR: Configure options --with-c++-support --with-hdf5=/usr/pkg --prefix=/vol/grid/pkg/petsc-3.0.0-p7 --with-shared=0
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: PetscMallocAlign() line 61 in src/sys/memory/mal.c
> [0]PETSC ERROR: PetscTrMallocDefault() line 194 in src/sys/memory/mtr.c
> [0]PETSC ERROR: PetscFListAdd() line 235 in src/sys/dll/reg.c
> [0]PETSC ERROR: MatRegister() line 140 in src/mat/interface/matreg.c
> [0]PETSC ERROR: MatRegisterAll() line 106 in src/mat/interface/matregis.c
> [0]PETSC ERROR: MatInitializePackage() line 54 in src/mat/interface/dlregismat.c
> [0]PETSC ERROR: MatCreate() line 74 in src/mat/utils/gcreate.c
> [0]PETSC ERROR: DAGetInterpolation_2D_Q1() line 308 in src/dm/da/src/dainterp.c
> [0]PETSC ERROR: DAGetInterpolation() line 879 in src/dm/da/src/dainterp.c
> [0]PETSC ERROR: DMGetInterpolation() line 144 in src/dm/da/utils/dm.c
> [0]PETSC ERROR: DMMGSetDM() line 309 in src/snes/utils/damg.c
> [0]PETSC ERROR: main() line 108 in src/snes/examples/tutorials/ex19.c
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 9757 on
> node petit-lyon.ecs.vuw.ac.nz exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [1]PETSC ERROR: ------------------------------------------------------------------------
> [1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [2]PETSC ERROR: ------------------------------------------------------------------------
> [pulcinella.ecs.vuw.ac.nz:24936] opal_sockaddr2str failed: Unknown error (return code 4)
> [3]PETSC ERROR: ------------------------------------------------------------------------
> [3]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
> [3]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [3]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html#Signal
> [3]PETSC ERROR: or try http://valgrind.org on linux or man libgmalloc on Apple to find memory corruption errors
> [3]PETSC ERROR:
>
> Do the PETSc error messages suggest anything wrong with my PETSc, or
> do they point to underlying problems with the OpenMPI?
>
> Any suggestions/insight welcome,
> Kevin
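Following the suggestions printed in the PETSc error output above, one
way to narrow down whether the fault lies in PETSc itself or in the
launch environment is to rerun the failing case under valgrind, or to
let the faulting rank attach a debugger. A minimal sketch: both PETSc
options are the ones named in the error text above, and valgrind must
be installed on the compute nodes.

  # one valgrind instance per MPI rank; memcheck reports bad memory accesses
  mpiexec -n 4 valgrind -q --tool=memcheck ./ex19 -contours

  # or have the rank that faults attach a debugger
  mpiexec -n 4 ./ex19 -contours -on_error_attach_debugger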

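For reference, a Sun Grid Engine submission of an Open MPI binary
usually looks something like the sketch below. The parallel environment
name ('orte' here) is a site-specific assumption, not taken from this
thread; an Open MPI built with gridengine support detects the SGE
allocation itself once the job runs inside such a PE.

  #!/bin/sh
  #$ -cwd            # run in the submission directory
  #$ -pe orte 4      # request 4 slots from a (site-defined) Open MPI PE
  # NSLOTS is set by SGE to the number of slots actually granted
  mpiexec -n $NSLOTS ./ex19 -contours

If the grid instead launches the job with an mpiexec from a different
MPI than the one PETSc was compiled against, crashes like the ones
above would not be surprising, which is the mismatch Satish asks about.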