On Thu, 17 Dec 2009, Kevin.Buckley at ecs.vuw.ac.nz wrote:

> >> That summary misses the whole point of the errors I am seeing.
> >>
> >> The code runs fine locally AND under Sun Grid Engine, if you only
> >> spawn TWO processes but not FOUR or EIGHT.
> >
> > Well the 'np 2' runs could be scheduled on your local node [or a
> > single SMP remote node].
>
> Well, they "could be", yes: they are not though.
>
> Look, you need to trust me when I tell you things (except for version
> numbers, ha ha).
>
> I would not be bothering you if I had not looked into this to a
> reasonable extent before deciding to bother you.
>
> I am in control of where the jobs are running.
Are you saying the np 2 run is on 2 different machines - and the same
with np 4 and 8?

And what is OpenMPI using to communicate between these? Sockets?
InfiniBand? Something else? Is this a cluster - or a distributed
machine setup?

> > And I suspect there is something wrong in your OpenMPI+SunGridEngine
> > config that's triggering this problem.
>
> I am happy to accept that and I even suggested that might be the case.
>
> I am happy to go and look around the OpenMPI and SGE sources, if that
> turns out to be the case.
>
> However, I came to the PETSc list for some insight from the PETSc
> error messages.
>
> If they can confirm/reject the notion that it might be an SGE/OpenMPI
> issue and not a PETSc one then I will have gained information.
>
> > I don't know exactly how though..
>
> So far, nothing has been confirmed either way.
>
> > [the basic petsc examples are supposed to work in any valid
> > MPI environment].
>
> I don't doubt for a minute that they are supposed to.
>
> I am also aware that few people are likely to be using this
> software stack on NetBSD and thus there may be some gaps in
> your map of "valid MPI environments".
>
> > ok - mpi is shared. Can you confirm that the exact same version of
> > openmpi is installed on all the nodes - and that there are no minor
> > version differences that could trigger this?
>
> Just take that as read.
>
> Are you saying that the error messages PETSc is throwing out ARE
> consistent with a slightly mis-matched MPI then?

There is a SEGV trapped by the PETSc error handler. It doesn't know
exactly where it's happening. You'll have to run in a debugger to get the
exact location of this error and the stack trace. [I suspect the segv is
in OpenMPI code - but only a debugger can confirm/deny it]

Normally you could use -start_in_debugger with the PETSc binary - assuming
the remote nodes can communicate with your X server on the desktop
[directly or via ssh-port-forwarded X11] to do this.

Satish

> I am building an OpenMPI with some debugging in at present. I'll get
> back to you once I have rolled it out across the nodes and have
> some more info.
>
> In the meantime, if you can think of anything I can tickle PETSc with,
> you being familiar with PETSc, so as to get some error messages that
> might tell you something, do let me know.
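
One thing that might be worth trying while you rebuild OpenMPI: a bare-MPI
check along the lines of the sketch below (compile it with your mpicc and
launch it through SGE exactly the way you launch the PETSc example). It
should show which host each rank actually lands on, what MPI level each
copy of the library reports, and whether plain OpenMPI is happy at np 4
and 8 with PETSc out of the picture.

/* mpicheck.c - bare-MPI sanity check: print each rank's host and MPI level,
 * then do one collective so some real communication actually happens.
 * Build:  mpicc -o mpicheck mpicheck.c
 * Run:    mpirun -np 4 ./mpicheck   [through SGE, same as the PETSc example]
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int  rank, size, namelen, major, minor, sum = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);
    MPI_Get_version(&major, &minor);    /* MPI standard level the library claims */

    /* a collective, so the interconnect gets exercised, not just process startup */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d on %s: MPI %d.%d, allreduce sum = %d\n",
           rank, size, host, major, minor, sum);

    MPI_Finalize();
    return 0;
}

If that runs clean at np 4 and 8 across the remote nodes, the finger points
back at the PETSc/OpenMPI combination; if it falls over the same way, then
PETSc is just the messenger and it's an OpenMPI/SGE problem.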

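And if forwarding X11 from the compute nodes back to your desktop is not
practical, so -start_in_debugger is out, one alternative (a rough sketch,
nothing PETSc-specific about it) is to park the crashing rank in a SIGSEGV
handler and attach gdb by hand on whichever node it reports:

/* Sketch: install a SIGSEGV handler that says where the crash is and then
 * spins, so you can ssh to that node, run 'gdb -p <pid>', and get a backtrace.
 * fprintf/sleep are not strictly async-signal-safe, but for a one-off
 * debugging session this is good enough in practice.
 */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>

static int  my_rank = -1;
static char my_host[MPI_MAX_PROCESSOR_NAME] = "unknown";

static void segv_handler(int sig)
{
    fprintf(stderr, "rank %d on %s (pid %d): caught signal %d, waiting for gdb\n",
            my_rank, my_host, (int)getpid(), sig);
    for (;;) sleep(5);   /* park the process so a debugger can attach */
}

int main(int argc, char **argv)
{
    int namelen;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Get_processor_name(my_host, &namelen);
    signal(SIGSEGV, segv_handler);

    /* ... the calls that blow up at np 4/8 would go here ... */

    MPI_Finalize();
    return 0;
}

Once attached, 'bt' in gdb gives the stack trace, which should settle
whether the segv is inside OpenMPI or in PETSc.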