I have an OpenMPI program compiled with a version of OpenMPI built using the ifort 10.1 compiler. I can compile and run this code with no problem using the 32-bit version of ifort, and I can also submit batch jobs through Torque with this 32-bit code.
However, compiling the same code to produce a 64-bit executable gives a binary that runs correctly only in the simplest cases. It does not run correctly under the Torque batch queuing system: it runs for a while and then gives a segmentation violation in a section of code that is fine in the 32-bit version.
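
For concreteness, both builds use the same command; the only difference is which OpenMPI installation's compiler wrapper gets picked up (the source file name and flags below are illustrative, not my actual build line):

mpif90 -O2 -g -o MPI_li_32 mpi_li.f90   # wrapper from the OpenMPI built with 32-bit ifort
mpif90 -O2 -g -o MPI_li_64 mpi_li.f90   # wrapper from the OpenMPI built with 64-bit ifort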

I have to run the MPI multi-node jobs under our Torque batch queuing system, but we do have the capability of running jobs in an interactive batch environment.

If I do
qsub -I -l nodes=1:x4gb
I get an interactive session on the remote node assigned to my job. I can run the job using either
./MPI_li_64
or
mpirun -np 1 ./MPI_li_64
and the job runs successfully to completion. I can also start an interactive shell using
qsub -I -l nodes=1:ppn=2:x4gb
and I will get a single dual-processor (or larger) node. On this single node,
mpirun -np 2 ./MPI_li_64
works.
However, if instead I ask for two nodes in my interactive batch session,
qsub -I -l nodes=2:x4gb
two nodes will be assigned to me, but when I enter
mpirun -np 2 ./MPI_li_64
the job runs for a while and then fails with
mpirun noticed that process rank 1 with PID 23104 on node n339 exited on signal 11 (Segmentation fault).
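
For reference, the full sequence that triggers the failure is just the following (the cd to $PBS_O_WORKDIR is the usual Torque idiom for returning to the submission directory; the node names are whatever Torque hands out, n339 above being one of mine):

qsub -I -l nodes=2:x4gb       # interactive session, two nodes
cd $PBS_O_WORKDIR             # back to the directory the job was submitted from
mpirun -np 2 ./MPI_li_64      # one rank per node; rank 1 dies with signal 11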

I can trace this in the Intel debugger and see that the segmentation fault is occurring in what should be good code, code that executes with no problem when everything is compiled 32-bit. I am at a loss as to what could be preventing this code from running in the 64-bit version under the batch queuing environment.
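
In case the exact steps matter, this is roughly how I inspect the crash (a sketch only: core file naming varies by system, and the core from rank 1 lands on the remote node where that rank ran, not on the node I am logged into):

ulimit -c unlimited        # allow core dumps; note this only affects ranks launched
                           # from this shell, remote ranks inherit Torque's limits
mpirun -np 2 ./MPI_li_64   # reproduce the signal 11
gdb ./MPI_li_64 core       # read the backtrace; idb from the Intel tools works similarly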

Jim
