Btw, I did compile openmpi with the --with-sge flag.
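
A quick way to double-check that the build really picked up SGE support is to look for the gridengine components in the output of ompi_info from the same install. A minimal sketch; the helper name check_sge_support is made up for illustration, and the install path is the one used in qsub.sh:

```shell
# Read an ompi_info listing on stdin and report whether it mentions
# the gridengine components (hypothetical helper, for illustration only).
check_sge_support() {
    if grep -q 'gridengine'; then
        echo "gridengine support found"
    else
        echo "gridengine support missing"
        return 1
    fi
}

# Typical use (needs ompi_info from the install under test):
#   /home/jason/openmpi-1.4.3-install/bin/ompi_info | check_sge_support
```

If nothing matches, the install was probably built without SGE support, or a different ompi_info is being picked up from the PATH.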

I can compile a test program with openf90 with no errors or warnings.
But when I run a test program that just calls MPI_INIT(ierr), then
MPI_COMM_RANK(ierr), I get the following, whether the binary is
statically or dynamically linked, and whether it is run with mpirun
or directly:

[juggling.ucsd.edu:20218] *** An error occurred in MPI_Comm_rank
[juggling.ucsd.edu:20218] *** on communicator MPI_COMM_WORLD
[juggling.ucsd.edu:20218] *** MPI_ERR_COMM: invalid communicator
[juggling.ucsd.edu:20218] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
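
One thing that may be worth ruling out here: an "invalid communicator" error from the first MPI calls can show up when a program is compiled against the mpif.h of one MPI implementation but linked or run with another (for example, leftover MPICH1 paths). A rough sketch of a check, assuming the Open MPI wrapper compiler mpif90 and its --showme option; uses_prefix is a hypothetical helper name:

```shell
# Succeed if the wrapper command line ($1) references the given install
# prefix ($2), i.e. the include and library paths come from that install.
uses_prefix() {
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Typical use:
#   PREFIX=/home/jason/openmpi-1.4.3-install
#   uses_prefix "$("$PREFIX/bin/mpif90" --showme)" "$PREFIX" \
#       || echo "wrapper does not point at $PREFIX"
```

This only checks the wrapper's own command line; an explicit -I option in a Makefile could still pull in another mpif.h.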

Is there something missing in the Linux or parallel environment settings?
Thanks.

-----Original Message-----
From: Jason Palmer [mailto:japalme...@gmail.com] 
Sent: Wednesday, April 06, 2011 4:09 PM
To: 'Open MPI Users'
Subject: SGE and openmpi

Hi,
I am having trouble running a batch job in SGE using openmpi. I have read
the FAQ, which says that openmpi will automatically do the right thing, but
something seems to be wrong.

Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static compilation, whereas I was able to
get openmpi to compile with open64 and compile my program statically.

But I am having problems launching. According to the documentation, I should
be able to have a script file, qsub.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS  myprog

Then,
        $ qsub  qsub.sh
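
Inside a parallel job, SGE describes the granted allocation in the file named by $PE_HOSTFILE (one line per host, with the slot count in the second column) and in $NSLOTS. A small sketch for printing what was actually granted before mpirun starts; count_pe_slots is a made-up helper name:

```shell
# Sum the slot counts (second column) of the SGE PE hostfile given as $1.
count_pe_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Typical use inside the job script, before the mpirun line:
#   echo "NSLOTS=$NSLOTS, hostfile slots=$(count_pe_slots "$PE_HOSTFILE")"
#   cat "$PE_HOSTFILE"
```

If the two numbers agree, the allocation itself is probably fine and the problem lies elsewhere.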

Previously with MPICH1 I would have

        -machinefile $TMP/machines

in the mpirun arguments, with the rest of the script the same except for
-pe mpich 18, and it would work. The -machinefile argument doesn't seem to
work in orte. The error in qsub.sh.o is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 17792 on node
compute-0-0.local exiting without calling "finalize". This may have caused
other processes in the application to be terminated by signals sent by
mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-0.local:17788] 8 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


I ran qconf and got the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The qconf mpich output is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

with specific scripts for start_proc_args and stop_proc_args ...

Am I missing something necessary to run openmpi under SGE?

Thanks very much,
Jason
