This is very odd. The two error messages you are seeing are side effects of the real problem, which is that Open MPI is segfaulting when built with the Intel compiler. We've had some problems with bugs in various versions of the Intel compiler -- just to be on the safe side, can you make sure that the machine has the latest bug fixes from Intel applied? From there, if possible, it would be extremely useful to have a stack trace from a core file, or even to know whether it's mpirun or one of our "orte daemons" that is segfaulting. If you can get a core file, you should be able to figure out which process is causing the segfault.
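
A rough sketch of how that might look (the install path and core file names below are only placeholders -- adjust them for your setup and kernel core-pattern settings):
----------------------------------------------------
# in the batch script, before the mpiexec line: allow core files to be written
ulimit -c unlimited

# after a failed run, see which executable produced each core file
file core.*

# then pull a backtrace from it, e.g. if it turns out to be mpirun:
gdb /usr/local/openmpi/1.1.2/intel/i386/bin/mpirun core.<pid>
(gdb) bt
----------------------------------------------------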

Brian

On Feb 2, 2007, at 4:07 PM, Dennis McRitchie wrote:

When I submit a simple job (described below) using PBS, I always get one
of the following two errors:
1) [adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104

2) [adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=3770)

The program does a uname and prints out results to standard out. The
only MPI calls it makes are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize. I have tried it with both Open MPI 1.1.2 and 1.1.4, built with the Intel C compiler 9.1.045, and get the same results. But if I build
the same versions of openmpi using gcc, the test program always works
fine. The app itself is built with mpicc.
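
For reference, a minimal sketch of that kind of test program (the file name, use of uname(2), and output format here are illustrative, not the exact code) would be something like:
----------------------------------------------------
/* uname_test.c -- illustrative sketch of the test program described above */
#include <stdio.h>
#include <sys/utsname.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    struct utsname info;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uname(&info);   /* "does a uname" and prints the result to stdout */
    printf("rank %d of %d on %s (%s %s)\n",
           rank, size, info.nodename, info.sysname, info.release);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------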

It runs successfully if run from the command line with "mpiexec -n X
<test-program-name>", where X is 1 to 8, but if I wrap it in the
following qsub command file:
---------------------------------------------------
#PBS -l pmem=512mb,nodes=1:ppn=1,walltime=0:10:00
#PBS -m abe
# #PBS -o /home0/dmcr/my_mpi/curt/uname_test.gcc.stdout
# #PBS -e /home0/dmcr/my_mpi/curt/uname_test.gcc.stderr

cd /home/dmcr/my_mpi/openmpi
echo "About to call mpiexec"
module list
mpiexec -n 1 uname_test.intel
echo "After call to mpiexec"
----------------------------------------------------

it fails on any number of processors from 1 to 8, and the application
segfaults.

The complete standard error of an 8-processor job follows (note that
mpiexec ran on adroit-31, but usually there is no info about adroit-31
in standard error):
-------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045         4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x5
[0] func:/usr/local/openmpi/1.1.4/intel/i386/lib/libopal.so.0 [0xb72c5b]
*** End of error message ***
^@[adroit-29:03934] [0,0,2]-[0,0,0] mca_oob_tcp_peer_recv_blocking:
recv() failed with errno=104
[adroit-28:03945] [0,0,1]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv()
failed with errno=104
[adroit-30:03770] [0,0,3]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=3770)
--------------------------

The complete standard error of a 1-processor job follows:
--------------------------
Currently Loaded Modulefiles:
  1) intel/9.1/32/C/9.1.045         4) intel/9.1/32/default
  2) intel/9.1/32/Fortran/9.1.040   5) openmpi/intel/1.1.2/32
  3) intel/9.1/32/Iidb/9.1.045
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2
[0] func:/usr/local/openmpi/1.1.2/intel/i386/lib/libopal.so.0 [0x27d847]
*** End of error message ***
^@[adroit-31:08840] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed (errno=111) - retrying (pid=8840)
---------------------------

Any thoughts as to why this might be failing?

Thanks,
       Dennis

Dennis McRitchie
Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory


