Hello all:

I've got a user on our ROCKS 4.3 cluster who is hitting some strange
errors.  Other users run on the cluster without reporting any such
errors, but this user also runs the same code on other clusters without
any problems, so I'm not really sure where the problem lies.  His job
logs contain the following:

--------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is  /mnt/pvfs2/patton/data/chem/aa1
exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
arch directory is  /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

Terminated
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed).  Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 14126 on node
compute-0-23.local exited on signal 15 (Terminated).
[compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
---------

The job was submitted with:
---------
#!/bin/csh
##PBS -N for.chem.aa1
#PBS -l nodes=2
#PBS -l walltime=0:30:00
#PBS -m n
#PBS -j oe
#PBS -o /home/patton/logs
#PBS -e /home/patton/logs
#PBS -V
#
# ------ set case specific parameters
#        and setup directory structure
#
set time=000001_000100
#
set case=aa1
set type=chem
#
# ---- set up directories
#
set SCRATCH=/mnt/pvfs2/patton
mkdir -p $SCRATCH

set datadir=$SCRATCH/data/$type/$case
set execdir=$SCRATCH/exec/$type/$case
set archdir=$SCRATCH/data/$type/$case
set les_output=les.$type.$case.out.$time

set compdir=$HOME/compile/$type/$case
#set compdir=$HOME/compile/free/aa1

echo 'data directory is ' $datadir
echo 'exec directory is ' $execdir
echo 'arch directory is ' $archdir

mkdir -p $datadir
mkdir -p $execdir
#
cd $execdir
rm -fr *
cp $compdir/* .
#
# ------- build machine file for code to read setup
#
# ------------ set imachine=0 for NCAR IBM SP    : bluevista
#                  imachine=1 for NCAR IBM SP    : bluesky
#                  imachine=2 for ASC SGI Altix  : eagle
#                  imachine=3 for ERDC Cray XT3  : sapphire
#                  imachine=4 for ASC HP XC      : falcon
#                  imachine=5 for NERSC Cray XT4 : franklin
#                  imachine=6 for WSU Cluster    : aeolus
#
set imachine=6
set store_files=1
set OMP_NUM_THREADS=1
#
echo $imachine > mach.file
echo $store_files >> mach.file
echo $datadir >> mach.file
echo $archdir >> mach.file
#
# ---- submit the run
#
mpirun -n 2 ./lesmpi.a > $les_output
#
# ------ clean up
#
mv $execdir/u.* $datadir
mv $execdir/p.* $datadir
mv $execdir/his.* $datadir
cp $execdir/$les_output $datadir
#
echo 'job ended '
exit
#
-------------
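(The script goes in with a plain qsub; the filename below is just a
stand-in for whatever the user actually calls it.)
---------
# hypothetical submission command; run_aa1.csh is a placeholder name
qsub run_aa1.csh
---------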
(It's possible this particular script doesn't match this particular
error... The user ran the job, and this is what I assembled from
conversations with him.  In any case, it's representative of the jobs
he's running, and they're all returning similar errors.)

The error occurs at a different time step from run to run, and when the
same code is run without MPI it completes without any problems.
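For reference, the non-MPI comparison is (as far as I understand from
the user) just the same binary launched directly as a single process,
roughly like this; the paths are copied from the aa1 case above and the
output filename is only illustrative:
---------
# approximate non-MPI comparison run: same binary, no mpirun
cd /mnt/pvfs2/patton/exec/chem/aa1
./lesmpi.a > les.chem.aa1.serial.out
---------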

Here's the version info:

[kusznir@aeolus ~]$ rpm -qa |grep pgi
pgilinux86-64-707-1
openmpi-pgi-docs-1.2.4-1
openmpi-pgi-devel-1.2.4-1
roll-pgi-usersguide-4.3-0
openmpi-pgi-runtime-1.2.4-1
mpich-ethernet-pgi-1.2.7p1-1
pgi-rocks-4.3-0

The OpenMPI RPMs were built from the supplied spec file (or very nearly
it) using the following command line:
CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
    --define 'install_in_opt 1' \
    --define 'install_modulefile 1' \
    --define 'modules_rpm_name environment-modules' \
    --define 'build_all_in_one_rpm 0' \
    --define 'configure_options --with-tm=/opt/torque' \
    --define '_name openmpi-pgi' \
    --define 'use_default_rpm_opt_flags 0' \
    openmpi.spec
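
In case it's useful, I can also post the full ompi_info output; a quick
check along these lines on a compute node should confirm whether the
Torque/TM support from that configure line actually made it into the
build (I'm not certain of the exact component names in 1.2.x, so the
grep is deliberately loose):
---------
# assumes the openmpi-pgi modulefile is loaded so ompi_info is on PATH
ompi_info | grep -i tm
---------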

Any suggestions?

Thanks!

--Jim
