Hello all: I've got a user on our ROCKS 4.3 cluster who's hitting some strange errors. Other users run on the cluster without any such errors being reported, and this user also runs the same code on other clusters without any problems, so I'm not really sure where the problem lies. His job logs contain the following:
--------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is  /mnt/pvfs2/patton/data/chem/aa1
exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
arch directory is  /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...
Terminated
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed). Hit control-C again within 1 second
if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 14126 on node compute-0-23.local
exited on signal 15 (Terminated).
[compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
---------

The job was submitted with:
---------
#!/bin/csh
##PBS -N for.chem.aa1
#PBS -l nodes=2
#PBS -l walltime=0:30:00
#PBS -m n
#PBS -j oe
#PBS -o /home/patton/logs
#PBS -e /home/patton/logs
#PBS -V
#
# ------ set case specific parameters
#        and setup directory structure
#
set time=000001_000100
#
set case=aa1
set type=chem
#
# ---- set up directories
#
set SCRATCH=/mnt/pvfs2/patton
mkdir -p $SCRATCH
set datadir=$SCRATCH/data/$type/$case
set execdir=$SCRATCH/exec/$type/$case
set archdir=$SCRATCH/data/$type/$case
set les_output=les.$type.$case.out.$time
set compdir=$HOME/compile/$type/$case
#set compdir=$HOME/compile/free/aa1
echo 'data directory is ' $datadir
echo 'exec directory is ' $execdir
echo 'arch directory is ' $archdir
mkdir -p $datadir
mkdir -p $execdir
#
cd $execdir
rm -fr *
cp $compdir/* .
#
# ------- build machine file for code to read setup
#
# ------------ set imachine=0 for NCAR IBM SP   : bluevista
#                  imachine=1 for NCAR IBM SP   : bluesky
#                  imachine=2 for ASC SGI Altix : eagle
#                  imachine=3 for ERDC Cray XT3 : sapphire
#                  imachine=4 for ASC HP XC     : falcon
#                  imachine=5 for NERSC Cray XT4: franklin
#                  imachine=6 for WSU Cluster   : aeolus
#
set imachine=6
set store_files=1
set OMP_NUM_THREADS=1
#
echo $imachine > mach.file
echo $store_files >> mach.file
echo $datadir >> mach.file
echo $archdir >> mach.file
#
# ---- submit the run
#
mpirun -n 2 ./lesmpi.a > $les_output
#
# ------ clean up
#
mv $execdir/u.* $datadir
mv $execdir/p.* $datadir
mv $execdir/his.* $datadir
cp $execdir/$les_output $datadir
#
echo 'job ended '
exit
# -------------
---------

(It's possible this particular script doesn't exactly match this particular error; the user ran the job, and this is what I assembled from conversations with him. In any case, it's representative of the jobs he's running, and they're all returning similar errors.)

The error occurs at varying time steps in the runs, and if the code is run without MPI it runs fine to completion.
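Would something along these lines be a sensible way to get more detail out of mpirun? This is only a sketch of what I think the relevant knobs are; I haven't checked the MCA parameter names against our 1.2.4 install, so treat them as guesses:

# Re-run the failing case by hand in an interactive Torque job (same
# resources as the batch script) and turn up OpenMPI's verbosity.
# The --mca parameter name below is my guess and should be confirmed
# with 'ompi_info --param btl all' on our install.
qsub -I -l nodes=2 -l walltime=0:30:00
cd /mnt/pvfs2/patton/exec/chem/aa1
mpirun -n 2 --debug-daemons --mca btl_base_verbose 30 ./lesmpi.a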
Here's the version info:

[kusznir@aeolus ~]$ rpm -qa | grep pgi
pgilinux86-64-707-1
openmpi-pgi-docs-1.2.4-1
openmpi-pgi-devel-1.2.4-1
roll-pgi-usersguide-4.3-0
openmpi-pgi-runtime-1.2.4-1
mpich-ethernet-pgi-1.2.7p1-1
pgi-rocks-4.3-0

The OpenMPI rpms were built from the supplied spec (or nearly so, anyway) with the following command line:

CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb --define 'install_in_opt 1' --define 'install_modulefile 1' --define 'modules_rpm_name environment-modules' --define 'build_all_in_one_rpm 0' --define 'configure_options --with-tm=/opt/torque' --define '_name openmpi-pgi' --define 'use_default_rpm_opt_flags 0' openmpi.spec

Any suggestions?

Thanks!
--Jim
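P.S. Would output from checks like the ones below help narrow things down? These are just the commands I can think of to confirm the job is really using this OpenMPI build and that Torque (tm) support got compiled in; the grep patterns are guesses at what the interesting lines look like:

# Which mpirun and which Open MPI version are actually on the PATH
which mpirun
ompi_info | grep "Open MPI:"
# Did the build pick up Torque (tm) support from --with-tm?
ompi_info | grep -i " tm "
# Which MPI library the user's binary is linked against (only useful if dynamic)
ldd ./lesmpi.a | grep -i mpi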