On Mar 3, 2014, at 1:48 PM, Beichuan Yan <beichuan....@colorado.edu> wrote:

> 1. After sysadmin installed libibverbs-devel package, I build Open MPI 1.7.4 
> successfully with the command:
> ./configure 
> --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 
> --with-tm=/opt/pbs/default --with-verbs=/hafs_x86_64/devel/usr 
> --with-verbs-libdir=/hafs_x86_64/devel/usr/lib64
> 
> 2. Then I rebuild and run my job in hybrid MPI/OPENMP mode: each compute node 
> only runs 1 process (this 1 process runs 16 OPENMP threads), it can get 
> initialized and run well each time with $TCP setting as follows, this is 
> great:
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> mpirun $TCP -np 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt

If you're using the native verbs API, you don't need that TCP clause.

Also, if you're running in a PBS job, you don't need the -hostfile clause.  And 
if you're running one process per core in the allocated PBS job, you can skip 
the -np clause, too.  You should be able to run with:

    mpirun ./paraEllip3d input.txt

If you want one process per server, then

    mpirun -np <num_servers> --map-by node ./paraEllip3d input.txt
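
If you'd rather not hard-code the server count, here is a rough sketch of deriving it inside the job script. It assumes $PBS_NODEFILE lists one line per allocated slot, so the number of distinct hostnames is the number of servers; count_servers is just a helper name I made up for illustration, not anything Open MPI or PBS provides.

```shell
# Hypothetical helper: count distinct hosts in a PBS node file.
# Assumes one hostname per line, repeated once per allocated slot.
count_servers() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Only launch when we are inside a PBS job and mpirun is on the PATH.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpirun >/dev/null 2>&1; then
    num_servers=$(count_servers "$PBS_NODEFILE")
    mpirun -np "$num_servers" --map-by node ./paraEllip3d input.txt
fi
```

For a 4-node allocation with 16 slots each, the node file would have 64 lines and count_servers should report 4.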

> 3. Then I test pure-MPI mode: OPENMP is turned off, and each compute node 
> runs 16 processes (clearly shared-memory of MPI is used). Four combinations 
> of "TMPDIR" and "TCP" are tested:
> case 1:
> #export TMPDIR=/home/yanb/tmp
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d 
> input.txt
> output:
> Start Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
> End Prologue v2.5 Mon Mar  3 15:47:16 EST 2014
> -bash: line 1: 448597 Terminated              
> /var/spool/PBS/mom_priv/jobs/602244.service12.SC
> Start Epilogue v2.5 Mon Mar  3 15:50:51 EST 2014
> Statistics  
> cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
> End Epilogue v2.5 Mon Mar  3 15:50:52 EST 2014

It looks like you have two general cases:

1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem

Right?

I think the real issue, then, is to figure out why your jobs are failing with 
no output.

Is there anything in the stderr output?
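
If the job script isn't already keeping stderr somewhere you can read it, something along these lines might help. It is only a sketch: run_logged and the log prefix are names I made up, and it assumes the job's working directory is writable from the node that runs the script.

```shell
# Hypothetical helper: run a command with stdout and stderr saved to
# separate files, and append the exit status to the stderr file so that
# a silent failure still leaves a trace.
# Usage: run_logged <log_prefix> <command> [args...]
run_logged() {
    prefix=$1
    shift
    "$@" > "$prefix.out" 2> "$prefix.err"
    status=$?
    echo "exit status: $status" >> "$prefix.err"
    return "$status"
}

# In the job script (the prefix "paraEllip3d_run" is an arbitrary choice):
# run_logged paraEllip3d_run mpirun ./paraEllip3d input.txt
```

An exit status of 143 (128 + 15) would mean the command was killed by SIGTERM, which would match the "Terminated" line in your output.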

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
