Gus, I am using this system: http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact configuration of the file system. Here is the output of "df -h":

Filesystem                                  Size  Used Avail Use% Mounted on
/dev/sda6                                   919G   16G  857G   2% /
tmpfs                                        32G     0   32G   0% /dev/shm
/dev/sda5                                   139M   33M  100M  25% /boot
adfs3v-s:/adfs3/hafs14                      6.5T  678G  5.5T  11% /scratch
adfs3v-s:/adfs3/hafs16                      6.5T  678G  5.5T  11% /var/spool/mail
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1    1.2P  136T  1.1P  12% /work1
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4  1.2P  793T  368T  69% /work4
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3  1.2P  509T  652T  44% /work3
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2   1.2P  521T  640T  45% /work2
panfs://172.16.0.10/CWFS                    728T  286T  443T  40% /p/cwfs
panfs://172.16.1.61/CWFS1                   728T  286T  443T  40% /p/CWFS1
panfs://172.16.0.210/CWFS2                  728T  286T  443T  40% /p/CWFS2
panfs://172.16.1.125/CWFS3                  728T  286T  443T  40% /p/CWFS3
panfs://172.16.1.224/CWFS4                  728T  286T  443T  40% /p/CWFS4
panfs://172.16.1.224/CWFS5                  728T  286T  443T  40% /p/CWFS5
panfs://172.16.1.224/CWFS6                  728T  286T  443T  40% /p/CWFS6
panfs://172.16.1.224/CWFS7                  728T  286T  443T  40% /p/CWFS7
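Beyond "df -h", a specific directory can also be checked directly. A minimal sketch, assuming GNU coreutils stat and POSIX df are available on the login node, using my simulation directory /work3/yanb as the example:

    # Show which filesystem backs the directory: a server prefix in the
    # first column (host:/path for NFS, IP@o2ib:... for Lustre, panfs://
    # for Panasas) means a network mount; a plain /dev/... device is local.
    df -P /work3/yanb | tail -1

    # Print the filesystem type directly, e.g. "lustre", "nfs", or "ext2/ext3".
    stat -f -c %T /work3/yanb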
1. My home directory is /home/yanb. My simulation files are located at /work3/yanb. The default TMPDIR set by the system is just /work3/yanb.

2. I did try not setting TMPDIR and letting it default; that is exactly what cases 1 and 2 below do (the export line is commented out).

Case 1:
#export TMPDIR=/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
The job fails with no apparent reason.

Case 2:
#export TMPDIR=/home/yanb/tmp
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
The job gives the warning about a shared-memory file on a network file system.

3. With "export TMPDIR=/tmp", the job fails the same way: no apparent reason.

4. FYI, "ls /" gives:
ELT    apps  cgroup  hafs1   hafs12  hafs2  hafs5  hafs8        home   lost+found  mnt  p      root     selinux  tftpboot  var    work3
admin  bin   dev     hafs10  hafs13  hafs3  hafs6  hafs9        lib    media       net  panfs  sbin     srv      tmp       work1  work4
app    boot  etc     hafs11  hafs15  hafs4  hafs7  hafs_x86_64  lib64  misc        opt  proc   scratch  sys      usr       work2  workspace

Beichuan
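P.S. For completeness, the relevant fragment of my job script with case 1 active looks roughly like this (a sketch: the "#PBS -l select" line is only my paraphrase of the 64-process, 16-per-node layout, not the exact line from my script; the other lines are taken from the cases above):

    #!/bin/bash
    # Paraphrased resource request: 4 nodes x 16 ranks = 64 MPI processes.
    #PBS -l select=4:ncpus=16:mpiprocs=16

    # Case 1: TMPDIR stays at the system default (/work3/yanb, on Lustre);
    # the TCP BTL is restricted to the 10.148.0.0/16 network.
    #export TMPDIR=/home/yanb/tmp
    TCP="--mca btl_tcp_if_include 10.148.0.0/16"

    mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt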
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 17:24
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

If you are using the university cluster, chances are that /home is not local, but on an NFS share, or perhaps Lustre (which you may have mentioned before, I don't remember). Maybe "df -h" will show what is local and what is not. It works for NFS, which prefixes file systems with the server name, but I don't know about Lustre.

Did you try just not setting TMPDIR and letting it default?

If the default TMPDIR is on Lustre (did you say this? anyway, I don't remember), you could perhaps try to force it to /tmp:

export TMPDIR=/tmp

If the cluster nodes are diskfull, /tmp is likely to exist and be local to the cluster nodes. [But the cluster nodes may be diskless ... :( ]

I hope this helps,
Gus Correa

On 03/03/2014 07:10 PM, Beichuan Yan wrote:
> How to set TMPDIR to a local filesystem? Is /home/yanb/tmp a local
> filesystem? I don't know how to tell whether a directory is on a local
> file system or a network file system.
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Monday, March 03, 2014 16:57
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> How about setting TMPDIR to a local filesystem?
>
>
> On Mar 3, 2014, at 3:43 PM, Beichuan Yan <beichuan....@colorado.edu> wrote:
>
>> I agree there are two cases for pure-MPI mode: 1. The job fails with no
>> apparent reason; 2. The job complains about a shared-memory file on a
>> network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp"
>> (/home/yanb/tmp is my local directory). The default TMPDIR points to a
>> Lustre directory.
>>
>> There is no other output. I checked my job with "qstat -n" and found
>> that the processes were actually not started on the compute nodes even
>> though PBS Pro had "started" my job.
>>
>> Beichuan
>>
>>> 3. Then I test pure-MPI mode: OPENMP is turned off, and each compute
>>> node runs 16 processes (so the shared-memory transport of MPI is
>>> clearly used). Four combinations of "TMPDIR" and "TCP" are tested.
>>> Case 1:
>>> #export TMPDIR=/home/yanb/tmp
>>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
>>> output:
>>> Start Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>> End Prologue v2.5 Mon Mar 3 15:47:16 EST 2014
>>> -bash: line 1: 448597 Terminated /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>>> Start Epilogue v2.5 Mon Mar 3 15:50:51 EST 2014
>>> Statistics cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime=00:03:24
>>> End Epilogue v2.5 Mon Mar 3 15:50:52 EST 2014
>>
>> It looks like you have two general cases:
>>
>> 1. The job fails for no apparent reason (like above), or
>> 2. The job complains that your TMPDIR is on a shared filesystem
>>
>> Right?
>>
>> I think the real issue, then, is to figure out why your jobs are failing
>> with no output.
>>
>> Is there anything in the stderr output?
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users