[OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
Hi there,



I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely.  I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr.   Please let me know what I can do.
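
In case it helps, the situation is roughly the one in this hypothetical sketch
(this is not the actual Q-Chem code): one rank hits a fatal error and exits
with a nonzero status without calling MPI_Finalize, while the remaining ranks
are still inside a collective.

/* hang_repro.c - hypothetical minimal sketch, not the real application */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        fprintf(stderr, "fatal error, giving up\n");
        exit(255);                   /* nonzero exit status, no MPI_Finalize */
    }
    MPI_Barrier(MPI_COMM_WORLD);     /* remaining ranks wait here */
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched as "mpirun -np 6 ./hang_repro", this is the kind
of run where I would expect mpirun to abort the job rather than hang.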



Thanks,



-  Lee-Ping



=== Process Tree ===

leeping@vsp-compute-13:~$ ps xjf

PPID   PID  PGID   SID TTY  TPGID STAT   UID   TIME COMMAND 

31969 31977 31969 31969 ?   -1 S    48618   0:00 sshd: leeping@pts/1

31977 31978 31978 31978 pts/1    32038 Ss   48618   0:00  \_ -bash

31978 32038 32038 31978 pts/1    32038 R+   48618   0:00  \_ ps xjf

23667 29307 29307 29307 ?   -1 Ss   48618   0:00 /bin/bash
/home/leeping/temp/leeping-workers/10276/worker1.sh

29307 29308 29307 29307 ?   -1 S    48618   0:00  \_ /bin/bash
/home/leeping/temp/leeping-workers/10276/worker2.sh

29308 29425 29307 29307 ?   -1 S    48618   0:00  \_
./work_queue_worker -d all --cores 6 -t 86400s localhost 9876

29425 26245 26245 29307 ?   -1 S    48618   0:00  |   \_ sh -c
optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge
0 --mult 1 &> optimize.log

26245 26246 26245 29307 ?   -1 Sl   48618   0:01  |   \_
/home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py
initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1

26246 27834 26245 29307 ?   -1 S    48618   0:00  |   \_
/bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2>
optimize.err

27834 27835 26245 29307 ?   -1 S    48618   0:00  |
\_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in
optimize.out optimize.d

27835 27897 26245 29307 ?   -1 S    48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1
-save optimize.in optimize.out optimize.d

27897 27921 26245 29307 ?   -1 S    48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh
optimize.in 6 0 ./optimize.d/ 27897

27921 27926 26245 29307 ?   -1 Sl   48618   0:00  |
\_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1
./optimize.d/



=== Application Error Message ===

100-843.2762335150  5.69E-08  0Convergence failure



Q-Chem fatal error occurred in module
/home/leeping/src/qchem/scfman/scfman.C, line 4377:



SCF failed to converge



Sat Sep 20 23:57:37 2014



=== Standard error ===

leeping@vsp-compute-13:/opt/scratch/leeping/worker-48618-29425/t.62$ cat
optimize.err

[vsp-compute-13:27929] *** Process received signal ***

[vsp-compute-13:27929] Signal: Aborted (6)

[vsp-compute-13:27929] Signal code:  (-6)

[vsp-compute-13:27932] *** Process received signal ***

[vsp-compute-13:27932] Signal: Aborted (6)

[vsp-compute-13:27932] Signal code:  (-6)

[vsp-compute-13:27934] *** Process received signal ***

[vsp-compute-13:27934] Signal: Aborted (6)

[vsp-compute-13:27934] Signal code:  (-6)

[vsp-compute-13:27928] *** Process received signal ***

[vsp-compute-13:27928] Signal: Aborted (6)

[vsp-compute-13:27928] Signal code:  (-6)

[vsp-compute-13:27936] *** Process received signal ***

[vsp-compute-13:27936] Signal: Aborted (6)

[vsp-compute-13:27936] Signal code:  (-6)

[vsp-compute-13:27930] *** Process received signal ***

[vsp-compute-13:27930] Signal: Aborted (6)

[vsp-compute-13:27930] Signal code:  (-6)

[vsp-compute-13:27932] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27932] [ 1] [vsp-compute-13:27928] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27928] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27928] [ 2] [vsp-compute-13:27934] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27934] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27934] [ 2] [vsp-compute-13:27929] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27929] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27929] [ 2] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27932] [ 2] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27932] [ 3] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27928] [ 3]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]

[vsp-compute-13:27928] [ 4]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0x41a0cf5]

[vsp-compute-13:27928] [ 5] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27934] [ 3]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]

[vsp-compute-13:27934] [ 4]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0x41a0cf5]

[vsp-compute-13:27934] [ 5] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27929] [ 3]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]

[vsp-compute-13:27929] [ 4]
/opt/scratch/leeping/opt/qchem-4

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
I'm using version 1.8.1.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 6:56 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Can you please tell us what version of OMPI you are using?





On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang  wrote:





Hi there,



I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely.  I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr.   Please let me know what I can do.



Thanks,



-  Lee-Ping



=== Process Tree ===

leeping@vsp-compute-13:~$ ps xjf

PPID   PID  PGID   SID TTY  TPGID STAT   UID   TIME COMMAND

31969 31977 31969 31969 ?   -1 S48618   0:00 sshd: leeping@pts/1

31977 31978 31978 31978 pts/132038 Ss   48618   0:00  \_ -bash

31978 32038 32038 31978 pts/132038 R+   48618   0:00  \_ ps xjf

23667 29307 29307 29307 ?   -1 Ss   48618   0:00 /bin/bash
/home/leeping/temp/leeping-workers/10276/worker1.sh

29307 29308 29307 29307 ?   -1 S48618   0:00  \_ /bin/bash
/home/leeping/temp/leeping-workers/10276/worker2.sh

29308 29425 29307 29307 ?   -1 S48618   0:00  \_
./work_queue_worker -d all --cores 6 -t 86400s localhost 9876

29425 26245 26245 29307 ?   -1 S48618   0:00  |   \_ sh -c
optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge
0 --mult 1 &> optimize.log

26245 26246 26245 29307 ?   -1 Sl   48618   0:01  |   \_
/home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py
initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1

26246 27834 26245 29307 ?   -1 S48618   0:00  |   \_
/bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2>
optimize.err

27834 27835 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in
optimize.out optimize.d

27835 27897 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1
-save optimize.in optimize.out optimize.d

27897 27921 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh
optimize.in 6 0 ./optimize.d/ 27897

27921 27926 26245 29307 ?   -1 Sl   48618   0:00  |
\_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1
./optimize.d/



=== Application Error Message ===

100-843.2762335150  5.69E-08  0Convergence failure



Q-Chem fatal error occurred in module
/home/leeping/src/qchem/scfman/scfman.C, line 4377:



SCF failed to converge



Sat Sep 20 23:57:37 2014



=== Standard error ===

leeping@vsp-compute-13:/opt/scratch/leeping/worker-48618-29425/t.62$ cat
optimize.err

[vsp-compute-13:27929] *** Process received signal ***

[vsp-compute-13:27929] Signal: Aborted (6)

[vsp-compute-13:27929] Signal code:  (-6)

[vsp-compute-13:27932] *** Process received signal ***

[vsp-compute-13:27932] Signal: Aborted (6)

[vsp-compute-13:27932] Signal code:  (-6)

[vsp-compute-13:27934] *** Process received signal ***

[vsp-compute-13:27934] Signal: Aborted (6)

[vsp-compute-13:27934] Signal code:  (-6)

[vsp-compute-13:27928] *** Process received signal ***

[vsp-compute-13:27928] Signal: Aborted (6)

[vsp-compute-13:27928] Signal code:  (-6)

[vsp-compute-13:27936] *** Process received signal ***

[vsp-compute-13:27936] Signal: Aborted (6)

[vsp-compute-13:27936] Signal code:  (-6)

[vsp-compute-13:27930] *** Process received signal ***

[vsp-compute-13:27930] Signal: Aborted (6)

[vsp-compute-13:27930] Signal code:  (-6)

[vsp-compute-13:27932] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27932] [ 1] [vsp-compute-13:27928] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27928] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27928] [ 2] [vsp-compute-13:27934] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27934] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27934] [ 2] [vsp-compute-13:27929] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27929] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27929] [ 2] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27932] [ 2] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27932] [ 3] /lib64/libc.so.6(abort+0x110)[0x3464431d10]

[vsp-compute-13:27928] [ 3]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]

[vsp-compute-13:27928] [ 4]
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0x41a0cf5]

[vsp-compute-13:27928] [ 5] /lib64/libc.so.6(abort+0x110)[0x

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
My program isn't segfaulting - it's returning a non-zero status and then
exiting.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 8:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Just to be clear: is your program returning a non-zero status and then
exiting, or is it segfaulting?





On Sep 21, 2014, at 8:22 AM, Lee-Ping Wang  wrote:





I'm using version 1.8.1.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 6:56 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Can you please tell us what version of OMPI you are using?





On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:






Hi there,



I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely.  I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr.   Please let me know what I can do.



Thanks,



-  Lee-Ping



=== Process Tree ===

leeping@vsp-compute-13:~$ ps xjf

PPID   PID  PGID   SID TTY  TPGID STAT   UID   TIME COMMAND

31969 31977 31969 31969 ?   -1 S48618   0:00 sshd: leeping@pts/1

31977 31978 31978 31978 pts/132038 Ss   48618   0:00  \_ -bash

31978 32038 32038 31978 pts/132038 R+   48618   0:00  \_ ps xjf

23667 29307 29307 29307 ?   -1 Ss   48618   0:00 /bin/bash
/home/leeping/temp/leeping-workers/10276/worker1.sh

29307 29308 29307 29307 ?   -1 S48618   0:00  \_ /bin/bash
/home/leeping/temp/leeping-workers/10276/worker2.sh

29308 29425 29307 29307 ?   -1 S48618   0:00  \_
./work_queue_worker -d all --cores 6 -t 86400s localhost 9876

29425 26245 26245 29307 ?   -1 S48618   0:00  |   \_ sh -c
optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge
0 --mult 1 &> optimize.log

26245 26246 26245 29307 ?   -1 Sl   48618   0:01  |   \_
/home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py
initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1

26246 27834 26245 29307 ?   -1 S48618   0:00  |   \_
/bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2>
optimize.err

27834 27835 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in
optimize.out optimize.d

27835 27897 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1
-save optimize.in optimize.out optimize.d

27897 27921 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh
optimize.in 6 0 ./optimize.d/ 27897

27921 27926 26245 29307 ?   -1 Sl   48618   0:00  |
\_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1
./optimize.d/



=== Application Error Message ===

100-843.2762335150  5.69E-08  0Convergence failure



Q-Chem fatal error occurred in module
/home/leeping/src/qchem/scfman/scfman.C, line 4377:



SCF failed to converge



Sat Sep 20 23:57:37 2014



=== Standard error ===


leeping@vsp-compute-13:/opt/scratch/leeping/worker-48618-29425/t.62$ cat
optimize.err

[vsp-compute-13:27929] *** Process received signal ***

[vsp-compute-13:27929] Signal: Aborted (6)

[vsp-compute-13:27929] Signal code:  (-6)

[vsp-compute-13:27932] *** Process received signal ***

[vsp-compute-13:27932] Signal: Aborted (6)

[vsp-compute-13:27932] Signal code:  (-6)

[vsp-compute-13:27934] *** Process received signal ***

[vsp-compute-13:27934] Signal: Aborted (6)

[vsp-compute-13:27934] Signal code:  (-6)

[vsp-compute-13:27928] *** Process received signal ***

[vsp-compute-13:27928] Signal: Aborted (6)

[vsp-compute-13:27928] Signal code:  (-6)

[vsp-compute-13:27936] *** Process received signal ***

[vsp-compute-13:27936] Signal: Aborted (6)

[vsp-compute-13:27936] Signal code:  (-6)

[vsp-compute-13:27930] *** Process received signal ***

[vsp-compute-13:27930] Signal: Aborted (6)

[vsp-compute-13:27930] Signal code:  (-6)

[vsp-compute-13:27932] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27932] [ 1] [vsp-compute-13:27928] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27928] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vsp-compute-13:27928] [ 2] [vsp-compute-13:27934] [ 0]
/lib64/libpthread.so.0[0x3464c0eb70]

[vsp-compute-13:27934] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]

[vs

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
Hmm, I didn't know those were segfault reports.  It could indeed be a
segfault if the code isn't exiting properly - but the code really is trying
to exit there with the "SCF failed to converge" error.  Thanks for the help!



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 11:49 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Thanks - I asked because the output you sent shows a bunch of segfault
reports. I'll investigate the non-zero status question



On Sep 21, 2014, at 10:02 AM, Lee-Ping Wang  wrote:





My program isn't segfaulting - it's returning a non-zero status and then
exiting.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 8:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Just to be clear: is your program returning a non-zero status and then
exiting, or is it segfaulting?





On Sep 21, 2014, at 8:22 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:






I'm using version 1.8.1.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 6:56 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Can you please tell us what version of OMPI you are using?





On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:







Hi there,



I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely.  I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr.   Please let me know what I can do.



Thanks,



-  Lee-Ping



=== Process Tree ===

leeping@vsp-compute-13:~$ ps xjf

PPID   PID  PGID   SID TTY  TPGID STAT   UID   TIME COMMAND

31969 31977 31969 31969 ?   -1 S48618   0:00 sshd: leeping@pts/1

31977 31978 31978 31978 pts/132038 Ss   48618   0:00  \_ -bash

31978 32038 32038 31978 pts/132038 R+   48618   0:00  \_ ps xjf

23667 29307 29307 29307 ?   -1 Ss   48618   0:00 /bin/bash
/home/leeping/temp/leeping-workers/10276/worker1.sh

29307 29308 29307 29307 ?   -1 S48618   0:00  \_ /bin/bash
/home/leeping/temp/leeping-workers/10276/worker2.sh

29308 29425 29307 29307 ?   -1 S48618   0:00  \_
./work_queue_worker -d all --cores 6 -t 86400s localhost 9876

29425 26245 26245 29307 ?   -1 S48618   0:00  |   \_ sh -c
optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge
0 --mult 1 &> optimize.log

26245 26246 26245 29307 ?   -1 Sl   48618   0:01  |   \_
/home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py
initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1

26246 27834 26245 29307 ?   -1 S48618   0:00  |   \_
/bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2>
optimize.err

27834 27835 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in
optimize.out optimize.d

27835 27897 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1
-save optimize.in optimize.out optimize.d

27897 27921 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh
optimize.in 6 0 ./optimize.d/ 27897

27921 27926 26245 29307 ?   -1 Sl   48618   0:00  |
\_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1
./optimize.d/



=== Application Error Message ===

100-843.2762335150  5.69E-08  0Convergence failure



Q-Chem fatal error occurred in module
/home/leeping/src/qchem/scfman/scfman.C, line 4377:



SCF failed to converge



Sat Sep 20 23:57:37 2014



=== Standard error ===


leeping@vsp-compute-13:/opt/scratch/leeping/worker-48618-29425/t.62$ cat
optimize.err

[vsp-compute-13:27929] *** Process received signal ***

[vsp-compute-13:27929] Signal: Aborted (6)

[vsp-compute-13:27929] Signal code:  (-6)

[vsp-compute-13:27932] *** Process received signal ***

[vsp-compute-13:27932] Signal: Aborted (6)

[vsp-compute-13:27932] Signal code:  (-6)

[vsp-compute-13:27934] *** Process received signal ***

[vsp-compute-13:27934] Signal: Aborted (6)

[vsp-compute-13:27934] Signal code:  (-6)

[vsp-compute-13:27928] *** Process received signal ***

[vsp-compute-1

Re: [OMPI users] Process is hanging

2014-09-22 Thread Lee-Ping Wang
Hi Ralph,



Thank you, I'll try to reproduce this error today.  Should I recompile my
executable using the new mpicc and libraries, or is using the updated mpirun
sufficient?  



Also, this error occurs inside a fairly complicated workflow so it might
take me some time to find a reproducible failure.



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, September 22, 2014 8:09 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Could you try using the nightly 1.8 tarball? I know there was a problem
earlier in the 1.8 series, but I can't replicate it now with the nightly 1.8
tarball that is about to be released as 1.8.3.



http://www.open-mpi.org/nightly/v1.8/





On Sep 21, 2014, at 12:25 PM, Lee-Ping Wang  wrote:





Hmm, I didn't know those were segfault reports.  It could indeed be a
segfault if the code isn't exiting properly - but the code really is trying
to exit there with the "SCF failed to converge" error.  Thanks for the help!



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 11:49 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Thanks - I asked because the output you sent shows a bunch of segfault
reports. I'll investigate the non-zero status question



On Sep 21, 2014, at 10:02 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:






My program isn't segfaulting - it's returning a non-zero status and then
exiting.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 8:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Just to be clear: is your program returning a non-zero status and then
exiting, or is it segfaulting?





On Sep 21, 2014, at 8:22 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:







I'm using version 1.8.1.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 6:56 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Can you please tell us what version of OMPI you are using?





On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang <leep...@stanford.edu> wrote:








Hi there,



I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely.  I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr.   Please let me know what I can do.



Thanks,



-  Lee-Ping



=== Process Tree ===

leeping@vsp-compute-13:~$ ps xjf

PPID   PID  PGID   SID TTY  TPGID STAT   UID   TIME COMMAND

31969 31977 31969 31969 ?   -1 S48618   0:00 sshd: leeping@pts/1

31977 31978 31978 31978 pts/132038 Ss   48618   0:00  \_ -bash

31978 32038 32038 31978 pts/132038 R+   48618   0:00  \_ ps xjf

23667 29307 29307 29307 ?   -1 Ss   48618   0:00 /bin/bash
/home/leeping/temp/leeping-workers/10276/worker1.sh

29307 29308 29307 29307 ?   -1 S48618   0:00  \_ /bin/bash
/home/leeping/temp/leeping-workers/10276/worker2.sh

29308 29425 29307 29307 ?   -1 S48618   0:00  \_
./work_queue_worker -d all --cores 6 -t 86400s localhost 9876

29425 26245 26245 29307 ?   -1 S48618   0:00  |   \_ sh -c
optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge
0 --mult 1 &> optimize.log

26245 26246 26245 29307 ?   -1 Sl   48618   0:01  |   \_
/home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py
initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1

26246 27834 26245 29307 ?   -1 S48618   0:00  |   \_
/bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2>
optimize.err

27834 27835 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in
optimize.out optimize.d

27835 27897 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1
-save optimize.in optimize.out optimize.d

27897 27921 26245 29307 ?   -1 S48618   0:00  |
\_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh
optimize.in 6 0 ./optimize.d/ 27897

27921 27926 26245 29307 ?   -1 Sl   48618   0:00  |
\_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6
/opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1
./optimize.d/



=== Applic

[OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
Hi there,

I'm building OpenMPI 1.8.3 on a system where I explicitly don't want any of the 
BTL components (they tend to break my single node jobs).  

./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran 
--prefix=$QC_EXT_LIBS/openmpi --enable-static --enable-mca-no-build=btl

Building gives me this error in the vt component - it appears to be expecting 
some Infiniband stuff:

  CCLD otfmerge-mpi
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_free_device_list'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_alloc_pd'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_close_device'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_dealloc_pd'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_destroy_qp'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_create_cq'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_get_sysfs_path'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_open_device'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_create_qp'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_query_device'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_get_device_list'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_get_device_name'
/u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
 undefined reference to `ibv_destroy_cq'
collect2: error: ld returned 1 exit status
make[10]: *** [otfmerge-mpi] Error 1

I've decided to disable the vt component since I doubt I'm using it, but this 
could be good to know.
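
For reference, the configure line I plan to try next is the same as before
with the vt contribution switched off (assuming --disable-vt is the right
option for that):

./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
    --prefix=$QC_EXT_LIBS/openmpi --enable-static \
    --enable-mca-no-build=btl --disable-vt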

Thanks,

- Lee-Ping

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
Hmm, the build doesn't finish - it breaks when trying to create the man page.  
I guess I'll disable only a few specific BTL components that have given me 
issues in the past. 
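
In other words, something along these lines, assuming I have the
framework-component syntax for --enable-mca-no-build right (openib is the
component I suspect, given the ibv_* undefined references):

./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
    --prefix=$QC_EXT_LIBS/openmpi --enable-static \
    --enable-mca-no-build=btl-openib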

Creating ompi_info.1 man page...
  CCLD ompi_info
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_free_device_list'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_alloc_pd'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_close_device'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_dealloc_pd'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_qp'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_cq'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_sysfs_path'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_open_device'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_qp'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_query_device'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_device_list'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_device_name'
../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_cq'
collect2: error: ld returned 1 exit status

Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:27 AM, Lee-Ping Wang  wrote:

> Hi there,
> 
> I'm building OpenMPI 1.8.3 on a system where I explicitly don't want any of 
> the BTL components (they tend to break my single node jobs).  
> 
> ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran 
> --prefix=$QC_EXT_LIBS/openmpi --enable-static --enable-mca-no-build=btl
> 
> Building gives me this error in the vt component - it appears to be expecting 
> some Infiniband stuff:
> 
>   CCLD otfmerge-mpi
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_free_device_list'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_alloc_pd'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_close_device'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_dealloc_pd'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_destroy_qp'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_create_cq'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_get_sysfs_path'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_open_device'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_create_qp'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_query_device'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_get_device_list'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_get_device_name'
> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>  undefined reference to `ibv_destroy_cq'
> collect2: error: ld returned 1 exit status
> make[10]: *** [otfmerge-mpi] Error 1
> 
> I've decided to disable the vt component since I doubt I'm using it, but this 
> could be good to know.
> 
> Thanks,
> 
> - Lee-Ping



Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-29 Thread Lee-Ping Wang
Hi Gus,

Thank you.  I did start from a completely clean directory tree every time (I 
deleted the whole folder and re-extracted the tarball).

I noticed that disabling any of the BTL components resulted in the same error, 
so my solution was to build everything and disable certain components at 
runtime.

- Lee-Ping

On Sep 29, 2014, at 6:03 AM, Gustavo Correa  wrote:

> Hi Lee-Ping 
> 
> Did you cleanup the old build, to start fresh?
> 
> make distclean 
> configure --disable-vt ...
> ...
> 
> I hope this helps,
> Gus Correa
> 
> On Sep 29, 2014, at 8:47 AM, Lee-Ping Wang wrote:
> 
>> Hmm, the build doesn't finish - it breaks when trying to create the man 
>> page.  I guess I'll disable only a few specific BTL components that have 
>> given me issues in the past. 
>> 
>> Creating ompi_info.1 man page...
>>  CCLD ompi_info
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_free_device_list'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_alloc_pd'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_close_device'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_dealloc_pd'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_qp'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_cq'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_sysfs_path'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_open_device'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_qp'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_query_device'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_device_list'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_device_name'
>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_cq'
>> collect2: error: ld returned 1 exit status
>> 
>> Thanks,
>> 
>> - Lee-Ping
>> 
>> On Sep 29, 2014, at 5:27 AM, Lee-Ping Wang  wrote:
>> 
>>> Hi there,
>>> 
>>> I'm building OpenMPI 1.8.3 on a system where I explicitly don't want any of 
>>> the BTL components (they tend to break my single node jobs).  
>>> 
>>> ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran 
>>> --prefix=$QC_EXT_LIBS/openmpi --enable-static --enable-mca-no-build=btl
>>> 
>>> Building gives me this error in the vt component - it appears to be 
>>> expecting some Infiniband stuff:
>>> 
>>>  CCLD otfmerge-mpi
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_free_device_list'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_alloc_pd'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_close_device'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_dealloc_pd'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_destroy_qp'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_create_cq'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_get_sysfs_path'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_open_device'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_create_qp'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_query_device'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_get_device_list'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/vt/../../../.libs/libmpi.so:
>>>  undefined reference to `ibv_get_device_name'
>>> /u/sciteam/leeping/opt/qchem-4.2/ext-libs/openmpi-1.8.3/ompi/contrib/vt/v

[OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
Hi there,

My application uses MPI to run parallel jobs on a single node, so I have no 
need of any support for communication between nodes.  However, when I use 
mpirun to launch my application I see strange errors such as:

--
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--

[nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket for 
out-of-band communications in file oob_tcp_listener.c at line 113
[nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket for 
out-of-band communications in file oob_tcp_component.c at line 584
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_oob_base_select failed
  --> Returned value (null) (-43) instead of ORTE_SUCCESS
--

/home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
/home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]

It seems like in each case, OpenMPI is trying to use some feature related to 
networking and crashing as a result.  My workaround is to deduce the components 
that are crashing and disable them in my environment variables like this:

export OMPI_MCA_btl=self,sm
export OMPI_MCA_oob=^tcp

Is there a better way to do this - i.e. explicitly prohibit OpenMPI from using 
any network-related feature and run only on the local node?
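
For concreteness, here is the kind of invocation I am hoping is possible (the
interface-restriction parameter is a guess on my part, and qcprog.exe's
arguments are elided):

mpirun -np 6 --mca btl self,sm --mca oob_tcp_if_include lo \
    /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe ...

or, equivalently, putting the settings once in ~/.openmpi/mca-params.conf:

btl = self,sm
oob_tcp_if_include = lo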

Thanks,

- Lee-Ping



Re: [OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
Sorry for my last email - I think I spoke too quickly.  I realized after reading 
some more documentation that OpenMPI always uses TCP sockets for out-of-band 
communication, so it doesn't make sense for me to set OMPI_MCA_oob=^tcp.  That 
said, I am still running into a strange problem in my application when running 
on a specific machine (Blue Waters compute node); I don't see this problem on 
any other nodes.

When I run the same job (~5 seconds) in rapid succession, I see the following 
error message on the second execution:

/tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
./qchem24825/
MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
P4_RSHCOMMAND in parallel.csh is ssh
QCOUTFILE is stdout
Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
[nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
[nid15081:24859] Warning: could not find environment variable "QCREF"
initial socket setup ...start
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status, thus 
causing
the job to be terminated. The first process to do so was:

  Process name: [[46773,1],0]
  Exit code:255
--

And here's the source code where the program is exiting (before "initial socket 
setup ...done")

int GPICommSoc::init(MPI_Comm comm0) {

    /* setup basic MPI information */
    init_comm(comm0);

    MPI_Barrier(comm);

    /*-- start inisock and set serveradd[] array --*/
    if (me == 0) {
        fprintf(stdout, "initial socket setup ...start\n");
        fflush(stdout);
    }

    // create the initial socket
    inisock = new_server_socket(NULL, 0);

    // fill and gather the serveraddr array
    int szsock = sizeof(SOCKADDR);
    memset(&serveraddr[0], 0, szsock * nproc);
    int iniport = get_sockport(inisock);
    set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
    // printsockaddr( serveraddr[me] );

    SOCKADDR addrsend = serveraddr[me];
    MPI_Allgather(&addrsend, szsock, MPI_BYTE,
                  &serveraddr[0], szsock, MPI_BYTE, comm);
    if (me == 0) {
        fprintf(stdout, "initial socket setup ...done\n");
        fflush(stdout);
    }

I didn't write this part of the program and I'm really a novice to MPI - but it 
seems like the initial execution of the program isn't freeing up some system 
resource as it should.  Is there something that needs to be corrected in the 
code?
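
If the problem is simply a lingering socket from the previous run, would it
help to set SO_REUSEADDR before binding?  Here is a sketch of what I mean (a
hypothetical helper of my own, not Q-Chem's actual new_server_socket):

/* open a listening socket on the loopback interface, letting the OS pick the port */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_local_listener(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* allow re-binding even if an old socket is still in TIME_WAIT */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(0);   /* port 0: the OS assigns a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}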

Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang  wrote:

> Hi there,
> 
> My application uses MPI to run parallel jobs on a single node, so I have no 
> need of any support for communication between nodes.  However, when I use 
> mpirun to launch my application I see strange errors such as:
> 
> --
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --
> 
> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
> for out-of-band communications in file oob_tcp_listener.c at line 113
> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
> for out-of-band communications in file oob_tcp_component.c at line 584
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_oob_base_select failed
>   --> Returned value (null) (-43) instead of ORTE_SUCCESS
> --
> 
> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
> /home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]
> 
> It seems like in each case, OpenMPI is trying to use some feature related to 
> networking and crashing as a result.  My workaround is to deduce the 
> components that are crashing and disable them in my environment variables 

Re: [OMPI users] General question about running single-node jobs.

2014-09-29 Thread Lee-Ping Wang
Here's another data point that might be useful: the error message is much rarer 
if I run my application on 4 cores instead of 8.

Thanks,

- Lee-Ping

On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang  wrote:

> Sorry for my last email - I think I spoke too quick.  I realized after 
> reading some more documentation that OpenMPI always uses TCP sockets for 
> out-of-band communication, so it doesn't make sense for me to set 
> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem in 
> my application when running on a specific machine (Blue Waters compute node); 
> I don't see this problem on any other nodes.
> 
> When I run the same job (~5 seconds) in rapid succession, I see the following 
> error message on the second execution:
> 
> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
> ./qchem24825/
> MPIRUN in parallel.csh is 
> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
> P4_RSHCOMMAND in parallel.csh is ssh
> QCOUTFILE is stdout
> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
> [nid15081:24859] Warning: could not find environment variable "QCREF"
> initial socket setup ...start
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[46773,1],0]
>   Exit code:255
> --
> 
> And here's the source code where the program is exiting (before "initial 
> socket setup ...done")
> 
> int GPICommSoc::init(MPI_Comm comm0) {
> 
> /* setup basic MPI information */
> init_comm(comm0);
> 
> MPI_Barrier(comm);
> /*-- start inisock and set serveradd[] array --*/
> if (me == 0) {
> fprintf(stdout,"initial socket setup ...start\n");
> fflush(stdout);
> }
> 
> // create the initial socket 
> inisock = new_server_socket(NULL,0);
> 
> // fill and gather the serveraddr array
> int szsock = sizeof(SOCKADDR);
> memset(&serveraddr[0],0, szsock*nproc);
> int iniport=get_sockport(inisock);
> set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
> //printsockaddr( serveraddr[me] );
> 
> SOCKADDR addrsend = serveraddr[me];
> MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>   &serveraddr[0], szsock,MPI_BYTE, comm);
> if (me == 0) {
>fprintf(stdout,"initial socket setup ...done \n"
> );
> fflush(stdout);}
> 
> I didn't write this part of the program and I'm really a novice to MPI - but 
> it seems like the initial execution of the program isn't freeing up some 
> system resource as it should.  Is there something that needs to be corrected 
> in the code?
> 
> Thanks,
> 
> - Lee-Ping
> 
> On Sep 29, 2014, at 5:12 PM, Lee-Ping Wang  wrote:
> 
>> Hi there,
>> 
>> My application uses MPI to run parallel jobs on a single node, so I have no 
>> need of any support for communication between nodes.  However, when I use 
>> mpirun to launch my application I see strange errors such as:
>> 
>> --
>> No network interfaces were found for out-of-band communications. We require
>> at least one available network for out-of-band messaging.
>> --
>> 
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
>> for out-of-band communications in file oob_tcp_listener.c at line 113
>> [nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket 
>> for out-of-band communications in file oob_tcp_component.c at line 584
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> O

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-30 Thread Lee-Ping Wang
Hi Jeff and Ralph,

Thanks.  I'm really a novice user, and in cases like this one I don't really 
know what I'm doing - I just wanted to get my application to run without 
throwing strange error messages and quitting. :)  That said, I would much 
rather learn about the components of MPI than keep taking shots in the dark.

On different clusters where I was getting error messages related to a 
component, the advice from this mailing list was to disable that component.  
Currently I'm building OpenMPI with all components, and my environment 
variables disable the components at runtime: I have OMPI_MCA_ras=^tm and 
OMPI_MCA_btl=self,sm,tcp.  

The latter seems to disable the advanced networking-related components that 
were throwing the errors.  I am not sure how a BTL works for MPI ranks that are 
running on the same node.  Do the different MPI ranks (processes) on a node 
still use a BTL to communicate with each other?  And which one does it prefer 
to use (sm, tcp or something else?)
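
In case it matters, this is how I have been poking at it so far (./mpi_test is
just a stand-in for a small MPI test program, and I may be misreading the
verbose output):

ompi_info | grep btl                 # list the BTL components that were built
mpirun -np 2 --mca btl self,sm --mca btl_base_verbose 100 ./mpi_test
                                     # print which BTL gets selected at runtime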

Thanks,

- Lee-Ping

On Sep 30, 2014, at 7:29 AM, Jeff Squyres (jsquyres)  wrote:

> How can you run MPI jobs at all without any BTLs?  That sounds weird -- this 
> is not a case for which we designed the code base.
> 
> All that being said, you're getting compile errors in the OMPI build because 
> of two things:
> 
> - you selected to build static
> - you didn't disable enough stuff
> 
> Specifically, statically building verbs-based code is not for the meek (see 
> the FAQ).  We have verbs-based code in a few places: the BTLs, and also in 
> the "common" framework.  Hence, the linker errors you are getting are because 
> the "common" verbs component was still built (because it wasn't disabled), 
> and because building statically with verbs is... tricky (see the FAQ).
> 
> You might have better luck with:
> 
> ./configure --enable-mca-no-build=btl,common-verbs ...
> 
> Or, better yet:
> 
> ./configure --enable-mca-no-build=btl --without-verbs ...
> 
> But again, I'm not sure how well OMPI will function without any BTLs.
> 
> 
> 
> On Sep 29, 2014, at 11:47 PM, Ralph Castain  wrote:
> 
>> ompi_info is just the first time when an executable is built, and so it 
>> always is the place where we find missing library issues. It looks like 
>> someone has left incorrect configure logic in the system such that we always 
>> attempt to build Infiniband-related code, but without linking against the 
>> library.
>> 
>> We'll have to try and track it down.
>> 
>> On Sep 29, 2014, at 5:08 PM, Lee-Ping Wang  wrote:
>> 
>>> Hi Gus,
>>> 
>>> Thank you.  I did start from a completely clean directory tree every time 
>>> (I deleted the whole folder and re-extracted the tarball).
>>> 
>>> I noticed that disabling any of the BTL components resulted in the same 
>>> error, so my solution was to build everything and disable certain 
>>> components at runtime.
>>> 
>>> - Lee-Ping
>>> 
>>> On Sep 29, 2014, at 6:03 AM, Gustavo Correa  wrote:
>>> 
>>>> Hi Lee-Ping 
>>>> 
>>>> Did you cleanup the old build, to start fresh?
>>>> 
>>>> make distclean 
>>>> configure --disable-vt ...
>>>> ...
>>>> 
>>>> I hope this helps,
>>>> Gus Correa
>>>> 
>>>> On Sep 29, 2014, at 8:47 AM, Lee-Ping Wang wrote:
>>>> 
>>>>> Hmm, the build doesn't finish - it breaks when trying to create the man 
>>>>> page.  I guess I'll disable only a few specific BTL components that have 
>>>>> given me issues in the past. 
>>>>> 
>>>>> Creating ompi_info.1 man page...
>>>>> CCLD ompi_info
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to 
>>>>> `ibv_free_device_list'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_alloc_pd'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_close_device'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_dealloc_pd'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_destroy_qp'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_cq'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_get_sysfs_path'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_open_device'
>>>>> ../../../ompi/.libs/libmpi.so: undefined reference to `ibv_create_qp'
>>>

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
Hi Ralph,

Thank you.  I think your diagnosis is probably correct.  Are these sockets the 
same as TCP/UDP ports (though different numbers) that are used in web servers, 
email etc?  If so, then I should be able to (1) locate where the port number is 
defined in the code, and (2) randomize the port number every time it's called 
to work around the issue.  What do you think?
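
If so, I should be able to catch the leftover socket between the two runs with
something like:

ss -tan | egrep 'LISTEN|TIME-WAIT'   # or: netstat -tan | grep TIME_WAIT

and see whether a port from the first run is still tied up.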

- Lee-Ping

On Sep 29, 2014, at 8:45 PM, Ralph Castain  wrote:

> I don't know anything about your application, or what the functions in your 
> code are doing. I imagine it's possible that you are trying to open 
> statically defined ports, which means that running the job again too soon 
> could leave the OS thinking the socket is already busy. It takes awhile for 
> the OS to release a socket resource.
> 
> 
> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang  wrote:
> 
>> Here's another data point that might be useful: The error message is much 
>> more rare if I run my application on 4 cores instead of 8.
>> 
>> Thanks,
>> 
>> - Lee-Ping
>> 
>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang  wrote:
>> 
>>> Sorry for my last email - I think I spoke too quick.  I realized after 
>>> reading some more documentation that OpenMPI always uses TCP sockets for 
>>> out-of-band communication, so it doesn't make sense for me to set 
>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem in 
>>> my application when running on a specific machine (Blue Waters compute 
>>> node); I don't see this problem on any other nodes.
>>> 
>>> When I run the same job (~5 seconds) in rapid succession, I see the 
>>> following error message on the second execution:
>>> 
>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
>>> ./qchem24825/
>>> MPIRUN in parallel.csh is 
>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>> P4_RSHCOMMAND in parallel.csh is ssh
>>> QCOUTFILE is stdout
>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>> initial socket setup ...start
>>> ---
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> ---
>>> --
>>> mpirun detected that one or more processes exited with non-zero status, 
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>   Process name: [[46773,1],0]
>>>   Exit code:255
>>> --
>>> 
>>> And here's the source code where the program is exiting (before "initial 
>>> socket setup ...done")
>>> 
>>> int GPICommSoc::init(MPI_Comm comm0) {
>>> 
>>> /* setup basic MPI information */
>>> init_comm(comm0);
>>> 
>>> MPI_Barrier(comm);
>>> /*-- start inisock and set serveradd[] array --*/
>>> if (me == 0) {
>>> fprintf(stdout,"initial socket setup ...start\n");
>>> fflush(stdout);
>>> }
>>> 
>>> // create the initial socket 
>>> inisock = new_server_socket(NULL,0);
>>> 
>>> // fill and gather the serveraddr array
>>> int szsock = sizeof(SOCKADDR);
>>> memset(&serveraddr[0],0, szsock*nproc);
>>> int iniport=get_sockport(inisock);
>>> set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
>>> //printsockaddr( serveraddr[me] );
>>> 
>>> SOCKADDR addrsend = serveraddr[me];
>>> MPI_Allgather(&addrsend,szsock,MPI_BYTE,
>>>   &serveraddr[0], szsock,MPI_BYTE, comm);
>>> if (me == 0) {
>>>fprintf(stdout,"initial socket setup ...done \n"
>>> );
>>> fflush(stdout);}
>>> 
>>> I didn't write this part of the program and I'm really a novice to MPI - 
>>> but it seems like the initial execution of the program isn't freeing up 
>>> some system resource as it should.  Is there something that needs to be 
>>> corrected in the co

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
Hi Ralph,

>>  If so, then I should be able to (1) locate where the port number is defined 
>> in the code, and (2) randomize the port number every time it's called to 
>> work around the issue.  What do you think?
> 
> That might work, depending on the code. I'm not sure what it is trying to 
> connect to, and if that code knows how to handle arbitrary connections


The main reason why Q-Chem is using MPI is for executing parallel tasks on a 
single node.  Thus, I think it's just the MPI ranks attempting to connect with 
each other on the same machine.  This could be off the mark because I'm still a 
novice with respect to MPI concepts - but I am sure it is just one machine.

> You might check about those warnings - could be that QCLOCALSCR and QCREF 
> need to be set for the code to work.

Thanks; I don't think these environment variables are the issue but I will 
check again.  The calculation runs without any problems on four different 
clusters (where I don't set these environment variables either), it's only 
broken on the Blue Waters compute node.  Also, the calculation runs without any 
problems the first time it's executed on the BW compute node - it's only 
subsequent executions that give the error messages.

Thanks,

- Lee-Ping

On Sep 30, 2014, at 11:05 AM, Ralph Castain  wrote:

> 
> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang  wrote:
> 
>> Hi Ralph,
>> 
>> Thank you.  I think your diagnosis is probably correct.  Are these sockets 
>> the same as TCP/UDP ports (though different numbers) that are used in web 
>> servers, email etc?
> 
> Yes
> 
>>  If so, then I should be able to (1) locate where the port number is defined 
>> in the code, and (2) randomize the port number every time it's called to 
>> work around the issue.  What do you think?
> 
> That might work, depending on the code. I'm not sure what it is trying to 
> connect to, and if that code knows how to handle arbitrary connections
> 
> You might check about those warnings - could be that QCLOCALSCR and QCREF 
> need to be set for the code to work.
> 
>> 
>> - Lee-Ping
>> 
>> On Sep 29, 2014, at 8:45 PM, Ralph Castain  wrote:
>> 
>>> I don't know anything about your application, or what the functions in your 
>>> code are doing. I imagine it's possible that you are trying to open 
>>> statically defined ports, which means that running the job again too soon 
>>> could leave the OS thinking the socket is already busy. It takes awhile for 
>>> the OS to release a socket resource.
>>> 
>>> 
>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang  wrote:
>>> 
>>>> Here's another data point that might be useful: The error message is much 
>>>> more rare if I run my application on 4 cores instead of 8.
>>>> 
>>>> Thanks,
>>>> 
>>>> - Lee-Ping
>>>> 
>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang  wrote:
>>>> 
>>>>> Sorry for my last email - I think I spoke too quick.  I realized after 
>>>>> reading some more documentation that OpenMPI always uses TCP sockets for 
>>>>> out-of-band communication, so it doesn't make sense for me to set 
>>>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem 
>>>>> in my application when running on a specific machine (Blue Waters compute 
>>>>> node); I don't see this problem on any other nodes.
>>>>> 
>>>>> When I run the same job (~5 seconds) in rapid succession, I see the 
>>>>> following error message on the second execution:
>>>>> 
>>>>> /tmp/leeping/opt/qchem-4.2/bin/parallel.csh,  , qcopt_reactants.in, 8, 0, 
>>>>> ./qchem24825/
>>>>> MPIRUN in parallel.csh is 
>>>>> /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
>>>>> P4_RSHCOMMAND in parallel.csh is ssh
>>>>> QCOUTFILE is stdout
>>>>> Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
>>>>> [nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
>>>>> [nid15081:24859] Warning: could not find environment variable "QCREF"
>>>>> initial socket setup ...start
>>>>> ---
>>>>> Primary job  terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> --

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Lee-Ping Wang
Hi Ralph,

Thanks.  I'll add some print statements to the code and try to figure out 
precisely where the failure is happening.

- Lee-Ping

On Sep 30, 2014, at 12:06 PM, Ralph Castain  wrote:

> 
> On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang  wrote:
> 
>> Hi Ralph,
>> 
>>>>  If so, then I should be able to (1) locate where the port number is 
>>>> defined in the code, and (2) randomize the port number every time it's 
>>>> called to work around the issue.  What do you think?
>>> 
>>> That might work, depending on the code. I'm not sure what it is trying to 
>>> connect to, and if that code knows how to handle arbitrary connections
>> 
>> 
>> The main reason why Q-Chem is using MPI is for executing parallel tasks on a 
>> single node.  Thus, I think it's just the MPI ranks attempting to connect 
>> with each other on the same machine.  This could be off the mark because I'm 
>> still a novice with respect to MPI concepts - but I am sure it is just one 
>> machine.
> 
> Your statement doesn't match what you sent us - you showed that it was your 
> connection code that was failing, not ours. You wouldn't have gotten that far 
> if our connections failed as you would have failed in MPI_Init. You are 
> clearly much further than that as you already passed an MPI_Barrier before 
> reaching the code in question.
> 
>> 
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>>> need to be set for the code to work.
>> 
>> Thanks; I don't think these environment variables are the issue but I will 
>> check again.  The calculation runs without any problems on four different 
>> clusters (where I don't set these environment variables either), it's only 
>> broken on the Blue Waters compute node.  Also, the calculation runs without 
>> any problems the first time it's executed on the BW compute node - it's only 
>> subsequent executions that give the error messages.
>> 
>> Thanks,
>> 
>> - Lee-Ping
>> 
>> On Sep 30, 2014, at 11:05 AM, Ralph Castain  wrote:
>> 
>>> 
>>> On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang  wrote:
>>> 
>>>> Hi Ralph,
>>>> 
>>>> Thank you.  I think your diagnosis is probably correct.  Are these sockets 
>>>> the same as TCP/UDP ports (though different numbers) that are used in web 
>>>> servers, email etc?
>>> 
>>> Yes
>>> 
>>>>  If so, then I should be able to (1) locate where the port number is 
>>>> defined in the code, and (2) randomize the port number every time it's 
>>>> called to work around the issue.  What do you think?
>>> 
>>> That might work, depending on the code. I'm not sure what it is trying to 
>>> connect to, and if that code knows how to handle arbitrary connections
>>> 
>>> You might check about those warnings - could be that QCLOCALSCR and QCREF 
>>> need to be set for the code to work.
>>> 
>>>> 
>>>> - Lee-Ping
>>>> 
>>>> On Sep 29, 2014, at 8:45 PM, Ralph Castain  wrote:
>>>> 
>>>>> I don't know anything about your application, or what the functions in 
>>>>> your code are doing. I imagine it's possible that you are trying to open 
>>>>> statically defined ports, which means that running the job again too soon 
>>>>> could leave the OS thinking the socket is already busy. It takes awhile 
>>>>> for the OS to release a socket resource.
>>>>> 
>>>>> 
>>>>> On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang  wrote:
>>>>> 
>>>>>> Here's another data point that might be useful: The error message is 
>>>>>> much more rare if I run my application on 4 cores instead of 8.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> - Lee-Ping
>>>>>> 
>>>>>> On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang  wrote:
>>>>>> 
>>>>>>> Sorry for my last email - I think I spoke too quick.  I realized after 
>>>>>>> reading some more documentation that OpenMPI always uses TCP sockets 
>>>>>>> for out-of-band communication, so it doesn't make sense for me to set 
>>>>>>> OMPI_MCA_oob=^tcp.  That said, I am still running into a strange 
>>>>>>> problem in my application when runnin

Re: [OMPI users] General question about running single-node jobs.

2014-10-02 Thread Lee-Ping Wang
Hi Ralph,



I've been troubleshooting this issue and communicating with Blue Waters
support.  It turns out that Q-Chem and OpenMPI are both trying to open
sockets, and I get different error messages depending on which one fails.  



As an aside, I don't know why Q-Chem needs sockets of its own to communicate
between ranks; shouldn't OpenMPI be taking care of all that?  (I'm
unfamiliar with this part of the Q-Chem code base; maybe it's trying to
duplicate some functionality?)



Blue Waters support has indicated that there's a problem with their
realm-specific IP addressing (RSIP) for the compute nodes, which they're
working on fixing.  I also tried running the same Q-Chem / OpenMPI job on a
management node, which I think has the same hardware (but not the RSIP), and
the problem went away.  So I think I'll shelve this problem for now, until
Blue Waters support gets back to me with the fix. :)



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang
Sent: Tuesday, September 30, 2014 1:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] General question about running single-node jobs.



Hi Ralph,



Thanks.  I'll add some print statements to the code and try to figure out
precisely where the failure is happening.



- Lee-Ping



On Sep 30, 2014, at 12:06 PM, Ralph Castain  wrote:







On Sep 30, 2014, at 11:19 AM, Lee-Ping Wang  wrote:





Hi Ralph,



 If so, then I should be able to (1) locate where the port number is defined
in the code, and (2) randomize the port number every time it's called to
work around the issue.  What do you think?



That might work, depending on the code. I'm not sure what it is trying to
connect to, and if that code knows how to handle arbitrary connections



The main reason why Q-Chem is using MPI is for executing parallel tasks on a
single node.  Thus, I think it's just the MPI ranks attempting to connect
with each other on the same machine.  This could be off the mark because I'm
still a novice with respect to MPI concepts - but I am sure it is just one
machine.



Your statement doesn't match what you sent us - you showed that it was your
connection code that was failing, not ours. You wouldn't have gotten that
far if our connections failed as you would have failed in MPI_Init. You are
clearly much further than that as you already passed an MPI_Barrier before
reaching the code in question.







You might check about those warnings - could be that QCLOCALSCR and QCREF
need to be set for the code to work.



Thanks; I don't think these environment variables are the issue but I will
check again.  The calculation runs without any problems on four different
clusters (where I don't set these environment variables either), it's only
broken on the Blue Waters compute node.  Also, the calculation runs without
any problems the first time it's executed on the BW compute node - it's only
subsequent executions that give the error messages.



Thanks,



- Lee-Ping



On Sep 30, 2014, at 11:05 AM, Ralph Castain  wrote:







On Sep 30, 2014, at 10:49 AM, Lee-Ping Wang  wrote:





Hi Ralph,



Thank you.  I think your diagnosis is probably correct.  Are these sockets
the same as TCP/UDP ports (though different numbers) that are used in web
servers, email etc?



Yes





 If so, then I should be able to (1) locate where the port number is defined
in the code, and (2) randomize the port number every time it's called to
work around the issue.  What do you think?



That might work, depending on the code. I'm not sure what it is trying to
connect to, and if that code knows how to handle arbitrary connections



You might check about those warnings - could be that QCLOCALSCR and QCREF
need to be set for the code to work.







- Lee-Ping



On Sep 29, 2014, at 8:45 PM, Ralph Castain  wrote:





I don't know anything about your application, or what the functions in your
code are doing. I imagine it's possible that you are trying to open
statically defined ports, which means that running the job again too soon
could leave the OS thinking the socket is already busy. It takes awhile for
the OS to release a socket resource.





On Sep 29, 2014, at 5:49 PM, Lee-Ping Wang  wrote:





Here's another data point that might be useful: The error message is much
more rare if I run my application on 4 cores instead of 8.



Thanks,



- Lee-Ping



On Sep 29, 2014, at 5:38 PM, Lee-Ping Wang  wrote:





Sorry for my last email - I think I spoke too quick.  I realized after
reading some more documentation that OpenMPI always uses TCP sockets for
out-of-band communication, so it doesn't make sense for me to set
OMPI_MCA_oob=^tcp.  That said, I am still running into a strange problem in
my application when running on a specific machine (Blue Waters compute
node); I don't see this problem on any other

Re: [OMPI users] General question about running single-node jobs.

2014-10-02 Thread Lee-Ping Wang
Hi Gus,

Thanks for the suggestions!  

I know that QCSCRATCH and QCLOCALSCR are not the problem.  When I set 
QCSCRATCH="." and unset QCLOCALSCR, it writes all the scratch files to the 
current directory, which is the behavior I want.  The environment variables are 
correctly passed on the mpirun command line.
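
In shell terms, the setup is roughly this (a simplified sketch, not the 
literal Q-Chem invocation):

export QCSCRATCH=.      # scratch files go to the current directory
unset QCLOCALSCR        # no node-local scratch directory
# The wrapper eventually calls mpirun; -x forwards an environment variable
# to the MPI ranks.
mpirun -np 8 -x QCSCRATCH ./qcprog.exe input.qcin ./scratch/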

Since my jobs have a fair bit of I/O, I make sure to change to the locally 
mounted /tmp folder before running the calculations.  I do have permission to 
write there.  

When I run jobs without OpenMPI they are stable on Blue Waters compute nodes, 
which suggests the issues are not due to the above.

I compiled Q-Chem from source, so I built OpenMPI 1.8.3 first and added 
$OMPI/bin to my PATH (and $OMPI/lib to LD_LIBRARY_PATH).  I configured 
the Q-Chem build so it properly uses "mpicc", etc.  The environment variables 
for OpenMPI are correctly set at runtime.
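
Concretely, the environment looks something like this ($OMPI below stands for 
my OpenMPI 1.8.3 install prefix; the exact path depends on the --prefix used 
at configure time):

#!/bin/bash
export OMPI=$HOME/opt/openmpi-1.8.3      # adjust to your own install prefix
# Make the OpenMPI compiler wrappers and runtime libraries visible to the
# Q-Chem build and to the job at runtime.
export PATH=$OMPI/bin:$PATH
export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH
# Sanity checks: both should resolve inside $OMPI/bin
which mpicc
which mpirun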

At this point, I think the main problem is a limitation on the networking in 
the compute nodes, and I believe Blue Waters support is currently working on 
this.  I'll make sure to send an update if anything happens.

- Lee-Ping

On Oct 2, 2014, at 12:09 PM, Gus Correa  wrote:

> 
> Hi Lee-Ping
> 
> Computational Chemistry is Greek to me.
> 
> However, on pp. 12 of the Q-Chem manual 3.2
> 
> (PDF online 
> http://www.q-chem.com/qchem-website/doc_for_web/qchem_manual_3.2.pdf)
> 
> there are explanations of the meaning of QCSCRATCH and
> QCLOCALSCR, etc., which, as Ralph pointed out, seem to be a sticking point,
> and showed up in the warning messages, which I enclose below.
> 
> QCLOCALSCR specifies a local disk for IO.
> I wonder if the node(s) is (are) diskless, and this might cause the problem.
> Another possibility is that mpiexec may not be passing these
> environment variables.
> (Do you pass them in the mpiexec/mpirun command line?)
> 
> 
> QCSCRATCH defines a directory for temporary files.
> If this is a network shared directory, could it be that some nodes
> are not mounting it correctly?
> Likewise, if your home directory or your job run directory are not
> mounted that could be a problem.
> Or maybe you don't have write permission (sometimes this
> happens in /tmp, specially if it is a ramdir/tmpdir, which may also have a 
> small size).
> 
> Your BlueWaters system administrator may be able to shed some light on these 
> things.
> 
> Also the Q-Chem manual says it is a pre-compiled executable,
> which as far as I know would require a matching version of OpenMPI.
> (Ralph, please correct me if I am wrong.).
> 
> However, you seem to have the source code, at least you sent a
> snippet of it. [With all those sockets being opened besides MPI ...]
> 
> Did you recompile with OpenMPI?
> Did you add $OMPI/bin to PATH and $OMPI/lib to LD_LIBRARY_PATH,
> and are these environment variables propagated to the job execution nodes 
> (especially those that are failing)?
> 
> 
> Anyway, just a bunch of guesses ...
> Gus Correa
> 
> *
> QCSCRATCH    Defines the directory in which Q-Chem will store temporary
> files.  Q-Chem will usually remove these files on successful completion of
> the job, but they can be saved, if so wished.  Therefore, QCSCRATCH should
> not reside in a directory that will be automatically removed at the end of
> a job, if the files are to be kept for further calculations.  Note that
> many of these files can be very large, and it should be ensured that the
> volume that contains this directory has sufficient disk space available.
> The QCSCRATCH directory should be periodically checked for scratch files
> remaining from abnormally terminated jobs.  QCSCRATCH defaults to the
> working directory if not explicitly set.  Please see section 2.6 for
> details on saving temporary files and consult your systems administrator.
> 
> QCLOCALSCR   On certain platforms, such as Linux clusters, it is sometimes
> preferable to write the temporary files to a disk local to the node.
> QCLOCALSCR specifies this directory.  The temporary files will be copied
> to QCSCRATCH at the end of the job, unless the job is terminated
> abnormally.  In such cases Q-Chem will attempt to remove the files in
> QCLOCALSCR, but may not be able to due to access restrictions.  Please
> specify this variable only if required.
> *
> 
> On 10/02/2014 02:08 PM, Lee-Ping Wang wrote:
>> Hi Ralph,
>> 
>> I’ve been troubleshooting this issue and communicating with Blue Waters
>> support.  It turns out that Q-Chem and OpenMPI are both trying to open
>> sockets, and I ge

[OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi there,



Recently, I've begun some calculations on a cluster where I submit a
multiple-node job to the Torque batch system, and the job executes multiple
single-node parallel tasks.  That is to say, these tasks are intended to use
OpenMPI parallelism on each node, but no parallelism across nodes.  



Some background: The actual program being executed is Q-Chem 4.0.  I use
OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile
and this is the last version of OpenMPI that this version of Q-Chem is known
to work with.



My jobs are failing with the error message below; I do not observe this
error when submitting single-node jobs.  From reading the mailing list
archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php),
I believe it is looking for a PBS node file somewhere.  Since my jobs are
only parallel over the node they're running on, I believe that a node file
of any kind is unnecessary.  



My question is: Why is OpenMPI behaving differently when I submit a
multi-node job compared to a single-node job?  How does OpenMPI detect that
it is running under a multi-node allocation?  Is there a way I can change
OpenMPI's behavior so it always thinks it's running on a single node,
regardless of the type of job I submit to the batch system?



Thank you,



-      Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford
University)



[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 153

[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 153

[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 153

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 87

[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 87

[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 87

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in
file base/ras_base_allocate.c at line 133

[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in
file base/ras_base_allocate.c at line 133

[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in
file base/ras_base_allocate.c at line 133

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in
file base/plm_base_launch_support.c at line 72

[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in
file base/plm_base_launch_support.c at line 72

[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in
file base/plm_base_launch_support.c at line 72

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in
file plm_tm_module.c at line 167

[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in
file plm_tm_module.c at line 167

[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in
file plm_tm_module.c at line 167



Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

Thank you for your reply.  I want to run MPI jobs inside a single node, but
due to the resource allocation policies on the clusters, I could get many
more resources if I submit multiple-node "batch jobs".  Once I have a
multiple-node batch job, I can use a command like "pbsdsh" to run
single-node MPI jobs on each node that is allocated to me.  Thus, the MPI
jobs on each node are running independently of each other and unaware of one
another.

The actual call to mpirun is nontrivial to get, because Q-Chem has a
complicated series of wrapper scripts that ultimately calls mpirun.  If the
jobs fail immediately, then I only have a small window to view the actual
command through "ps" or something.

Another option is for me to compile OpenMPI without Torque / PBS support.
If I do that, then it won't look for the node file anymore.  Is this
correct? 

I will try your suggestions and get back to you.  Thanks!

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 12:04 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping

I know nothing about Q-Chem, but I was confused by these sentences:

"That is to say, these tasks are intended to use OpenMPI parallelism on each
node, but no parallelism across nodes. "

"I do not observe this error when submitting single-node jobs."

"Since my jobs are only parallel over the node they're running on, I believe
that a node file of any kind is unnecessary. "

Are you trying to run MPI jobs across several nodes or inside a single node?

***

Anyway, as far as I know,
if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun
command will look for the $PBS_NODEFILE to learn in which node(s) it should
launch the MPI processes, regardless of whether you are using one node or
more than one node.

You didn't send your mpiexec command line (which would help), but assuming
that Q-Chem allows some level of standard mpiexec command options, you could
force passing the $PBS_NODEFILE to it.

Something like this (for two nodes with 8 cores each):

#PBS -q myqueue
#PBS -l nodes=2:ppn=8
#PBS -N myjob
cd $PBS_O_WORKDIR
ls -l $PBS_NODEFILE
cat $PBS_NODEFILE

mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable 

I hope this helps,
Gus Correa

On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:

> Hi there,
>  
> Recently, I've begun some calculations on a cluster where I submit a
multiple node job to the Torque batch system, and the job executes multiple
single-node parallel tasks.  That is to say, these tasks are intended to use
OpenMPI parallelism on each node, but no parallelism across nodes. 
>  
> Some background: The actual program being executed is Q-Chem 4.0.  I use
OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile
and this is the last known version of OpenMPI that this version of Q-Chem is
known to work with.
>  
> My jobs are failing with the error message below; I do not observe this
error when submitting single-node jobs.  From reading the mailing list
archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php),
I believe it is looking for a PBS node file somewhere.  Since my jobs are
only parallel over the node they're running on, I believe that a node file
of any kind is unnecessary. 
>  
> My question is: Why is OpenMPI behaving differently when I submit a
multi-node job compared to a single-node job?  How does OpenMPI detect that
it is running under a multi-node allocation?  Is there a way I can change
OpenMPI's behavior so it always thinks it's running on a single node,
regardless of the type of job I submit to the batch system?
>  
> Thank you,
>  
> -  Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford
University)
>  
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open 
> failure in file ras_tm_module.c at line 153 [compute-1-1.local:10909] 
> [[42009,0],0] ORTE_ERROR_LOG: File open failure in file 
> ras_tm_module.c at line 153 [compute-1-1.local:10911] [[42011,0],0] 
> ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153 
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open 
> failure in file ras_tm_module.c at line 87 [compute-1-1.local:10909] 
> [[42009,0],0] ORTE_ERROR_LOG: File open failure in file 
> ras_tm_module.c at line 87 [compute-1-1.local:10911] [[42011,0],0] 
> ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87 
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open 
> failure in file base/ras_base_allocate.c at line 133 
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open 
> failure in file base/ras_base_allocate.c at line 133 
> [compute

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

Thank you.  You gave me many helpful suggestions, which I will try out and
get back to you.  I will provide more specifics (e.g. how my jobs were
submitted) in a future email.  

As for the queue policy, that is a highly political issue because the
cluster is a shared resource.  My usual recourse is to use the batch system
as effectively as possible within the confines of their policies.  This is
why it makes sense to submit a single multiple-node batch job, which then
executes several independent single-node tasks.

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping
On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote:

> Hi Gus,
> 
> Thank you for your reply.  I want to run MPI jobs inside a single 
> node, but due to the resource allocation policies on the clusters, I 
> could get many more resources if I submit multiple-node "batch jobs".  
> Once I have a multiple-node batch job, then I can use a command like 
> "pbsdsh" to run single node MPI jobs on each node that is allocated to 
> me.  Thus, the MPI jobs on each node are running independently of each 
> other and unaware of one another.

Even if you use pbsdsh to launch separate MPI jobs on individual nodes, you
probably (not 100% sure about that) need to specify the -hostfile naming the
specific node that each job will run on.

Still quite confused because you didn't tell us what your "qsub" command looks
like, what Torque script (if any) it is launching, etc.

> 
> The actual call to mpirun is nontrivial to get, because Q-Chem has a 
> complicated series of wrapper scripts which ultimately calls mpirun.

Yes, I just found this out on the Web.  See my previous email.

> If the
> jobs are failing immediately, then I only have a small window to view 
> the actual command through "ps" or something.
> 

Are you launching the jobs interactively?  
I.e., with the -I switch to qsub?


> Another option is for me to compile OpenMPI without Torque / PBS support.
> If I do that, then it won't look for the node file anymore.  Is this 
> correct?

You will need to tell mpiexec where to launch the jobs.
If I understand what you are trying to achieve (and I am not sure I do), one
way to do it would be to programmatically split the $PBS_NODEFILE into
several hostfiles, one per MPI job (so to speak) that you want to launch.
Then use each of these nodefiles for each of the MPI jobs.
Note that the PBS_NODEFILE has one line per node per core, *not* one line
per node.
I have no idea how the trick above could be reconciled with the Q-Chem
scripts, though.
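
Something like this for the splitting (an untested sketch):

#!/bin/bash
cd $PBS_O_WORKDIR
# $PBS_NODEFILE has one line per node per core; make one hostfile per node.
for node in $(sort -u "$PBS_NODEFILE"); do
    grep -Fx "$node" "$PBS_NODEFILE" > hostfile.$node
done
# Then launch one single-node MPI job per node, each with its own hostfile:
# mpiexec -hostfile hostfile.$node -np $(wc -l < hostfile.$node) ./my-Q-chem-executable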

Overall, I don't understand why you would benefit from such a complicated
scheme, rather than lauching either a big MPI job across all nodes that you
requested (if the problem is large enough to benefit from  this many cores),
or launch several small single-node jobs (if the problem is small enough to
fit well a single node).

You may want to talk to the cluster managers, because there must be a way to
reconcile their queue policies with your needs (if this not already in
place).
We run tons of parallel single-node jobs here, for problems that fit well a
single node.


My two cents
Gus Correa

> 
> I will try your suggestions and get back to you.  Thanks!
> 
> - Lee-Ping
> 
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo
Correa
> Sent: Saturday, August 10, 2013 12:04 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error launching single-node tasks from
> multiple-node job.
> 
> Hi Lee-Ping
> 
> I know nothing about Q-Chem, but I was confused by these sentences:
> 
> "That is to say, these tasks are intended to use OpenMPI parallelism on
each
> node, but no parallelism across nodes. "
> 
> "I do not observe this error when submitting single-node jobs."
> 
> "Since my jobs are only parallel over the node they're running on, I
believe
> that a node file of any kind is unnecessary. "
> 
> Are you trying to run MPI jobs across several nodes or inside a single
node?
> 
> ***
> 
> Anyway, as far as I know,
> if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun
> command will look for the $PBS_NODEFILE to learn in which node(s) it
should
> launch the MPI processes, regardless of whether you are using one node or
> more than one node.
> 
> You didn't send your mpiexec command line (which would help), but assuming
> that Q-Chem allows some level of standard mpiexec command options, you
could
> force passing the $PBS_NODEFILE to it.
> 
> Something like this (

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

I tried your suggestions.  Here is the command line which executes mpirun.
I was puzzled because it still reported a file open failure, so I inserted a
print statement into ras_tm_module.c and recompiled.  The results are below.
As you can see, it tries to open a different file
(/scratch/leeping/272055.certainty.stanford.edu) than the one I specified
(/scratch/leeping/pbs_nodefile.compute-3-3.local).

- Lee-Ping

=== mpirun command line ===
/home/leeping/opt/openmpi-1.4.2-intel11-dbg/bin/mpirun -machinefile
/scratch/leeping/pbs_nodefile.compute-3-3.local -x HOME -x PWD -x QC -x
QCAUX -x QCCLEAN -x QCFILEPREF -x QCLOCALSCR -x QCPLATFORM -x QCREF -x QCRSH
-x QCRUNNAME -x QCSCRATCH 
   -np 24 /home/leeping/opt/qchem40/exe/qcprog.exe
.B.in.28642.qcin.1 ./qchem28642/ >>B.out

=== Error message from compute node ===
[compute-3-3.local:28666] Warning: could not find environment variable
"QCLOCALSCR"
[compute-3-3.local:28666] Warning: could not find environment variable
"QCREF"
[compute-3-3.local:28666] Warning: could not find environment variable
"QCRUNNAME"
Attempting to open /scratch/leeping/272055.certainty.stanford.edu
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 155
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 87
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file base/ras_base_allocate.c at line 133
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file base/plm_base_launch_support.c at line 72
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file plm_tm_module.c at line 167

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang
Sent: Saturday, August 10, 2013 12:51 PM
To: 'Open MPI Users'
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Gus,

Thank you.  You gave me many helpful suggestions, which I will try out and
get back to you.  I will provide more specifics (e.g. how my jobs were
submitted) in a future email.  

As for the queue policy, that is a highly political issue because the
cluster is a shared resource.  My usual recourse is to use the batch system
as effectively as possible within the confines of their policies.  This is
why it makes sense to submit a single multiple-node batch job, which then
executes several independent single-node tasks.

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping
On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote:

> Hi Gus,
> 
> Thank you for your reply.  I want to run MPI jobs inside a single 
> node, but due to the resource allocation policies on the clusters, I 
> could get many more resources if I submit multiple-node "batch jobs".
> Once I have a multiple-node batch job, then I can use a command like 
> "pbsdsh" to run single node MPI jobs on each node that is allocated to 
> me.  Thus, the MPI jobs on each node are running independently of each 
> other and unaware of one another.

Even if you use pbdsh to launch separate MPI jobs on individual nodes, you
probably (not 100% sure about that), probably need to specify he -hostfile
naming the specific node that each job will run on.

Still quite confused because you didn't tell how your "qsub" command looks
like, what Torque script (if any) it is launching, etc.

> 
> The actual call to mpirun is nontrivial to get, because Q-Chem has a 
> complicated series of wrapper scripts which ultimately calls mpirun.

Yes, I just found this out on the Web.  See my previous email.

> If the
> jobs are failing immediately, then I only have a small window to view 
> the actual command through "ps" or something.
> 

Are you launching the jobs interactively?  
I.e., with the -I switch to qsub?


> Another option is for me to compile OpenMPI without Torque / PBS support.
> If I do that, then it won't look for the node file anymore.  Is this 
> correct?

You will need to tell mpiexec where to launch the jobs.
If I understand what you are trying to achieve (and I am not sure I do), one
way to do it would be to programatically split the $PBS_NODEFILE into
several hostfiles, one per MPI job (so to speak) that you want to launch.
Then use each of these nodefiles for each of the MPI jobs.
Note that the PBS_NODEFILE has one line per-node-per-core, *not* one line
per node.
I have no idea how the trick above could be reconciled with the Q-Chem
scripts, though.

Overall, I don't understand why you would benefit from such a complicated
scheme, rath

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

It seems the calculation is now working, or at least it didn't crash.  I set
the PBS_JOBID environment variable to the name of my custom node file.  That
is to say, I set PBS_JOBID=pbs_nodefile.compute-3-3.local.  It appears that
ras_tm_module.c is trying to open the file located at
/scratch/leeping/$PBS_JOBID for some reason, and it is disregarding the
machinefile argument on the command line.

It'll be a few hours before I know for sure whether the job actually worked.
I still don't know why things are structured this way, however. 

Thanks,

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang
Sent: Saturday, August 10, 2013 3:07 PM
To: 'Open MPI Users'
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Gus,

I tried your suggestions.  Here is the command line which executes mpirun.
I was puzzled because it still reported a file open failure, so I inserted a
print statement into ras_tm_module.c and recompiled.  The results are below.
As you can see, it tries to open a different file
(/scratch/leeping/272055.certainty.stanford.edu) than the one I specified
(/scratch/leeping/pbs_nodefile.compute-3-3.local).

- Lee-Ping

=== mpirun command line ===
/home/leeping/opt/openmpi-1.4.2-intel11-dbg/bin/mpirun -machinefile
/scratch/leeping/pbs_nodefile.compute-3-3.local -x HOME -x PWD -x QC -x
QCAUX -x QCCLEAN -x QCFILEPREF -x QCLOCALSCR -x QCPLATFORM -x QCREF -x QCRSH
-x QCRUNNAME -x QCSCRATCH 
   -np 24 /home/leeping/opt/qchem40/exe/qcprog.exe
.B.in.28642.qcin.1 ./qchem28642/ >>B.out

=== Error message from compute node ===
[compute-3-3.local:28666] Warning: could not find environment variable
"QCLOCALSCR"
[compute-3-3.local:28666] Warning: could not find environment variable
"QCREF"
[compute-3-3.local:28666] Warning: could not find environment variable
"QCRUNNAME"
Attempting to open /scratch/leeping/272055.certainty.stanford.edu
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file ras_tm_module.c at line 155 [compute-3-3.local:28666] [[56726,0],0]
ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in
file base/ras_base_allocate.c at line 133 [compute-3-3.local:28666]
[[56726,0],0] ORTE_ERROR_LOG: File open failure in file
base/plm_base_launch_support.c at line 72 [compute-3-3.local:28666]
[[56726,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at
line 167

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lee-Ping Wang
Sent: Saturday, August 10, 2013 12:51 PM
To: 'Open MPI Users'
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Gus,

Thank you.  You gave me many helpful suggestions, which I will try out and
get back to you.  I will provide more specifics (e.g. how my jobs were
submitted) in a future email.  

As for the queue policy, that is a highly political issue because the
cluster is a shared resource.  My usual recourse is to use the batch system
as effectively as possible within the confines of their policies.  This is
why it makes sense to submit a single multiple-node batch job, which then
executes several independent single-node tasks.

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping
On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote:

> Hi Gus,
> 
> Thank you for your reply.  I want to run MPI jobs inside a single 
> node, but due to the resource allocation policies on the clusters, I 
> could get many more resources if I submit multiple-node "batch jobs".
> Once I have a multiple-node batch job, then I can use a command like 
> "pbsdsh" to run single node MPI jobs on each node that is allocated to 
> me.  Thus, the MPI jobs on each node are running independently of each 
> other and unaware of one another.

Even if you use pbdsh to launch separate MPI jobs on individual nodes, you
probably (not 100% sure about that), probably need to specify he -hostfile
naming the specific node that each job will run on.

Still quite confused because you didn't tell how your "qsub" command looks
like, what Torque script (if any) it is launching, etc.

> 
> The actual call to mpirun is nontrivial to get, because Q-Chem has a 
> complicated series of wrapper scripts which ultimately calls mpirun.

Yes, I just found this out on the Web.  See my previous email.

> If the
> jobs are failing immediately, then I only have a small window to view 
> the actual command through "ps" or something.
> 

Are yo

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

I agree that $PBS_JOBID should not point to a file in normal situations,
because it is the job identifier given by the scheduler.  However,
ras_tm_module.c actually does search for a file named $PBS_JOBID, and that
seems to be why it was failing.  You can see this in the source code as well
(look at ras_tm_module.c, I uploaded it to
https://dl.dropboxusercontent.com/u/5381783/ras_tm_module.c ).  Once I
changed the $PBS_JOBID environment variable to the name of the node file,
things seemed to work - though I agree, it's not very logical.  

I doubt Q-Chem is causing the issue, because I was able to "fix" things by
changing $PBS_JOBID before Q-Chem is called.  Also, I provided the command
line to mpirun in a previous email, where the -machinefile argument
correctly points to the custom machine file that I created.  The missing
environment variables should not matter.

The PBS_NODEFILE created by Torque is
/opt/torque/aux//272139.certainty.stanford.edu and it never gets touched.  I
followed the advice in your earlier email and I created my own node file on
each node called /scratch/leeping/pbs_nodefile.$HOSTNAME, and I set
PBS_NODEFILE to point to this file.  However, this file does not get used
either, even if I include it on the mpirun command line, unless I set
PBS_JOBID to the file name.  

Finally, I was not able to build OpenMPI 1.4.2 without PBS support.  I used
the configure flag --without-rte-support, but the build failed halfway
through.

Thanks,

- Lee-Ping

leeping@certainty-a:~/temp$ qsub -I -q debug -l walltime=1:00:00 -l
nodes=1:ppn=12
qsub: waiting for job 272139.certainty.stanford.edu to start
qsub: job 272139.certainty.stanford.edu ready

leeping@compute-140-4:~$ echo $PBS_NODEFILE 
/opt/torque/aux//272139.certainty.stanford.edu

leeping@compute-140-4:~$ cat $PBS_NODEFILE
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4
compute-140-4

leeping@compute-140-4:~$ echo $PBS_JOBID
272139.certainty.stanford.edu

leeping@compute-140-4:~$ cat $PBS_JOBID
cat: 272139.certainty.stanford.edu: No such file or directory

leeping@compute-140-4:~$ env | grep PBS
PBS_VERSION=TORQUE-2.5.3
PBS_JOBNAME=STDIN
PBS_ENVIRONMENT=PBS_INTERACTIVE
PBS_O_WORKDIR=/home/leeping/temp
PBS_TASKNUM=1
PBS_O_HOME=/home/leeping
PBS_MOMPORT=15003
PBS_O_QUEUE=debug
PBS_O_LOGNAME=leeping
PBS_O_LANG=en_US.iso885915
PBS_JOBCOOKIE=A27B00DAF72024CBEBB7CD3752BDBADC
PBS_NODENUM=0
PBS_NUM_NODES=1
PBS_O_SHELL=/bin/bash
PBS_SERVER=certainty.stanford.edu
PBS_JOBID=272139.certainty.stanford.edu
PBS_O_HOST=certainty-a.local
PBS_VNODENUM=0
PBS_QUEUE=debug
PBS_O_MAIL=/var/spool/mail/leeping
PBS_NUM_PPN=12
PBS_NODEFILE=/opt/torque/aux//272139.certainty.stanford.edu
PBS_O_PATH=/opt/intel/Compiler/11.1/064/bin/intel64:/opt/intel/Compiler/11.1
/064/bin/intel64:/usr/local/cuda/bin:/home/leeping/opt/psi-4.0b5/bin:/home/l
eeping/opt/tinker/bin:/home/leeping/opt/cctools/bin:/home/leeping/bin:/home/
leeping/local/bin:/home/leeping/opt/bin:/usr/kerberos/bin:/usr/java/latest/b
in:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/open
mpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/
rocks/sbin:/opt/sun-ct/bin:/home/leeping/bin

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 3:58 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Lee-Ping

Something looks amiss.
PBS_JOBID contains the job name.
PBS_NODEFILE contains a list (with repetitions up to the number of cores) of
the nodes that torque assigned to the job.

Why things get twisted it is hard to tell, it may be something in the Q-Chem
scripts (could it be mixing up PBS_JOBID and PBS_NODEFILE?), it may be
something else.
A more remote possibility is if the cluster has a Torque qsub wrapper that
may perhaps produce the aforementioned confusion.  Unlikely, but possible.

To sort this out, run any simple job (mpiexec -np 32 hostname), or even your
very Q-Chem job, but precede it with a bunch of printouts of the PBS
environment variables:
echo $PBS_JOBID
echo $PBS_NODEFILE
ls -l $PBS_NODEFILE
cat $PBS_NODEFILE
cat $PBS_JOBID [this one should fail, because that is not a file, but may
work if the PBS variables were messed up along the way]

I hope this helps,
Gus Correa

On Aug 10, 2013, at 6:39 PM, Lee-Ping Wang wrote:

> Hi Gus,
> 
> It seems the calculation is now working, or at least it didn't crash.  
> I set the PBS_JOBID environment variable to the name of my custom node 
> file.  That is to say, I set PBS_JOBID=pbs_nodefile.compute-3-3.local.  
> It appears that ras_tm_module.c is trying to open the file located at 
> /scratch/leeping/$PBS_JOBID for some reason, and it is disregarding 
> the machinefile argument on the command line.
> 
> It'll be a few hours

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Ralph,

Thank you.  I didn't know that "--without-tm" was the correct configure
option.  I built and reinstalled OpenMPI 1.4.2, and now I no longer need to
set PBS_JOBID for it to recognize the correct machine file.  My current
workflow is:

1) Submit a multiple-node batch job. 
2) Launch a separate process on each node with "pbsdsh".
3) On each node, create a file called
/scratch/leeping/pbs_nodefile.$HOSTNAME which contains 24 instances of the
hostname (since there are 24 cores).
4) Set PBS_NODEFILE=/scratch/leeping/pbs_nodefile.$HOSTNAME.
5) In the Q-Chem wrapper script, make sure mpirun is called with the command
line argument: -machinefile $PBS_NODEFILE
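
In script form, the workflow looks roughly like this (simplified; the real
job goes through the Q-Chem wrapper scripts rather than calling mpirun
directly, and the node count is just an example):

#!/bin/bash
#PBS -l nodes=4:ppn=24
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR

# Run one copy of pernode.sh on each allocated node.
pbsdsh -u /bin/bash $PBS_O_WORKDIR/pernode.sh

# --- pernode.sh (executed once per node) ---
# HOST=$(hostname)
# NODEFILE=/scratch/leeping/pbs_nodefile.$HOST
# for i in $(seq 24); do echo $HOST; done > $NODEFILE   # 24 cores per node
# export PBS_NODEFILE=$NODEFILE
# (launch the Q-Chem wrapper here; it ends up calling
#  mpirun -machinefile $PBS_NODEFILE -np 24 ...)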

Everything seems to work, thanks to your help and Gus.  I might report back
if the jobs fail halfway through or if there is no speedup, but for now
everything seems to be in place.

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Saturday, August 10, 2013 4:28 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

It helps if you use the correct configure option: --without-tm

Regardless, you can always deselect Torque support at runtime. Just put the
following in your environment:

OMPI_MCA_ras=^tm

That will tell ORTE to ignore the Torque allocation module and it should
then look at the machinefile.
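
In other words, either of these (quick sketch):

# Build time: leave Torque support out entirely
./configure --without-tm --prefix=$HOME/opt/openmpi-1.4.2
make && make install

# Run time: keep the tm-enabled build, but tell ORTE to skip the Torque
# allocation module so it reads the machinefile instead
export OMPI_MCA_ras=^tm
mpirun -machinefile my_hostfile -np 24 ./a.out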


On Aug 10, 2013, at 4:18 PM, "Lee-Ping Wang"  wrote:

> Hi Gus,
> 
> I agree that $PBS_JOBID should not point to a file in normal 
> situations, because it is the job identifier given by the scheduler.  
> However, ras_tm_module.c actually does search for a file named 
> $PBS_JOBID, and that seems to be why it was failing.  You can see this 
> in the source code as well (look at ras_tm_module.c, I uploaded it to 
> https://dl.dropboxusercontent.com/u/5381783/ras_tm_module.c ).  Once I 
> changed the $PBS_JOBID environment variable to the name of the node 
> file, things seemed to work - though I agree, it's not very logical.
> 
> I doubt Q-Chem is causing the issue, because I was able to "fix" 
> things by changing $PBS_JOBID before Q-Chem is called.  Also, I 
> provided the command line to mpirun in a previous email, where the 
> -machinefile argument correctly points to the custom machine file that 
> I created.  The missing environment variables should not matter.
> 
> The PBS_NODEFILE created by Torque is
> /opt/torque/aux//272139.certainty.stanford.edu and it never gets 
> touched.  I followed the advice in your earlier email and I created my 
> own node file on each node called 
> /scratch/leeping/pbs_nodefile.$HOSTNAME, and I set PBS_NODEFILE to 
> point to this file.  However, this file does not get used either, even 
> if I include it on the mpirun command line, unless I set PBS_JOBID to the
file name.
> 
> Finally, I was not able to build OpenMPI 1.4.2 without pbs support.  I 
> used the configure flag --without-rte-support, but the build failed 
> halfway through.
> 
> Thanks,
> 
> - Lee-Ping
> 
> leeping@certainty-a:~/temp$ qsub -I -q debug -l walltime=1:00:00 -l
> nodes=1:ppn=12
> qsub: waiting for job 272139.certainty.stanford.edu to start
> qsub: job 272139.certainty.stanford.edu ready
> 
> leeping@compute-140-4:~$ echo $PBS_NODEFILE 
> /opt/torque/aux//272139.certainty.stanford.edu
> 
> leeping@compute-140-4:~$ cat $PBS_NODEFILE
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> compute-140-4
> 
> leeping@compute-140-4:~$ echo $PBS_JOBID 272139.certainty.stanford.edu
> 
> leeping@compute-140-4:~$ cat $PBS_JOBID
> cat: 272139.certainty.stanford.edu: No such file or directory
> 
> leeping@compute-140-4:~$ env | grep PBS
> PBS_VERSION=TORQUE-2.5.3
> PBS_JOBNAME=STDIN
> PBS_ENVIRONMENT=PBS_INTERACTIVE
> PBS_O_WORKDIR=/home/leeping/temp
> PBS_TASKNUM=1
> PBS_O_HOME=/home/leeping
> PBS_MOMPORT=15003
> PBS_O_QUEUE=debug
> PBS_O_LOGNAME=leeping
> PBS_O_LANG=en_US.iso885915
> PBS_JOBCOOKIE=A27B00DAF72024CBEBB7CD3752BDBADC
> PBS_NODENUM=0
> PBS_NUM_NODES=1
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=certainty.stanford.edu
> PBS_JOBID=272139.certainty.stanford.edu
> PBS_O_HOST=certainty-a.local
> PBS_VNODENUM=0
> PBS_QUEUE=debug
> PBS_O_MAIL=/var/spool/mail/leeping
> PBS_NUM_PPN=12
> PBS_NODEFILE=/opt/torque/aux//272139.certainty.stanford.edu
> PBS_O_PATH=/opt/intel/Compiler/11.1/064/bin/intel64:/opt/intel/Compile
> r/11.1 
> /064/bin/intel64:/usr/local/cuda/bin:/home/leeping/opt/psi-4.0b5/bin:/
> home/l 
> eeping/opt/tinker/bin:/home/leeping/opt/cctools/bin:/home/leeping/bin:
> /home/ 
&g

Re: [OMPI users] Error launching single-node tasks from multiple-node job.

2013-08-10 Thread Lee-Ping Wang
Hi Gus,

I think your suggestion sounds good.  I'll leave the PBS_NODEFILE intact.
Thank you again for your assistance!

- Lee-Ping

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 5:36 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping

Yes, configuring --without-tm, as Ralph told you to do, will make your
OpenMPI independent from Torque, although as Ralph said, even with an Open
MPI configured with Torque support you can override it at runtime.

I don't know what Open MPI uses the PBS_JOBID for, maybe some internal
check, but I would guess it will eventually use the PBS_NODEFILE as the list
of nodes that is passed to mpiexec under the hood.

I would just do your steps 3 and 4 below slightly different.
I don't think you should change the PBS_NODEFILE environment variable, as
Torque may use it for other purposes (say, keep track of the nodes in use,
etc).
[You may not be able to change it, but I haven't tried to.]

My suggestion is:

3&4)  In the Q-Chem wrapper script, make sure mpirun is called with the
comman line argument: -machinefile /scratch/leeping/pbs_nodefile.$HOSTNAME 

This will leave the PBS_NODEFILE variable intact, and have the same net
effect as your workflow.

Anyway, congratulations for sorting things out and making it work!

Gus Correa

On Aug 10, 2013, at 7:40 PM, Lee-Ping Wang wrote:

> Hi Ralph,
> 
> Thank you.  I didn't know that "--without-tm" was the correct 
> configure option.  I built and reinstalled OpenMPI 1.4.2, and now I no 
> longer need to set PBS_JOBID for it to recognize the correct machine 
> file.  My current workflow is:
> 
> 1) Submit a multiple-node batch job. 
> 2) Launch a separate process on each node with "pbsdsh".
> 2) On each node, create a file called
> /scratch/leeping/pbs_nodefile.$HOSTNAME which contains 24 instances of 
> the hostname (since there are 24 cores).
> 3) Set $PBS_NODEFILE=/scratch/leeping/pbs_nodefile.$HOSTNAME.
> 4) In the Q-Chem wrapper script, make sure mpirun is called with the 
> command line argument: -machinefile $PBS_NODEFILE
> 
> Everything seems to work, thanks to your help and Gus.  I might report 
> back if the jobs fail halfway through or if there is no speedup, but 
> for now everything seems to be in place.
> 
> - Lee-Ping
> 
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph 
> Castain
> Sent: Saturday, August 10, 2013 4:28 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error launching single-node tasks from 
> multiple-node job.
> 
> It helps if you use the correct configure option: --without-tm
> 
> Regardless, you can always deselect Torque support at runtime. Just 
> put the following in your environment:
> 
> OMPI_MCA_ras=^tm
> 
> That will tell ORTE to ignore the Torque allocation module and it 
> should then look at the machinefile.
> 
> 
> On Aug 10, 2013, at 4:18 PM, "Lee-Ping Wang"  wrote:
> 
>> Hi Gus,
>> 
>> I agree that $PBS_JOBID should not point to a file in normal 
>> situations, because it is the job identifier given by the scheduler.
>> However, ras_tm_module.c actually does search for a file named 
>> $PBS_JOBID, and that seems to be why it was failing.  You can see 
>> this in the source code as well (look at ras_tm_module.c, I uploaded 
>> it to https://dl.dropboxusercontent.com/u/5381783/ras_tm_module.c ).  
>> Once I changed the $PBS_JOBID environment variable to the name of the 
>> node file, things seemed to work - though I agree, it's not very logical.
>> 
>> I doubt Q-Chem is causing the issue, because I was able to "fix" 
>> things by changing $PBS_JOBID before Q-Chem is called.  Also, I 
>> provided the command line to mpirun in a previous email, where the 
>> -machinefile argument correctly points to the custom machine file 
>> that I created.  The missing environment variables should not matter.
>> 
>> The PBS_NODEFILE created by Torque is 
>> /opt/torque/aux//272139.certainty.stanford.edu and it never gets 
>> touched.  I followed the advice in your earlier email and I created 
>> my own node file on each node called 
>> /scratch/leeping/pbs_nodefile.$HOSTNAME, and I set PBS_NODEFILE to 
>> point to this file.  However, this file does not get used either, 
>> even if I include it on the mpirun command line, unless I set 
>> PBS_JOBID to the
> file name.
>> 
>> Finally, I was not able to build OpenMPI 1.4.2 without pbs support.  
>> I used the configure flag --without-rte-support, but the build failed 
>> halfway throu

[OMPI users] Changing directory from /tmp

2013-09-04 Thread Lee-Ping Wang
Hi there,

On a few clusters I am running into an issue where a temporary directory cannot 
be created due to the root filesystem being full, causing mpirun to crash.  
Would it be possible to change the location where this directory is being 
created?

[compute-109-4.local:12055] opal_os_dirpath_create: Error: Unable to create the 
sub-directory (/tmp/openmpi-sessions-leeping@compute-109-4.local_0) of 
(/tmp/openmpi-sessions-leeping@compute-109-4.local_0/28512/0/0), mkdir failed 
[1]

Thanks,

- Lee-Ping



Re: [OMPI users] Changing directory from /tmp

2013-09-04 Thread Lee-Ping Wang

Hi everyone,

Thanks for the help!  As Gus pointed out, I could also have found the 
answer from the FAQ but it might have taken me longer to find.
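
For the record, the options that came up in this thread for moving the 
session directory off /tmp are, roughly:

# Make sure the target directory exists and is writable
mkdir -p /scratch/$USER/ompi-tmp

# 1) Environment variable (Ralph's suggestion)
export OMPI_TMPDIR=/scratch/$USER/ompi-tmp

# 2) MCA parameter on the mpirun command line
mpirun -mca orte_tmpdir_base /scratch/$USER/ompi-tmp -np 8 ./a.out

# 3) mpirun's own option (Reuti's suggestion)
mpirun --tmpdir /scratch/$USER/ompi-tmp -np 8 ./a.out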


- Lee-Ping

On 09/04/2013 10:29 AM, Ralph Castain wrote:

Yep - sure would. Lots of ways to specify it :-)


On Sep 4, 2013, at 10:24 AM, Reuti  wrote:


Hi,

Am 04.09.2013 um 19:21 schrieb Ralph Castain:


you can specify it with OMPI_TMPDIR in your environment, or "-mca orte_tmpdir_base 
" on your cmd line

Wouldn't --tmpdir=... do the same with `mpirun` as the latter way you 
mentioned?

-- Reuti



On Sep 4, 2013, at 10:13 AM, Lee-Ping Wang  wrote:


Hi there,

On a few clusters I am running into an issue where a temporary directory cannot 
be created due to the root filesystem being full, causing mpirun to crash.  
Would it be possible to change the location where this directory is being 
created?

[compute-109-4.local:12055] opal_os_dirpath_create: Error: Unable to create the 
sub-directory (/tmp/openmpi-sessions-leeping@compute-109-4.local_0) of 
(/tmp/openmpi-sessions-leeping@compute-109-4.local_0/28512/0/0), mkdir failed 
[1]

Thanks,

- Lee-Ping
