[OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
Hi there,

I'm running into an issue where mpirun isn't terminating when my executable
has a nonzero exit status - instead it's hanging indefinitely. I'm
attaching my process tree, the error message from the application, and the
messages printed to stderr. Please let me know what I can do.

Thanks,

- Lee-Ping



=== Process Tree ===
leeping@vsp-compute-13:~$ ps xjf
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
31969 31977 31969 31969 ?           -1 S    48618   0:00 sshd: leeping@pts/1
31977 31978 31978 31978 pts/1    32038 Ss   48618   0:00  \_ -bash
31978 32038 32038 31978 pts/1    32038 R+   48618   0:00      \_ ps xjf
23667 29307 29307 29307 ?           -1 Ss   48618   0:00 /bin/bash /home/leeping/temp/leeping-workers/10276/worker1.sh
29307 29308 29307 29307 ?           -1 S    48618   0:00  \_ /bin/bash /home/leeping/temp/leeping-workers/10276/worker2.sh
29308 29425 29307 29307 ?           -1 S    48618   0:00      \_ ./work_queue_worker -d all --cores 6 -t 86400s localhost 9876
29425 26245 26245 29307 ?           -1 S    48618   0:00      |   \_ sh -c optimize-geometry.py initial.xyz --method b3lyp --basis "6-31g(d)" --charge 0 --mult 1 &> optimize.log
26245 26246 26245 29307 ?           -1 Sl   48618   0:01      |       \_ /home/leeping/local/bin/python /home/leeping/local/bin/optimize-geometry.py initial.xyz --method b3lyp --basis 6-31g(d) --charge 0 --mult 1
26246 27834 26245 29307 ?           -1 S    48618   0:00      |           \_ /bin/sh -c qchem42 -np 6 -save optimize.in optimize.out optimize.d 2> optimize.err
27834 27835 26245 29307 ?           -1 S    48618   0:00      |               \_ /bin/bash /home/leeping/opt/bin/qchem42 -np 6 -save optimize.in optimize.out optimize.d
27835 27897 26245 29307 ?           -1 S    48618   0:00      |                   \_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/qchem -np 6 -nt 1 -save optimize.in optimize.out optimize.d
27897 27921 26245 29307 ?           -1 S    48618   0:00      |                       \_ /bin/csh -f /opt/scratch/leeping/opt/qchem-4.2/bin/parallel.csh optimize.in 6 0 ./optimize.d/ 27897
27921 27926 26245 29307 ?           -1 Sl   48618   0:00      |                           \_ /opt/scratch/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun -np 6 /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe .optimize.in.27897.qcin.1 ./optimize.d/


=== Application Error Message ===
100  -843.2762335150  5.69E-08  0  Convergence failure

Q-Chem fatal error occurred in module /home/leeping/src/qchem/scfman/scfman.C, line 4377:

SCF failed to converge

Sat Sep 20 23:57:37 2014

=== Standard error ===
leeping@vsp-compute-13:/opt/scratch/leeping/worker-48618-29425/t.62$ cat optimize.err
[vsp-compute-13:27929] *** Process received signal ***
[vsp-compute-13:27929] Signal: Aborted (6)
[vsp-compute-13:27929] Signal code:  (-6)
[vsp-compute-13:27932] *** Process received signal ***
[vsp-compute-13:27932] Signal: Aborted (6)
[vsp-compute-13:27932] Signal code:  (-6)
[vsp-compute-13:27934] *** Process received signal ***
[vsp-compute-13:27934] Signal: Aborted (6)
[vsp-compute-13:27934] Signal code:  (-6)
[vsp-compute-13:27928] *** Process received signal ***
[vsp-compute-13:27928] Signal: Aborted (6)
[vsp-compute-13:27928] Signal code:  (-6)
[vsp-compute-13:27936] *** Process received signal ***
[vsp-compute-13:27936] Signal: Aborted (6)
[vsp-compute-13:27936] Signal code:  (-6)
[vsp-compute-13:27930] *** Process received signal ***
[vsp-compute-13:27930] Signal: Aborted (6)
[vsp-compute-13:27930] Signal code:  (-6)
[vsp-compute-13:27932] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]
[vsp-compute-13:27932] [ 1] [vsp-compute-13:27928] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]
[vsp-compute-13:27928] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]
[vsp-compute-13:27928] [ 2] [vsp-compute-13:27934] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]
[vsp-compute-13:27934] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]
[vsp-compute-13:27934] [ 2] [vsp-compute-13:27929] [ 0] /lib64/libpthread.so.0[0x3464c0eb70]
[vsp-compute-13:27929] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]
[vsp-compute-13:27929] [ 2] /lib64/libc.so.6(gsignal+0x35)[0x3464430265]
[vsp-compute-13:27932] [ 2] /lib64/libc.so.6(abort+0x110)[0x3464431d10]
[vsp-compute-13:27932] [ 3] /lib64/libc.so.6(abort+0x110)[0x3464431d10]
[vsp-compute-13:27928] [ 3] /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]
[vsp-compute-13:27928] [ 4] /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0x41a0cf5]
[vsp-compute-13:27928] [ 5] /lib64/libc.so.6(abort+0x110)[0x3464431d10]
[vsp-compute-13:27934] [ 3] /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]
[vsp-compute-13:27934] [ 4] /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0x41a0cf5]
[vsp-compute-13:27934] [ 5] /lib64/libc.so.6(abort+0x110)[0x3464431d10]
[vsp-compute-13:27929] [ 3] /opt/scratch/leeping/opt/qchem-4.2/exe/qcprog.exe[0xc304ca6]
[vsp-compute-13:27929] [ 4] /opt/scratch/leeping/opt/qchem-4

Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
Can you please tell us what version of OMPI you are using?


On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang  wrote:

> Hi there,
>  
> I’m running into an issue where mpirun isn’t terminating when my executable 
> has a nonzero exit status – instead it’s hanging indefinitely.  I’m attaching 
> my process tree, the error message from the application, and the messages 
> printed to stderr.   Please let me know what I can do.

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
I'm using version 1.8.1.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 6:56 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Can you please tell us what version of OMPI you are using?






Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
Just to be clear: is your program returning a non-zero status and then exiting, 
or is it segfaulting?


On Sep 21, 2014, at 8:22 AM, Lee-Ping Wang  wrote:

> I’m using version 1.8.1.
>  
> Thanks,
>  
> -  Lee-Ping

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
My program isn't segfaulting - it's returning a non-zero status and then
exiting.



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 8:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Just to be clear: is your program returning a non-zero status and then
exiting, or is it segfaulting?






Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
Thanks - I asked because the output you sent shows a bunch of segfault reports.
I'll investigate the non-zero status question.

On Sep 21, 2014, at 10:02 AM, Lee-Ping Wang  wrote:

> My program isn’t segfaulting – it’s returning a non-zero status and then
> exiting.
>  
> Thanks,
>  
> -  Lee-Ping
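
A note on telling the two cases apart: the stderr output in the first message reports "Signal: Aborted (6)" and the backtraces contain abort() frames, which is what termination by SIGABRT looks like, as opposed to a process returning a non-zero status. The following is a minimal Python sketch, not taken from the thread, of how a driver script could distinguish the two from a child's return code on POSIX. The names describe_returncode and run_and_describe are hypothetical, and the two child commands are stand-ins, not Q-Chem or mpirun.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Classify a return code as reported by Python's subprocess module on POSIX."""
    if returncode == 0:
        return "exited normally (status 0)"
    if returncode > 0:
        return f"exited with non-zero status {returncode}"
    # subprocess reports -N when the child was terminated by signal N.
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


def run_and_describe(argv) -> str:
    """Run a command, discard its output, and describe how it ended."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return describe_returncode(result.returncode)


if __name__ == "__main__":
    # Stand-in for a program that returns a non-zero status and exits.
    print(run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"]))
    # Stand-in for a program that calls abort(), which raises SIGABRT.
    print(run_and_describe([sys.executable, "-c", "import os; os.abort()"]))
```

If the child is launched through a shell instead, the shell's own status for a signal-terminated command is typically 128 plus the signal number (bash does this, for example), so a driver that goes through `sh -c` would need to check for that form as well.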

Re: [OMPI users] Process is hanging

2014-09-21 Thread Lee-Ping Wang
Hmm, I didn't know those were segfault reports.  It could indeed be a
segfault if the code isn't exiting properly - but the code really is trying
to exit there with the "SCF failed to converge" error.  Thanks for the help!



Thanks,



-  Lee-Ping



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, September 21, 2014 11:49 AM
To: Open MPI Users
Subject: Re: [OMPI users] Process is hanging



Thanks - I asked because the output you sent shows a bunch of segfault
reports. I'll investigate the non-zero status question
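
Until the cause of the hang is known, one possible stopgap on the calling side is to put a deadline on the job and kill its whole process group once the deadline passes. The sketch below is an illustration under stated assumptions, not something proposed in the thread and not a fix for mpirun itself: run_with_deadline, timeout_s and grace_s are hypothetical names, and the approach only reaches processes that stay in the process group created for the child. In the process tree from the first message, everything from `sh -c optimize-geometry.py` down to mpirun shares PGID 26245, which is the kind of grouping this relies on.

```python
import os
import signal
import subprocess
import sys


def run_with_deadline(argv, timeout_s, grace_s=5.0):
    """Run argv in a new session; if it outlives timeout_s, kill its process group.

    Returns the child's return code (negative if it was ended by a signal).
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its PID doubles as the process-group ID.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the whole group to terminate first.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            return proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Still alive after the grace period: force-kill the group.
            os.killpg(proc.pid, signal.SIGKILL)
            return proc.wait()


if __name__ == "__main__":
    # Stand-in for a hung job: sleeps far longer than the 1-second deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    print("return code:", run_with_deadline(hung, timeout_s=1))
```

A driver script could call this in place of a plain blocking launch, with a deadline chosen by the caller. Descendants that move themselves into a different process group would not be signalled, so whether this is enough for a given launcher would need to be checked.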


