Re: [OMPI users] mpi send/recv pair hanging

2018-04-06 Thread Noam Bernstein

> On Apr 6, 2018, at 1:41 PM, George Bosilca  wrote:
> 
> Noam,
> 
> According to your stack trace, the correct way to call mca_pml_ob1_dump is 
> with the communicator from the PMPI call. Thus, this call was successful:
> 
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
> 
> I should have been clearer: the output does not appear in gdb but on the output 
> stream of your application. If you run your application by hand with mpirun, 
> the output should be on the terminal where you started mpirun. If you start 
> your job with a batch scheduler, the output should be in the output file 
> associated with your job.
> 

OK, that makes sense.  Here’s what I get from the two relevant processes.  
compute-1-9 should be receiving, and 1-10 sending, I believe.  Is it possible 
that the fact that all send/recv pairs (ranks 1-3 in each group of 4 sending to 
rank 0, which receives from each one in turn) are using the same tag (200) is 
confusing things?
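
For context, the communication pattern boils down to something like the sketch
below (a stripped-down stand-in, not the actual application code: the split,
the count, and the datatype are placeholders):

/* Minimal sketch of the pattern in question: within each 4-rank split
 * communicator, ranks 1-3 send to rank 0 with the same tag, and rank 0
 * receives from each source in turn. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Placeholder split: groups of 4 consecutive world ranks. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / 4, world_rank, &group);

    int rank, size;
    MPI_Comm_rank(group, &rank);
    MPI_Comm_size(group, &size);

    const int tag = 200;            /* same tag for every pair */
    const int count = 5423600;      /* large message, as in the trace */
    double *buf = calloc(count, sizeof(double));

    if (rank == 0) {
        /* rank 0 receives from each of the other ranks in turn */
        for (int src = 1; src < size; src++)
            MPI_Recv(buf, count, MPI_DOUBLE, src, tag, group,
                     MPI_STATUS_IGNORE);
    } else {
        MPI_Send(buf, count, MPI_DOUBLE, 0, tag, group);
    }

    free(buf);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}

As far as I understand the matching rules, reusing one tag across different
senders should be legal (receives match on communicator, source, and tag, and
messages between a given pair don't overtake each other), but I'd like to rule
out that it interacts badly with the unexpected fragments shown below.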

[compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
[compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
[compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [   ] ctx 5 src 2 tag 200 seq 126 msg_length 86777600
[compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [   ] ctx 5 src 3 tag 200 seq 8557 msg_length 86777600

[compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
[compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174
[compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561
[compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385




Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] mpi send/recv pair hanging

2018-04-06 Thread George Bosilca
Noam,

According to your stack trace, the correct way to call mca_pml_ob1_dump
is with the communicator from the PMPI call. Thus, this call was successful:

(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
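
For reference, the whole sequence is just a handful of gdb commands: attach to
the stuck rank, run "bt", select the PMPI_Recv frame (frame 7 in your backtrace)
to read its comm=... argument, then call the dump with that pointer and a
verbosity of 1. A sketch, with a placeholder PID and the comm pointer taken from
your trace:

$ gdb -p <pid of the hung rank>
(gdb) bt
(gdb) frame 7
(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
(gdb) detach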


I should have been clearer: the output does not appear in gdb but on the output
stream of your application. If you run your application by hand with
mpirun, the output should be on the terminal where you started mpirun. If
you start your job with a batch scheduler, the output should be in the
output file associated with your job.

  George.



On Fri, Apr 6, 2018 at 12:53 PM, Noam Bernstein  wrote:

> On Apr 5, 2018, at 4:11 PM, George Bosilca  wrote:
>
> I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to call our function and output
> internal information about the library's status.
>
>
> OK - after a number of missteps, I recompiled openmpi with debugging mode
> active, reran the executable (didn’t recompile the application, just used the
> new library), and got the comm pointer by attaching to the process and looking
> at the stack trace:
>
> #0  0x2b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256,
> wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
> #1  0x2b8a759a8194 in poll_device (device=0xebc5300, count=0) at
> btl_openib_component.c:3608
> #2  0x2b8a759a871f in progress_one_device (device=0xebc5300) at
> btl_openib_component.c:3741
> #3  0x2b8a759a87be in btl_openib_component_progress () at
> btl_openib_component.c:3765
> #4  0x2b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
> #5  0x2b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at
> ../../../../ompi/request/request.h:392
> #6  0x2b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20,
> count=5423600, datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0,
> status=0x385dd90) at pml_ob1_irecv.c:135
> #7  0x2b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600,
> type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90)
> at precv.c:79
> #8  0x2b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20
> "DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\
> 225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
> count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
> tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c)
> at precv_f.c:85
> #9  0x0042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot
> access memory at address 0x2d
> ) at mpi.F:680
> #10 0x0123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot
> access memory at address 0x2d
> ) at fileio.F:952
> #11 0x02abfd8f in vamp () at main.F:4204
> #12 0x004139de in main ()
> #13 0x003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
> #14 0x004138e9 in _start ()
>
>
> The comm value is different in ompi_recv_f and the frames below it, so I tried
> both.   With the value from the lower-level functions I get nothing useful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> and with the value from ompi_recv_f I get a segfault:
>
> (gdb) call mca_pml_ob1_dump(0x5d30a68, 1)
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x2b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at
> pml_ob1.c:577
> 577opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d
> num_procs %lu last_probed %lu\n",
> The program being debugged was signaled while in a function called from
> GDB.
> GDB remains in the frame where the signal was received.
> To change this behavior use "set unwindonsignal on".
> Evaluation of the expression containing the function
> (mca_pml_ob1_dump) will be abandoned.
> When the function is done executing, GDB will silently stop.
>
> Should this have worked, or am I doing something wrong?
>
> thanks,
> Noam
>
>

Re: [OMPI users] mpi send/recv pair hanging

2018-04-06 Thread Noam Bernstein
> On Apr 5, 2018, at 4:11 PM, George Bosilca  wrote:
> 
> I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)". 
> This allows the debugger to call our function and output internal 
> information about the library's status.

OK - after a number of missteps, I recompiled openmpi with debugging mode 
active, reran the executable (didn’t recompile the application, just used the new 
library), and got the comm pointer by attaching to the process and looking at the 
stack trace:

#0  0x2b8a7599c42b in ibv_poll_cq (cq=0xec66010, num_entries=256, 
wc=0x7ffdea76d680) at /usr/include/infiniband/verbs.h:1272
#1  0x2b8a759a8194 in poll_device (device=0xebc5300, count=0) at 
btl_openib_component.c:3608
#2  0x2b8a759a871f in progress_one_device (device=0xebc5300) at 
btl_openib_component.c:3741
#3  0x2b8a759a87be in btl_openib_component_progress () at 
btl_openib_component.c:3765
#4  0x2b8a64b9da42 in opal_progress () at runtime/opal_progress.c:222
#5  0x2b8a76c2c199 in ompi_request_wait_completion (req=0xec22600) at 
../../../../ompi/request/request.h:392
#6  0x2b8a76c2d642 in mca_pml_ob1_recv (addr=0x2b8a8a99bf20, count=5423600, 
datatype=0x2b8a64832b80, src=1, tag=200, comm=0xed932d0, status=0x385dd90) at 
pml_ob1_irecv.c:135
#7  0x2b8a6454c857 in PMPI_Recv (buf=0x2b8a8a99bf20, count=5423600, 
type=0x2b8a64832b80, source=1, tag=200, comm=0xed932d0, status=0x385dd90) at 
precv.c:79
#8  0x2b8a6428ca7c in ompi_recv_f (buf=0x2b8a8a99bf20 
"DB»\373\v{\277\204\333\336\306[B\205\277\030ҀҶ\250v\277\225\377qW\001\251w?\240\020\202&=)S\277\202+\214\067\224\345R?\272\221Co\236\206\217?",
 count=0x7ffdea770eb4, datatype=0x2d43bec, source=0x7ffdea770a38,
tag=0x2d43bf0, comm=0x5d30a68, status=0x385dd90, ierr=0x7ffdea770a3c) at 
precv_f.c:85
#9  0x0042887b in m_recv_z (comm=..., node=-858993460, zvec=Cannot 
access memory at address 0x2d
) at mpi.F:680
#10 0x0123e0f1 in fileio::outwav (io=..., wdes=..., w=Cannot access 
memory at address 0x2d
) at fileio.F:952
#11 0x02abfd8f in vamp () at main.F:4204
#12 0x004139de in main ()
#13 0x003f0c81ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x004138e9 in _start ()

The comm value is different in ompi_recv_f and the frames below it, so I tried both.
With the value from the lower-level functions I get nothing useful:
(gdb) call mca_pml_ob1_dump(0xed932d0, 1)
$1 = 0
and with the value from ompi_recv_f I get a segfault:
(gdb) call mca_pml_ob1_dump(0x5d30a68, 1)

Program received signal SIGSEGV, Segmentation fault.
0x2b8a76c26d0d in mca_pml_ob1_dump (comm=0x5d30a68, verbose=1) at 
pml_ob1.c:577
577 opal_output(0, "Communicator %s [%p](%d) rank %d recv_seq %d 
num_procs %lu last_probed %lu\n",
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mca_pml_ob1_dump) will be abandoned.
When the function is done executing, GDB will silently stop.

Should this have worked, or am I doing something wrong?


thanks,

Noam


Re: [OMPI users] problem related to ORTE

2018-04-06 Thread Jeff Squyres (jsquyres)
Can you please send all the information listed here:

https://www.open-mpi.org/community/help/

Thanks!


> On Apr 6, 2018, at 8:27 AM, Ankita m  wrote:
> 
> Hello Sir/Madam
> 
> I am Ankita Maity, a PhD scholar from Mechanical Dept., IIT Roorkee, India
> 
> I am facing a problem while submitting a parallel program to the HPC cluster 
> available in our dept. 
> 
> I have attached the error file it produces during the run. 
> 
> Can you please help me with this issue? I would be very grateful to you.
> 
> With Regards 
> 
> ANKITA MAITY
> IIT ROORKEE
> INDIA


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] disabling libraries?

2018-04-06 Thread Ankita m
Thank you so much, sir. I will discuss this with my supervisor and
proceed accordingly.



On Fri, Apr 6, 2018 at 5:42 PM, Michael Di Domenico wrote:

> On Thu, Apr 5, 2018 at 7:59 PM, Gilles Gouaillardet wrote:
> > That being said, the error suggests mca_oob_ud.so is a module from a
> > previous install, Open MPI was not built on the system it is running on,
> > or libibverbs.so.1 was removed after Open MPI was built.
>
> yes, understood, i compiled openmpi on a node that has all the
> libraries installed for our various interconnects, opa/psm/mxm/ib, but
> i ran mpirun on a node that has none of them
>
> so the resulting warnings i get
>
> mca_btl_openib: librdmacm.so.1
> mca_btl_usnic: libfabric.so.1
> mca_oob_ud: libibverbs.so.1
> mca_mtl_mxm: libmxm.so.2
> mca_mtl_ofi: libfabric.so.1
> mca_mtl_psm: libpsm_infinipath.so.1
> mca_mtl_psm2: libpsm2.so.2
> mca_pml_yalla: libmxm.so.2
>
> you referenced them as "errors" above, but mpi actually runs just fine
> for me even with these msgs, so i would consider them more warnings.
>
> > So I do encourage you to take a step back, and think if you can find a
> > better solution for your site.
>
> there are two alternatives
>
> 1 i can compile a specific version of openmpi for each of our clusters
> with each specific interconnect libraries
>
> 2 i can install all the libraries on all the machines regardless of
> whether the interconnect is present
>
> both are certainly plausible, but my effort here is to see if i can
> reduce the size of our software stack and/or reduce the number of
> compiled versions of openmpi
>
> it would be nice if openmpi had (or may already have) a simple switch
> that lets me disable entire portions of the library chain, ie this
> host doesn't have a particular interconnect, so don't load any of the
> libraries.  this might run counter to how openmpi discovers and loads
> libs though.

[OMPI users] problem related to ORTE

2018-04-06 Thread Ankita m
Hello Sir/Madam

I am Ankita Maity, a PhD scholar from Mechanical Dept., IIT Roorkee, India

I am facing a problem while submitting a parallel program to the HPC
cluster available in our dept.

I have attached the error file it produces during the run.

Can you please help me with this issue? I would be very grateful to you.

With Regards

ANKITA MAITY
IIT ROORKEE
INDIA


[Attachment: cgles.err (binary data)]

Re: [OMPI users] disabling libraries?

2018-04-06 Thread Michael Di Domenico
On Thu, Apr 5, 2018 at 7:59 PM, Gilles Gouaillardet wrote:
> That being said, the error suggests mca_oob_ud.so is a module from a
> previous install, Open MPI was not built on the system it is running on,
> or libibverbs.so.1 was removed after Open MPI was built.

yes, understood, i compiled openmpi on a node that has all the
libraries installed for our various interconnects, opa/psm/mxm/ib, but
i ran mpirun on a node that has none of them

so the resulting warnings i get

mca_btl_openib: librdmacm.so.1
mca_btl_usnic: libfabric.so.1
mca_oob_ud: libibverbs.so.1
mca_mtl_mxm: libmxm.so.2
mca_mtl_ofi: libfabric.so.1
mca_mtl_psm: libpsm_infinipath.so.1
mca_mtl_psm2: libpsm2.so.2
mca_pml_yalla: libmxm.so.2

you referenced them as "errors" above, but mpi actually runs just fine
for me even with these msgs, so i would consider them more warnings.

> So I do encourage you to take a step back, and think if you can find a
> better solution for your site.

there are two alternatives

1 i can compile a specific version of openmpi for each of our clusters
with each specific interconnect libraries

2 i can install all the libraries on all the machines regardless of
whether the interconnect is present

both are certainly plausible, but my effort here is to see if i can
reduce the size of our software stack and/or reduce the number of
compiled versions of openmpi

it would be nice if openmpi had (or may already have) a simple switch
that lets me disable entire portions of the library chain, ie this
host doesn't have a particular interconnect, so don't load any of the
libraries.  this might run counter to how openmpi discovers and loads
libs though.
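
fwiw, something along these lines is roughly what i have in mind -- if i
understand the existing mca selection syntax correctly, the "^" exclusion
notation plus the show_load_errors knob get most of the way there (the
component names below are just the ones from my warning list above, and i
haven't verified this exact set on our stack):

# per run, on the command line
mpirun --mca btl ^openib,usnic --mca mtl ^mxm,ofi,psm,psm2 -np 16 ./a.out

# or per host, via the environment (or openmpi-mca-params.conf)
export OMPI_MCA_btl=^openib,usnic
export OMPI_MCA_mtl=^mxm,ofi,psm,psm2

# and to quiet the "unable to open" messages themselves
export OMPI_MCA_mca_base_component_show_load_errors=0

that wouldn't shrink the install, but it should keep the excluded components
from being opened on hosts that don't have the underlying libraries.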