Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem,

thanks for the feedback.

I committed the patch to the trunk (r31922).

As I indicated in the commit log, this patch is likely suboptimal and has
room for improvement.
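For reference, the workaround discussed in this thread is a single MCA parameter. A minimal sketch of the two equivalent ways to apply it (`hellompi` is the test binary named in the logs below; the mpirun line is commented out because it needs a live cluster):

```shell
# Disable eager RDMA in the openib BTL for every subsequent MPI launch
# from this shell:
export OMPI_MCA_btl_openib_use_eager_rdma=0

# Equivalent one-shot form on the mpirun command line:
#   mpirun --mca btl_openib_use_eager_rdma 0 ./hellompi

echo "btl_openib_use_eager_rdma=$OMPI_MCA_btl_openib_use_eager_rdma"
```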

Jeff commented on the usNIC-related issue, so I will wait for a fix from
the Cisco folks.

Cheers,

Gilles



On Sun, Jun 1, 2014 at 10:12 PM, Artem Polyakov  wrote:

>
> I tested your approach. Both:
> a) export OMPI_MCA_btl_openib_use_eager_rdma=0
> b) applying your patch and running without "export
> OMPI_MCA_btl_openib_use_eager_rdma=0"
> work well for me.
> This fixes the first part of the problem: when OMPI_MCA_btl="openib,self"
>
> However, once I comment out this statement, thus giving OMPI the right to
> decide which BTL to use, the program hangs again. Here is additional
> information that may be useful:
>
> 1. If I set 1 slot per node this problem doesn't arise.
>
> 2. If I use at least 2 cores per node I can see this hang.
> Here are the backtraces for all ranks of the hung program:
>
>


Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Jeff Squyres (jsquyres)
This should also be fixed when we stop firing up the usnic connectivity checker 
when there are no usNICs present.
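Until such a fix lands, one user-side sketch of a workaround (not proposed in the thread itself, but standard OMPI MCA component-selection syntax) is to exclude the usnic component outright, so its connectivity checker never starts:

```shell
# "^" in an MCA component list means "everything except the listed
# components": use every available BTL except usnic.
export OMPI_MCA_btl="^usnic"

# Equivalent mpirun form:
#   mpirun --mca btl ^usnic ./hellompi

echo "btl selection: $OMPI_MCA_btl"
```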

On Jun 1, 2014, at 9:12 AM, Artem Polyakov  wrote:

> 
> 2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet 
> :
> export OMPI_MCA_btl_openib_use_eager_rdma=0
> 
> Gilles,
> 
> I tested your approach. Both:
> a) export OMPI_MCA_btl_openib_use_eager_rdma=0
> b) applying your patch and running without "export
> OMPI_MCA_btl_openib_use_eager_rdma=0"
> work well for me.
> This fixes the first part of the problem: when OMPI_MCA_btl="openib,self"
>
> However, once I comment out this statement, thus giving OMPI the right to
> decide which BTL to use, the program hangs again. Here is additional
> information that may be useful:
>
> 1. If I set 1 slot per node this problem doesn't arise.
>
> 2. If I use at least 2 cores per node I can see this hang.
> Here are the backtraces for all ranks of the hung program:
> 
> rank = 0
> (gdb) bt
> #0  0x0039522df343 in poll () from /lib64/libc.so.6
> #1  0x7f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x7f1e4faf601c in opal_libevent2021_event_base_loop (base=0x13973b0, 
> flags=3)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3  0x7f1e4fa9870a in opal_progress () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4  0x7f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158, 
> requested=0, provided=0x7fff2595cfc8)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5  0x7f1e500f425e in PMPI_Init (argc=0x7fff2595d02c, 
> argv=0x7fff2595d020) at pinit.c:84
> #6  0x00400a6e in main ()
> 
> rank = 1
> (gdb) bt
> #0  0x0039522accdd in nanosleep () from /lib64/libc.so.6
> #1  0x0039522e1e54 in usleep () from /lib64/libc.so.6
> > GOTCHA 
> #2  0x7fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
> #3  0x7fae7a6a4b72 in usnic_component_init 
> (num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false, 
> want_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
> #4  0x7fae7ed9958f in mca_btl_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
> < GOTCHA 
> #5  0x7fae7b5e6b48 in mca_bml_r2_component_init (priority=0x7fffc0a67d84, 
> enable_progress_threads=false, enable_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
> #6  0x7fae7ed98362 in mca_bml_base_init (enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
> #7  0x7fae79e2dcb5 in mca_pml_ob1_component_init 
> (priority=0x7fffc0a67eb0, enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #8  0x7fae7edc0251 in mca_pml_base_select (enable_progress_threads=false, 
> enable_mpi_threads=false)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
> #9  0x7fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8, 
> requested=0, provided=0x7fffc0a68038)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
> #10 0x7fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c, 
> argv=0x7fffc0a68090) at pinit.c:84
> #11 0x00400a6e in main ()
> 
> rank=2
> (gdb) bt
> #0  0x0038e38df343 in poll () from /lib64/libc.so.6
> #1  0x7fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
> #2  0x7fa40340801c in opal_libevent2021_event_base_loop (base=0x25e33b0, 
> flags=3)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
> #3  0x7fa4033aa70a in opal_progress () at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
> #4  0x7fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8, 
> requested=0, provided=0x7fff1a081d58)
> at 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
> #5  0x7fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc, 
> argv=0x7fff1a081db0) at pinit.c:84
> #6  0x00400a6e in main 

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Jeff Squyres (jsquyres)
Ah -- I missed the attachment; I only looked at your email text.

I'll have a look now...

auto-failure: Ah, I found this late last week and sent a fix around internally 
for review.  Should have something soon for trunk/v1.8.

If you care: we accidentally still fire up the usnic connectivity checker even 
if there are no usNICs present.



On Jun 1, 2014, at 8:33 AM, Artem Polyakov  wrote:

> Hello, Jeff.
> 
> Please check the attached tar ("auto-failure" dir). There I've seen the
> following message:
> --
> An internal error has occurred in the Open MPI usNIC BTL.  This is
> highly unusual and shouldn't happen.  It suggests that there may be
> something wrong with the usNIC or OpenFabrics configuration on this
> server.
>
>   Server:   cn5
>   Message:  usnic connectivity client IPC connect read failed
>   File:     /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c
>   Line:     125
>   Error:    Operation not permitted
> --
> 
> And I was surprised because, as I've said, we don't use Cisco hardware. My
> guess is that it could be a problem in the query function. But I think this
> shows that the usnic BTL somehow participates in the computation.
> 
> 
> 2014-06-01 19:20 GMT+07:00 Jeff Squyres (jsquyres) :
> Just to be clear: it looks like you haven't seen any errors from the usnic 
> BTL, right?  (the Cisco VIC uses the usnic BTL only -- it does not use the 
> openib BTL)
> 
> 
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov  wrote:
> 
> > Hello, while testing a new PMI implementation I faced a problem with OpenIB
> > and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> > thus no Cisco Virtual Interface Card. To exclude the possibility of
> > influence from the new PMI code I used mpirun to launch the job. The Slurm
> > job script is attached.
> >
> > While investigating the problem I found the following:
> > 1. With TCP btl everything works without errors (add export 
> > OMPI_MCA_btl="tcp,self" in attached batch script).
> >
> > 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
> > the attached batch script) I get the following error:
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: 
> > /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> >  mca_btl_openib_del_procs: Assertion 
> > `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> >
> > Complete logs are tar-ed, check "openib-failure" directory.
> >
> > 3. If I do not fix the 

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Artem Polyakov
2014-06-01 14:24 GMT+07:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:

> export OMPI_MCA_btl_openib_use_eager_rdma=0


Gilles,

I tested your approach. Both:
a) export OMPI_MCA_btl_openib_use_eager_rdma=0
b) applying your patch and running without "export
OMPI_MCA_btl_openib_use_eager_rdma=0"
work well for me.
This fixes the first part of the problem: when OMPI_MCA_btl="openib,self"

However, once I comment out this statement, thus giving OMPI the right to
decide which BTL to use, the program hangs again. Here is additional
information that may be useful:

1. If I set 1 slot per node this problem doesn't arise.

2. If I use at least 2 cores per node I can see this hang.
Here are the backtraces for all ranks of the hung program:

rank = 0
(gdb) bt
#0  0x0039522df343 in poll () from /lib64/libc.so.6
#1  0x7f1e4fb01605 in poll_dispatch (base=0x13973b0, tv=0x7fff2595ce50)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x7f1e4faf601c in opal_libevent2021_event_base_loop
(base=0x13973b0, flags=3)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x7f1e4fa9870a in opal_progress () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x7f1e500beb51 in ompi_mpi_init (argc=1, argv=0x7fff2595d158,
requested=0, provided=0x7fff2595cfc8)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x7f1e500f425e in PMPI_Init (argc=0x7fff2595d02c,
argv=0x7fff2595d020) at pinit.c:84
#6  0x00400a6e in main ()

rank = 1
(gdb) bt
#0  0x0039522accdd in nanosleep () from /lib64/libc.so.6
#1  0x0039522e1e54 in usleep () from /lib64/libc.so.6
> GOTCHA
#2  0x7fae7a6a7f4d in ompi_btl_usnic_connectivity_client_init () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c:92
#3  0x7fae7a6a4b72 in usnic_component_init
(num_btl_modules=0x7fffc0a67cc8, want_progress_threads=false,
want_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_component.c:461
#4  0x7fae7ed9958f in mca_btl_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/base/btl_base_select.c:113
< GOTCHA
#5  0x7fae7b5e6b48 in mca_bml_r2_component_init
(priority=0x7fffc0a67d84, enable_progress_threads=false,
enable_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/r2/bml_r2_component.c:88
#6  0x7fae7ed98362 in mca_bml_base_init (enable_progress_threads=false,
enable_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/bml/base/bml_base_init.c:69
#7  0x7fae79e2dcb5 in mca_pml_ob1_component_init
(priority=0x7fffc0a67eb0, enable_progress_threads=false,
enable_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/ob1/pml_ob1_component.c:271
#8  0x7fae7edc0251 in mca_pml_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/pml/base/pml_base_select.c:127
#9  0x7fae7ed2b9e9 in ompi_mpi_init (argc=1, argv=0x7fffc0a681c8,
requested=0, provided=0x7fffc0a68038)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:611
#10 0x7fae7ed6125e in PMPI_Init (argc=0x7fffc0a6809c,
argv=0x7fffc0a68090) at pinit.c:84
#11 0x00400a6e in main ()

rank=2
(gdb) bt
#0  0x0038e38df343 in poll () from /lib64/libc.so.6
#1  0x7fa403413605 in poll_dispatch (base=0x25e33b0, tv=0x7fff1a081be0)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x7fa40340801c in opal_libevent2021_event_base_loop
(base=0x25e33b0, flags=3)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/event.c:1631
#3  0x7fa4033aa70a in opal_progress () at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/runtime/opal_progress.c:169
#4  0x7fa4039d0b51 in ompi_mpi_init (argc=1, argv=0x7fff1a081ee8,
requested=0, provided=0x7fff1a081d58)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/runtime/ompi_mpi_init.c:641
#5  0x7fa403a0625e in PMPI_Init (argc=0x7fff1a081dbc,
argv=0x7fff1a081db0) at pinit.c:84
#6  0x00400a6e in main ()


rank=3
(gdb) bt
#0  0x0038e38df343 in poll () from /lib64/libc.so.6
#1  0x7f1ad8de7605 in poll_dispatch (base=0x21a73b0, tv=0x7fff0fa9f7f0)
at
/home/research/artpol/ompi_dev//ompi-trunk_r31907/opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x7f1ad8ddc01c in opal_libevent2021_event_base_loop
(base=0x21a73b0, flags=3)
at
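Backtraces like the ones above can be collected from every hung rank non-interactively. A sketch, assuming gdb is installed on the compute nodes and the binary is named hellompi as in the logs:

```shell
# Attach gdb in batch mode to each running rank of the hung job on this
# node and dump a full backtrace, then report how many were scanned.
BIN=hellompi
PIDS=$(pgrep -x "$BIN" || true)
for pid in $PIDS; do
    echo "=== $(hostname) PID $pid ==="
    gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null
done
echo "scanned $(echo $PIDS | wc -w) process(es)"
```

Run once per node of the job (e.g. via srun or pdsh) to cover all ranks.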

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Artem Polyakov
Hello, Jeff.

Please check the attached tar ("auto-failure" dir). There I've seen the
following message:
--
An internal error has occurred in the Open MPI usNIC BTL.  This is
highly unusual and shouldn't happen.  It suggests that there may be
something wrong with the usNIC or OpenFabrics configuration on this
server.

  Server:   cn5
  Message:  usnic connectivity client IPC connect read failed
  File:     /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c
  Line:     125
  Error:    Operation not permitted
--

And I was surprised because, as I've said, we don't use Cisco hardware. My
guess is that it could be a problem in the query function. But I think this
shows that the usnic BTL somehow participates in the computation.


2014-06-01 19:20 GMT+07:00 Jeff Squyres (jsquyres) :

> Just to be clear: it looks like you haven't seen any errors from the usnic
> BTL, right?  (the Cisco VIC uses the usnic BTL only -- it does not use the
> openib BTL)
>
>
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov  wrote:
>
> > Hello, while testing a new PMI implementation I faced a problem with
> OpenIB and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> thus no Cisco Virtual Interface Card. To exclude the possibility of
> influence from the new PMI code I used mpirun to launch the job. The Slurm
> job script is attached.
> >
> > While investigating the problem I found the following:
> > 1. With TCP btl everything works without errors (add export
> OMPI_MCA_btl="tcp,self" in attached batch script).
> >
> > 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
> the attached batch script) I get the following error:
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> >
> > Complete logs are tar-ed, check "openib-failure" directory.
> >
> > 3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I can
> get either an immediate failure complaining about usNIC/OpenIB problems OR
> the program hangs.
> > For both cases I'm attaching complete tar-ed logs. Check "auto-failure"
> dir for ompi stdout and stderr and "auto-hang" for the hang case.
> >
> > I am ready to provide additional info or help with testing, but I have no
> time to track down the problem myself in the next several days.
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> >
> ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14922.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> 

Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Jeff Squyres (jsquyres)
Just to be clear: it looks like you haven't seen any errors from the usnic BTL, 
right?  (the Cisco VIC uses the usnic BTL only -- it does not use the openib 
BTL)


On Jun 1, 2014, at 2:57 AM, Artem Polyakov  wrote:

> Hello, while testing a new PMI implementation I faced a problem with OpenIB
> and/or usNIC support.
> The cluster I use is built on Mellanox QDR. We don't use Cisco hardware, thus
> no Cisco Virtual Interface Card. To exclude the possibility of influence from
> the new PMI code I used mpirun to launch the job. The Slurm job script is attached.
> 
> While investigating the problem I found the following:
> 1. With TCP btl everything works without errors (add export 
> OMPI_MCA_btl="tcp,self" in attached batch script).
> 
> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
> the attached batch script) I get the following error:
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi: 
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>  mca_btl_openib_del_procs: Assertion 
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> 
> Complete logs are tar-ed, check "openib-failure" directory.
> 
> 3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I can get
> either an immediate failure complaining about usNIC/OpenIB problems OR the
> program hangs.
> For both cases I'm attaching complete tar-ed logs. Check "auto-failure" dir 
> for ompi stdout and stderr and "auto-hang" for the hang case.
> 
> I am ready to provide additional info or help with testing, but I have no
> time to track down the problem myself in the next several days.
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14922.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Artem Polyakov
I think I can do that.

On Sunday, June 1, 2014, Gilles Gouaillardet wrote:

> Artem,
>
> this looks like the issue initially reported by Rolf
> http://www.open-mpi.org/community/lists/devel/2014/05/14836.php
>
> in http://www.open-mpi.org/community/lists/devel/2014/05/14839.php
> I posted a patch and a workaround:
> export OMPI_MCA_btl_openib_use_eager_rdma=0
>
> I do not recall committing the patch (Nathan is reviewing it) to the trunk.
>
> If you have a chance to test it and it works, I'll commit it tomorrow.
>
> Cheers,
>
> Gilles
>
>
>
> On Sun, Jun 1, 2014 at 3:57 PM, Artem Polyakov  > wrote:
>
>>
>> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
>> the attached batch script) I get the following error:
>> hellompi:
>> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
>> mca_btl_openib_del_procs: Assertion
>> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
>>
>>

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Gilles Gouaillardet
Artem,

this looks like the issue initially reported by Rolf
http://www.open-mpi.org/community/lists/devel/2014/05/14836.php

in http://www.open-mpi.org/community/lists/devel/2014/05/14839.php
I posted a patch and a workaround:
export OMPI_MCA_btl_openib_use_eager_rdma=0

I do not recall committing the patch (Nathan is reviewing it) to the trunk.

If you have a chance to test it and it works, I'll commit it tomorrow.

Cheers,

Gilles



On Sun, Jun 1, 2014 at 3:57 PM, Artem Polyakov  wrote:

>
> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
> the attached batch script) I get the following error:
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
>
>


Re: [OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Artem Polyakov
P.S.

1. Just to make sure, I tried the same program with the old ompi-1.6.5 that is
installed on our cluster; it ran without any problem.
2. My testing program just sends data around a ring.


2014-06-01 13:57 GMT+07:00 Artem Polyakov :

> Hello, while testing a new PMI implementation I faced a problem with OpenIB
> and/or usNIC support.
> The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> thus no Cisco Virtual Interface Card. To exclude the possibility of
> influence from the new PMI code I used mpirun to launch the job. The Slurm
> job script is attached.
>
> While investigating the problem I found the following:
> 1. With TCP btl everything works without errors (add export
> OMPI_MCA_btl="tcp,self" in attached batch script).
>
> 2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
> the attached batch script) I get the following error:
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> hellompi:
> /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
>
> Complete logs are tar-ed, check "openib-failure" directory.
>
> 3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I can
> get either an immediate failure complaining about usNIC/OpenIB problems OR
> the program hangs.
> For both cases I'm attaching complete tar-ed logs. Check "auto-failure"
> dir for ompi stdout and stderr and "auto-hang" for the hang case.
>
> I am ready to provide additional info or help with testing, but I have no
> time to track down the problem myself in the next several days.
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


[OMPI devel] OpenIB/usNIC errors

2014-06-01 Thread Artem Polyakov
Hello, while testing a new PMI implementation I faced a problem with OpenIB
and/or usNIC support.
The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
thus no Cisco Virtual Interface Card. To exclude the possibility of
influence from the new PMI code I used mpirun to launch the job. The Slurm
job script is attached.

While investigating the problem I found the following:
1. With TCP btl everything works without errors (add export
OMPI_MCA_btl="tcp,self" in attached batch script).

2. With the BTL fixed to OpenIB (add export OMPI_MCA_btl="openib,self" in
the attached batch script) I get the following error:
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
hellompi:
/home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151:
mca_btl_openib_del_procs: Assertion
`((opal_object_t*)endpoint)->obj_reference_count == 1' failed.

Complete logs are tar-ed, check "openib-failure" directory.

3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I can
get either an immediate failure complaining about usNIC/OpenIB problems OR
the program hangs.
For both cases I'm attaching complete tar-ed logs. Check "auto-failure" dir
for ompi stdout and stderr and "auto-hang" for the hang case.

I am ready to provide additional info or help with testing, but I have no
time to track down the problem myself in the next several days.

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


task_mpirun.job
Description: Binary data


usnic-openib-faults.tar.bz2
Description: BZip2 compressed data