Re: [OMPI devel] seg fault when using yalla, XRC, and yalla

2016-04-24 Thread Alina Sklarevich
Hi,

When the segmentation fault happens, I get the following trace:

(gdb) bt
#0  0x7fffee4f007d in ibv_close_xrcd (xrcd=0x2) at
/usr/include/infiniband/verbs.h:1227
#1  0x7fffee4f055f in mca_btl_openib_close_xrc_domain (device=0xfb20c0)
at btl_openib_xrc.c:104
#2  0x7fffee4da073 in device_destruct (device=0xfb20c0) at
btl_openib_component.c:978
#3  0x7fffee4ce9f7 in opal_obj_run_destructors (object=0xfb20c0) at
../../../../opal/class/opal_object.h:460
#4  0x7fffee4d4f82 in mca_btl_openib_finalize_resources (btl=0xfbbc40)
at btl_openib.c:1703
#5  0x7fffee4d511c in mca_btl_openib_finalize (btl=0xfbbc40) at
btl_openib.c:1730
#6  0x776b26d6 in mca_btl_base_close () at base/btl_base_frame.c:192
#7  0x7769c73d in mca_base_framework_close
(framework=0x7795bda0) at mca_base_framework.c:214
#8  0x77d448e2 in mca_bml_base_close () at base/bml_base_frame.c:130
#9  0x7769c73d in mca_base_framework_close
(framework=0x77fe4f00) at mca_base_framework.c:214
#10 0x77cd4d18 in ompi_mpi_finalize () at
runtime/ompi_mpi_finalize.c:415
#11 0x77cfee0b in PMPI_Finalize () at pfinalize.c:47
#12 0x00400880 in main ()

Looks like the problem originates in the openib btl finalize flow (when
openib wasn't chosen for the run). As David mentioned, it doesn't happen
when ob1 is specified on the command line: the openib btl behaves
differently in these two cases, specifically in
mca_btl_openib_finalize_resources.
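
For illustration only, here is a minimal sketch of the kind of guard that
would prevent this crash, assuming the XRC domain handle stays NULL when
the domain was never opened (the function and its shape are hypothetical,
not the actual btl_openib_xrc.c code):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Hypothetical guard for the XRC-domain close path: only call
     * ibv_close_xrcd() on a domain that was actually opened. The trace
     * above shows it being invoked with xrcd=0x2, i.e. a bogus handle. */
    static int close_xrc_domain_guarded(struct ibv_xrcd **xrcd)
    {
        if (NULL == *xrcd) {
            return 0;  /* never opened -- nothing to release */
        }
        if (0 != ibv_close_xrcd(*xrcd)) {
            return -1; /* close failed */
        }
        *xrcd = NULL;  /* avoid a double close on a repeated finalize */
        return 0;
    }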

When pml yalla is specified on the command line, this flow isn't invoked
at all, so in that case the segv doesn't happen either.

Thanks,
Alina.

On Thu, Apr 21, 2016 at 6:55 PM, Nathan Hjelm  wrote:

>
> In 1.10.x it is possible for the BTLs to be in use by either ob1 or an
> oshmem component. In 2.x one-sided components can also use BTLs. The MTL
> interface does not provide support for accessing hardware atomics and
> RDMA. As for UD, it stands for Unreliable Datagram. Its use gets better
> message rates for small messages but really hurts bandwidth. Our
> applications are bandwidth bound, not message rate bound, so we should
> be using XRC, not UD.
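>
> For reference, enabling XRC in the openib btl goes through the
> btl_openib_receive_queues MCA parameter. Something along these lines
> should do it (the queue specification below mirrors the commonly cited
> XRC example; treat it as illustrative, not a tuned configuration):
>
>     mpirun --mca pml ob1 \
>            --mca btl_openib_receive_queues \
>            X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 \
>            ./a.out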
>
> -Nathan
>
> On Thu, Apr 21, 2016 at 09:33:06AM -0600, David Shrader wrote:
> >Hey Nathan,
> >
> >I thought only one pml could be loaded at a time, and the only pml that
> >could use BTLs was ob1. If that is the case, how can the openib btl run
> >at the same time as cm and yalla?
> >
> >Also, what is UD?
> >
> >Thanks,
> >David
> >
> >On 04/21/2016 09:25 AM, Nathan Hjelm wrote:
> >
> >  The openib btl should be able to run alongside cm/mxm or yalla. If I
> >  have time this weekend I will get on the mustang and see what the
> >  problem is. The best answer is to change the openmpi-mca-params.conf in
> >  the install to have pml = ob1. I have seen little to no benefit from
> >  using MXM on mustang. In fact, the default configuration (which uses UD)
> >  gets terrible bandwidth.
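> >
> >  Concretely, that's a one-line change; for a default install the file
> >  is $prefix/etc/openmpi-mca-params.conf (adjust the path to your actual
> >  install prefix):
> >
> >      # make ob1 win pml selection so MXM/yalla is never used
> >      pml = ob1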
> >
> >  -Nathan
> >
> >  On Thu, Apr 21, 2016 at 01:48:46PM +0300, Alina Sklarevich wrote:
> >
> > David, thanks for the info you provided.
> > I will try to dig in further to see what might be causing this issue.
> > In the meantime, maybe Nathan can please comment about the openib btl
> > behavior here?
> > Thanks,
> > Alina.
> > On Wed, Apr 20, 2016 at 8:01 PM, David Shrader wrote:
> >
> >   Hello Alina,
> >
> >   Thank you for the information about how the pml components work. I knew
> >   that the other components were being opened and ultimately closed in
> >   favor of yalla, but I didn't realize that initial open would cause a
> >   persistent change in the ompi runtime.
> >
> >   Here's the information you requested about the ib network:
> >
> >   - MOFED version:
> >   We are using the Open Fabrics Software as bundled by RedHat, and my IB
> >   network folks say we're running something close to v1.5.4
> >   - ibv_devinfo:
> >   [dshrader@mu0001 examples]$ ibv_devinfo
> >   hca_id: mlx4_0
> >   transport:  InfiniBand (0)
> >   fw_ver: 2.9.1000
> >   node_guid:  0025:90ff:ff16:78d8
> >   sys_image_guid: 0025:90ff:ff16:78db
> >   vendor_id:  0x02c9
> >   vendor_part_id: 26428
> >   hw_ver: 0xB0
> >   board_id:   SM_212101000
> >   phys_port_cnt:  1
> >   port:   1
> >   state:  PORT_ACTIVE (4)
> >   max_mtu:4096 (5)
> >   active_mtu: 4096 (5)
> >   sm_lid: 250
> >   port_lid

[OMPI devel] psm mtl and no link

2016-04-24 Thread Gilles Gouaillardet

Folks,

This is a follow-up on a question initially posted on the users ML at 
http://www.open-mpi.org/community/lists/users/2016/04/29018.php.


In this environment, there is no link on the InfiniPath card.

However, it seems the psm mtl tries to use the card anyway instead of
disqualifying itself at a very early stage.


I have no access to such hardware, so I cannot investigate this myself.

Could someone please have a look at this and comment?
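
In case it helps whoever picks this up, the early check I have in mind
would look roughly like the sketch below. It uses plain verbs calls to
see whether any port on any HCA is ACTIVE; the real psm mtl would
presumably use the PSM API instead, so treat this as illustrative only:

    #include <stdbool.h>
    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Return true if at least one port on at least one HCA is ACTIVE.
     * A component could disqualify itself when this returns false,
     * before trying to open any endpoint. */
    static bool any_port_active(void)
    {
        int num_devices = 0;
        struct ibv_device **devices = ibv_get_device_list(&num_devices);
        bool active = false;

        for (int i = 0; i < num_devices && !active; i++) {
            struct ibv_context *ctx = ibv_open_device(devices[i]);
            if (NULL == ctx) {
                continue; /* cannot open this HCA, try the next one */
            }
            struct ibv_device_attr dev_attr;
            if (0 == ibv_query_device(ctx, &dev_attr)) {
                for (uint8_t port = 1; port <= dev_attr.phys_port_cnt; port++) {
                    struct ibv_port_attr port_attr;
                    if (0 == ibv_query_port(ctx, port, &port_attr) &&
                        IBV_PORT_ACTIVE == port_attr.state) {
                        active = true; /* found a port with a link up */
                        break;
                    }
                }
            }
            ibv_close_device(ctx);
        }
        if (NULL != devices) {
            ibv_free_device_list(devices);
        }
        return active;
    }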

Cheers,

Gilles