Thanks, Nathan. After a bit more investigation yesterday, we reached the
same conclusion: this is a longstanding bug in the OpenIB BTL, and we only
started triggering the broken flow because of some recent changes to the
default max_lmc parameter. Let us know if you need anything from our end.

Josh
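
For readers following the thread: the assertion in the quoted trace,
`0 == item->opal_list_item_refcount` in `_opal_list_append`, reads as a debug
check that an item is not already on a list when it is appended. The sketch
below is a toy model of that invariant only, not Open MPI code. Every name in
it (`toy_item_t`, `toy_list_append`, and so on) is hypothetical, and it assumes
the counter simply tracks list membership. Which lists the openib BTL actually
touches in the broken sendi/free flow is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy list item. list_refcount is a hypothetical stand-in for the debug
 * counter named in the assertion: 0 = on no list, 1 = on a list. */
typedef struct toy_item {
    struct toy_item *next;
    int list_refcount;
} toy_item_t;

/* Minimal singly linked list with head and tail pointers. */
typedef struct toy_list {
    toy_item_t *head;
    toy_item_t *tail;
    size_t length;
} toy_list_t;

static void toy_list_init(toy_list_t *list)
{
    list->head = NULL;
    list->tail = NULL;
    list->length = 0;
}

/* Append an item. Returns 0 on success, or -1 if the item is still on a
 * list. A debug build would typically assert on this condition instead of
 * returning an error; the error code keeps the sketch easy to exercise. */
static int toy_list_append(toy_list_t *list, toy_item_t *item)
{
    if (0 != item->list_refcount) {
        return -1;
    }
    item->next = NULL;
    if (NULL != list->tail) {
        list->tail->next = item;
    } else {
        list->head = item;
    }
    list->tail = item;
    item->list_refcount = 1;
    list->length++;
    return 0;
}

/* Remove and return the first item, or NULL if the list is empty.
 * Removal is what resets the membership counter to 0. */
static toy_item_t *toy_list_remove_first(toy_list_t *list)
{
    toy_item_t *item = list->head;
    if (NULL == item) {
        return NULL;
    }
    assert(1 == item->list_refcount);
    list->head = item->next;
    if (NULL == list->head) {
        list->tail = NULL;
    }
    item->next = NULL;
    item->list_refcount = 0;
    list->length--;
    return item;
}
```

In this model, appending a fragment to a second list while it is still queued
on the first returns -1. Removing it from the first list beforehand makes the
same append succeed, which is the kind of bookkeeping a free path would have
to perform.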

On Mon, Nov 3, 2014 at 6:03 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

>
> I see the problem. The openib btl does not properly handle the following
> call sequence (this is an openib btl bug IMHO):
>
> btl_sendi (..., &descriptor);
> btl_free (..., descriptor);
>
> The bug is in the message coalescing code and it looks like extra logic
> needs to be added to the openib btl's btl_free function for this to work
> properly. I am working on a fix now.
>
> -Nathan
>
> On Mon, Nov 03, 2014 at 04:26:10PM +0200, Alina Sklarevich wrote:
> >    Hi,
> >    On 1.8.4rc1 we observe the following assert in the osu_mbw_mr test when
> >    using the openib BTL.
> >    When compiled in production mode (i.e. no --enable-debug) the test simply
> >    hangs.
> >    When using either the tcp BTL or the cm PML, the benchmark completes
> >    without error.
> >    The command line to reproduce this is:
> >    $ mpirun --bind-to core -display-map -mca btl_openib_if_include mlx5_0:1
> >    -np 2 -mca pml ob1 -mca btl openib,self,sm ./osu_mbw_mr
> >    # OSU MPI Multiple Bandwidth / Message Rate Test v4.4
> >    # [ pairs: 1 ] [ window size: 64 ]
> >    # Size                  MB/s        Messages/s
> >    osu_mbw_mr: ../../../../opal/class/opal_list.h:547: _opal_list_append:
> >    Assertion `0 == item->opal_list_item_refcount' failed.
> >    [vegas15:30395] *** Process received signal ***
> >    [vegas15:30395] Signal: Aborted (6)
> >    [vegas15:30395] Signal code:  (-6)
> >    [vegas15:30395] [ 0] /lib64/libpthread.so.0[0x30bc40f500]
> >    [vegas15:30395] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x30bc0328a5]
> >    [vegas15:30395] [ 2] /lib64/libc.so.6(abort+0x175)[0x30bc034085]
> >    [vegas15:30395] [ 3] /lib64/libc.so.6[0x30bc02ba1e]
> >    [vegas15:30395] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x30bc02bae0]
> >    [vegas15:30395] [ 5] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(+0x9087)[0x7ffff3f70087]
> >    [vegas15:30395] [ 6] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x403)[0x7ffff3f754b3]
> >    [vegas15:30395] [ 7] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0xf9e)[0x7ffff3f785b4]
> >    [vegas15:30395] [ 8] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xed08)[0x7ffff3308d08]
> >    [vegas15:30395] [ 9] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xf8ba)[0x7ffff33098ba]
> >    [vegas15:30395] [10] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x108)[0x7ffff3309a1f]
> >    [vegas15:30395] [11] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/libmpi.so.1(MPI_Isend+0x2ec)[0x7ffff7cff5e8]
> >    [vegas15:30395] [12] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400fa4]
> >    [vegas15:30395] [13] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x40167d]
> >    [vegas15:30395] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30bc01ecdd]
> >    [vegas15:30395] [15] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400db9]
> >    [vegas15:30395] *** End of error message ***
> >    --------------------------------------------------------------------------
> >    mpirun noticed that process rank 0 with PID 30395 on node vegas15 exited
> >    on signal 6 (Aborted).
> >    --------------------------------------------------------------------------
> >    Thanks,
> >    Alina.
>
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2014/11/16142.php
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16159.php
>
